
THE HANDBOOK OF RESEARCH SYNTHESIS AND META-ANALYSIS

2ND EDITION

Edited by

HARRIS COOPER, LARRY V. HEDGES, AND JEFFREY C. VALENTINE

RUSSELL SAGE FOUNDATION, NEW YORK


THE RUSSELL SAGE FOUNDATION

The Russell Sage Foundation, one of the oldest of America's general purpose foundations, was established in 1907 by Mrs. Margaret Olivia Sage for the improvement of social and living conditions in the United States. The Foundation seeks to fulfill this mandate by fostering the development and dissemination of knowledge about the country's political, social, and economic problems. While the Foundation endeavors to assure the accuracy and objectivity of each book it publishes, the conclusions and interpretations in Russell Sage Foundation publications are those of the authors and not of the Foundation, its Trustees, or its staff. Publication by Russell Sage, therefore, does not imply Foundation endorsement.

BOARD OF TRUSTEES
Mary C. Waters, Chair

Kenneth D. Brody
W. Bowman Cutter, III
Christopher Edley Jr.
John A. Ferejohn
Larry V. Hedges
Kathleen Hall Jamieson
Melvin J. Konner
Alan B. Krueger
Cora B. Marrett
Nancy Rosenblum
Shelley E. Taylor
Richard H. Thaler
Eric Wanner

Library of Congress Cataloging-in-Publication Data

The handbook of research synthesis and meta-analysis / Harris Cooper, Larry V. Hedges, Jeffrey C. Valentine, editors. -- 2nd ed.
    p. cm.
Rev. ed. of: The handbook of research synthesis. c1994.
Includes bibliographical references and index.
ISBN 978-0-87154-163-5
1. Research -- Methodology -- Handbooks, manuals, etc. 2. Information storage and retrieval systems -- Research -- Handbooks, manuals, etc. 3. Research -- Statistical methods -- Handbooks, manuals, etc. I. Cooper, Harris M. II. Hedges, Larry V. III. Valentine, Jeff C. IV. Handbook of research synthesis.
Q180.55.M4H35 2009
001.42 -- dc22    2008046477

Copyright © 2009 by Russell Sage Foundation. All rights reserved. Printed in the United States of
America. No part of this publication may be reproduced, stored in a retrieval system, or transmit-
ted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise,
without the prior written permission of the publisher.

Reproduction by the United States Government in whole or in part is permitted for any purpose.

The paper used in this publication meets the minimum requirements of American National Standard for Information Sciences -- Permanence of Paper for Printed Library Materials. ANSI Z39.48-1992.

RUSSELL SAGE FOUNDATION


112 East 64th Street, New York, New York 10065
10 9 8 7 6 5 4 3 2 1
To the authors of the Handbook chapters,
for their remarkable cooperation and commitment
to the advance of science and society
CONTENTS

Foreword xi
About the Authors xv

PART I INTRODUCTION 1
1. Research Synthesis as a Scientific Process 3
Harris Cooper and Larry V. Hedges

PART II FORMULATING A PROBLEM 17


2. Hypotheses and Problems in Research Synthesis 19
Harris Cooper
3. Statistical Considerations 37
Larry V. Hedges

PART III SEARCHING THE LITERATURE 49


4. Scientific Communication and Literature Retrieval 51
Howard D. White
5. Using Reference Databases 73
Jeffrey G. Reed and Pam M. Baxter
6. Grey Literature 103
Hannah R. Rothstein and Sally Hopewell

PART IV CODING THE LITERATURE 127


7. Judging the Quality of Primary Research 129
Jeffrey C. Valentine


8. Identifying Interesting Variables and Analysis Opportunities 147


Mark W. Lipsey
9. Systematic Coding 159
David B. Wilson
10. Evaluating Coding Decisions 177
Robert G. Orwin and Jack L. Vevea

PART V STATISTICALLY DESCRIBING STUDY OUTCOMES 205


11. Vote-Counting Procedures in Meta-Analysis 207
Brad J. Bushman and Morgan C. Wang
12. Effect Sizes for Continuous Data 221
Michael Borenstein
13. Effect Sizes for Dichotomous Data 237
Joseph L. Fleiss and Jesse A. Berlin

PART VI STATISTICALLY COMBINING EFFECT SIZES 255


14. Combining Estimates of Effect Size 257
William R. Shadish and C. Keith Haddock
15. Analyzing Effect Sizes: Fixed-Effects Models 279
Spyros Konstantopoulos and Larry V. Hedges
16. Analyzing Effect Sizes: Random-Effects Models 295
Stephen W. Raudenbush
17. Correcting for the Distorting Effects of Study Artifacts in
Meta-Analysis 317
Frank L. Schmidt, Huy Le, and In-Sue Oh

PART VII SPECIAL STATISTICAL ISSUES AND PROBLEMS 335


18. Effect Sizes in Nested Designs 337
Larry V. Hedges
19. Stochastically Dependent Effect Sizes 357
Leon J. Gleser and Ingram Olkin
20. Model-Based Meta-Analysis 377
Betsy Jane Becker

PART VIII DATA INTERPRETATION 397


21. Handling Missing Data 399
Therese D. Pigott
22. Sensitivity Analysis and Diagnostics 417
Joel B. Greenhouse and Satish Iyengar
23. Publication Bias 435
Alex J. Sutton

PART IX TYING RESEARCH SYNTHESIS TO SUBSTANTIVE ISSUES 453


24. Advantages of Certainty and Uncertainty 455
Wendy Wood and Alice H. Eagly
25. Research Synthesis and Public Policy 473
David S. Cordray and Paul Morphy

PART X REPORTING THE RESULTS 495


26. Visual and Narrative Interpretation 497
Geoffrey D. Borman and Jeffrey A. Grigg
27. Reporting Format 521
Mike Clarke

PART XI SUMMARY 535


28. Threats to the Validity of Generalized Inferences 537
Georg E. Matt and Thomas D. Cook
29. Potentials and Limitations 561
Harris Cooper and Larry V. Hedges

Glossary 573
Appendix A 585
Appendix B 591
Index 595
FOREWORD

The past decade has been a time of enormous growth in the need for and use of research syntheses in the social and behavioral sciences. The need was given impetus by the increase in social research that began in the 1960s and continues unabated today. Over the past three decades, methods for the retrieval and analysis of research literatures have undergone enormous change. Literature reviewing used to be a narrative and subjective process. Today, research synthesis has a body of procedural and statistical techniques of its own. Because of the potential pitfalls that confront research synthesists, it is critical that the methods used to integrate the results of studies be rigorous and open to public inspection.

Like its predecessor, the second edition of the Handbook of Research Synthesis and Meta-Analysis is an attempt to bring together in a single volume state-of-the-art descriptions of every phase of the research synthesis process, from problem formulation to report writing. Readers will find that it is comprised of chapters intended for use by scholars very new to the research synthesis enterprise (for example, by helping them think about how to formulate a problem and how to construct a coding guide) as well as by experienced research synthesists looking for answers to highly technical questions (for example, options for addressing stochastically dependent effect sizes).

The content of this second edition of the Handbook is modeled on that of the very successful first edition. Determining this content required us to use multiple decision criteria. First, by relying on personal experience and knowledge of previous books in the area, we identified the central decision points in conducting a research synthesis. This led to the creation of several major divisions of the text as well as specific chapters (for example, vote-counting procedures, combining effect sizes, and correcting for sources of artifactual variation across studies). Next, we examined the issues involved within the major tasks and broke them down so that it would be manageable to cover the required information within a single chapter. Thus, within the topic of evaluating the literature, separate chapters are devoted to judging research quality and evaluating the reliability of coding decisions, among others. We supplemented the resulting list of chapters with others on special topics that cut across the various stages of research synthesis but were important enough to merit focused attention. These topics included missing data, dependent effect sizes, and using research synthesis to develop theories. For this edition, twenty-nine chapters emerged with forty authors drawn from at least five distinct disciplines: information science, statistics, psychology, education, and social policy analysis. Nineteen of the chapter authors also served as peer reviewers for other chapters, along with twenty-one scholars who took part in the review process only.

In the course of attempting to coordinate such an effort, we were faced with numerous editorial decisions. Perhaps the most critical involved how to handle overlap in coverage and content across chapters. Because we wanted the Handbook to be exhaustive, we defined topics for chapters narrowly, but anticipated that authors would likely define their domains more broadly. They did. The issue for us, then, became whether to edit out the redundancy or leave it in. We chose to leave it in for several reasons. First, there is considerable insight to be gained when the same issue is treated from multiple perspectives. Second, the repetition of topics helps readers understand some of the most pervasive problems that confront a research synthesist. Finally, the overlap allows chapters to stand on their own, as more complete discussions of each topic, thus increasing their individual usefulness for teaching.

Another important issue involved mathematical notation. Although some notation in meta-analysis has become relatively standard, few symbols are used universally. In addition, because each statistical chapter contains several unique mathematical concepts, we had no guarantee that the same symbols might not be used to represent different concepts in different chapters. To minimize these problems, we proposed that authors use the notational system that was developed for the first edition of the Handbook. This system is not perfect, however, and the reader may discover, when moving from one chapter to the next, that the same symbol has been used to represent different concepts or that different symbols are used for the same concept. However, each chapter introduces its notation, and we have worked to standardize the notation as much as possible in order to minimize any confusion.

In addition to the difficulties associated with mathematical notation, given the number of diverse disciplines contributing to the Handbook, we were faced with problems involving more general terminology. Again, we expected that authors would occasionally use different terms for the same concept and the same terms for different concepts. Further, some concepts would be new to readers in any particular discipline. Therefore, we asked authors to contribute to a glossary by defining key terms in their chapter, which we edited for style and overlap. The glossary for this second edition of the Handbook is significantly updated from the first edition's. Such instruments are always flawed and incomplete, but we hope that this one facilitates the reader's passage through the book.

Related to this notion, this edition of the Handbook uses the Chicago Manual of Style, required of all books published by the Russell Sage Foundation. Readers more familiar with the style used by the American Psychological Association will notice that first names appear in the references and the first time authors are mentioned in the text, and the dates of publications are presented at the end of sentences rather than directly following the names of the authors.

Finally, a consideration that arises in any report of quantitative work concerns the conventions for deriving the numerical results given in the examples. Specifically, there are several possible ways that numerical results can be rounded for presentation. For example, when the final computations involve sums or quotients of intermediate results, the final numerical answer depends somewhat on whether and how the intermediate results are rounded. In the statistical examples, we have usually chosen to round intermediate computations, such as sums, to two places. Thus, hand computations using the intermediate values should yield results corresponding to our examples. If computations are carried out directly using a different rounding convention, the results might differ somewhat from those given. Readers who experience slight discrepancies between their computations and those presented in the text should not be alarmed (a brief illustrative sketch of this rounding issue follows the foreword).

The Russell Sage Foundation has been providing support for more than twenty years to efforts that lead directly to the production of this handbook. The Foundation, along with its president, Eric Wanner, initiated a program to improve and disseminate state-of-the-art methods in research synthesis. This is the fourth volume to emerge from that effort. The first, The Future of Meta-Analysis, was published by the Foundation in 1990. It contained the proceedings of a workshop convened by the Committee on National Statistics of the National Research Council. The second volume, Meta-Analysis for Explanation: A Casebook, presented a series of four applications of meta-analysis to highlight the potential of meta-analysis for generating and testing explanations of social phenomena as well as testing the effectiveness of social and medical programs. The third volume was the first edition of this handbook.

An effort as large as the Handbook cannot be undertaken without the help of many individuals. This volume is dedicated to the authors of the chapters. These scholars were ably assisted by reviewers, some of whom authored other chapters in the Handbook and some of whom kindly lent their expertise to the project. The chapter reviewers were:

Dolores Albarracín, Department of Psychology, University of Illinois at Urbana-Champaign

Pam M. Baxter, Cornell Institute for Social and Economic Research, Cornell University

Betsy Jane Becker, College of Education, Florida State University

Jesse A. Berlin, Johnson & Johnson Pharmaceutical Research and Development, Titusville, NJ

Michael Borenstein, Biostat, Englewood, New Jersey

Geoffrey D. Borman, School of Education, University of Wisconsin, Madison

Robert F. Boruch, Graduate School of Education, University of Pennsylvania

Brad J. Bushman, Institute for Social Research, University of Michigan, and VU University, Amsterdam, the Netherlands

Mike Clarke, U.K. Cochrane Centre, Oxford

Thomas D. Cook, Institute for Policy Research, Northwestern University

Alice H. Eagly, Department of Psychology, Northwestern University

Alan Gomersall, Centre for Evidence and Policy, King's College London

Lesley Grayson, Centre for Evidence and Policy, King's College London

Sander Greenland, Departments of Epidemiology and Statistics, University of California, Los Angeles

Adam R. Hafdahl, Department of Mathematics, Washington University in St. Louis

Rebecca Herman, AIR, Washington, D.C.

Julian P. T. Higgins, MRC Biostatistics Unit, Cambridge

Janet S. Hyde, Department of Psychology, University of Wisconsin

Sema Kalaian, College of Technology, Eastern Michigan University

Spyros Konstantopoulos, Lynch School of Education, Boston College

Julia Littell, Graduate School of Social Work and Social Research, Bryn Mawr College

Michael A. McDaniel, School of Business, Virginia Commonwealth University

David Myers, AIR, Washington, D.C.

Ingram Olkin, Department of Statistics, Stanford University

Andy Oxman, Preventive and International Health Care Unit, Norwegian Knowledge Centre for the Health Services

Anthony Petrosino, WestED, Boston, Massachusetts

Mark Petticrew, Department of Public Health and Policy, London School of Hygiene and Tropical Medicine

Therese D. Pigott, School of Education, Loyola University Chicago

Andrew C. Porter, Graduate School of Education, University of Pennsylvania

Stephen W. Raudenbush, Department of Sociology, University of Chicago

Jeffrey G. Reed, School of Business, Marian University

David Rindskopf, Department of Psychology, City University of New York

Hannah R. Rothstein, Zicklin School of Business, Baruch College

Paul R. Sackett, Department of Psychology, University of Minnesota

Julio Sánchez-Meca, Faculty of Psychology, University of Murcia, Spain

William R. Shadish, Department of Psychological Sciences, University of California, Merced

Jack L. Vevea, Department of Psychology, University of California, Merced

Anne Wade, Centre for the Study of Learning and Performance, Concordia University

Howard D. White, College of Information Science and Technology, Drexel University

David B. Wilson, Department of Administration of Justice, George Mason University

We also received personal assistance from a host of capable individuals. At Duke University, Melissa Pascoe and Sunny Powell managed the financial transactions for the Handbook. At Northwestern University, Patti Ferguson provided logistical support. At the University of Louisville, Aaron Banister helped assemble the glossary. At the Russell Sage Foundation, Suzanne Nichols oversaw the project. To the chapter authors, reviewers, and others who helped us with our work, we express our deepest thanks.

Harris Cooper
Larry Hedges
Jeff Valentine
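The following sketch illustrates the rounding convention described in the foreword. It is an illustration only, not an excerpt from the Handbook, and its numbers are invented: rounding intermediate sums to two places before a final division, as the volume's statistical examples usually do, can shift the final answer slightly relative to carrying full precision throughout.

# Invented numbers illustrating the foreword's rounding convention.
weights = [3.157, 2.842, 4.019]
effects = [0.214, 0.376, 0.158]

products = [w * t for w, t in zip(weights, effects)]

# Full-precision computation.
full = sum(products) / sum(weights)

# Intermediate sums rounded to two places before the final division,
# mirroring the convention described in the foreword.
rounded = round(sum(products), 2) / round(sum(weights), 2)

print(f"full precision:        {full:.6f}")
print(f"rounded intermediates: {rounded:.6f}")
# The two printed values differ in the later decimal places; this is the
# kind of slight discrepancy the foreword says should not alarm readers.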
ABOUT THE AUTHORS

Pam M. Baxter is a data archivist at the Cornell Institute for Social and Economic Research (CISER) of Cornell University.

Betsy Jane Becker is professor of educational psychology and learning systems, College of Education, Florida State University.

Jesse A. Berlin is a researcher with Johnson & Johnson Pharmaceutical Research and Development.

Michael Borenstein is director of Biostat in Englewood, New Jersey.

Geoffrey D. Borman is professor of educational leadership and policy analysis, educational psychology, and educational policy studies at the University of Wisconsin, Madison.

Brad J. Bushman is research professor at the Institute for Social Research, University of Michigan, and visiting professor, VU University, Amsterdam, the Netherlands.

Mike Clarke is a researcher at the U.K. Cochrane Centre in Oxford, United Kingdom.

Thomas D. Cook is professor of sociology, psychology, education, and social policy, Joan and Serepta Harrison Chair in Ethics and Justice, and faculty fellow of the Institute for Policy Research at Northwestern University.

Harris Cooper is professor of psychology and neuroscience at Duke University.

David S. Cordray is professor of public policy and of psychology at Vanderbilt University.

Alice H. Eagly is James Padilla Chair of Arts and Sciences, professor of psychology, faculty fellow of the Institute for Policy Research, and department chair of psychology at Northwestern University.

Joseph L. Fleiss was professor of biostatistics at the Mailman School of Public Health, Columbia University.

Leon J. Gleser is professor of statistics at the University of Pittsburgh.

Joel B. Greenhouse is professor of statistics at Carnegie Mellon University.

Jeffrey A. Grigg is project assistant at the Wisconsin Center for Education Research at the University of Wisconsin, Madison.

C. Keith Haddock is associate professor of behavioral health at the University of Missouri, Kansas City.

Larry V. Hedges is Board of Trustees Professor of Statistics and Social Policy and faculty fellow at the Institute for Policy Research, Northwestern University.

Sally Hopewell is a research scientist at the U.K. Cochrane Centre.

Satish Iyengar is professor of statistics in the Department of Statistics at the University of Pittsburgh.

Spyros Konstantopoulos is assistant professor at the Lynch School of Education, Boston College.

Huy Le is assistant professor of psychology at the University of Central Florida.


Mark W. Lipsey is director of the Center for Evaluation Research and Methodology, and a senior research associate at the Vanderbilt Institute for Public Policy Studies, Vanderbilt University.

Georg E. Matt is professor of psychology at San Diego State University.

Paul Morphy is a predoctoral fellow at Vanderbilt University.

In-Sue Oh is a doctoral candidate at the Henry B. Tippie College of Business, University of Iowa.

Ingram Olkin is professor of statistics and of education and CHP/PCOR Fellow at Stanford University.

Robert G. Orwin is a senior study director with Westat, Inc.

Therese D. Pigott is associate professor at the School of Education, Loyola University Chicago.

Stephen W. Raudenbush is Lewis-Sebring Professor in the Department of Sociology and the College Chair of the Committee on Education at the University of Chicago.

Jeffrey G. Reed is professor of management and dean of the School of Business at Marian University.

Hannah R. Rothstein is professor of management at the Zicklin School of Business, Baruch College, The City University of New York.

Frank L. Schmidt is Gary Fethke Professor of Management and Organizations at the Henry B. Tippie College of Business, University of Iowa.

William R. Shadish is professor and Founding Faculty at the University of California, Merced.

Alex J. Sutton is a reader in Medical Statistics in the Department of Health Sciences at the University of Leicester.

Jeffrey C. Valentine is assistant professor in the Department of Educational and Counseling Psychology at the College of Education and Human Development, University of Louisville.

Jack L. Vevea is associate professor of psychology at the University of California, Merced.

Morgan C. Wang is professor of statistics at the University of Central Florida.

Howard D. White is professor emeritus at the College of Information Science and Technology, Drexel University.

David B. Wilson is associate professor in the Department of Administration of Justice, George Mason University.

Wendy Wood is James B. Duke Professor of Psychology and Neuroscience and professor of marketing at Duke University.
1
RESEARCH SYNTHESIS AS A SCIENTIFIC PROCESS

HARRIS COOPER
Duke University

LARRY V. HEDGES
Northwestern University

CONTENTS
1.1 Introduction 4
1.1.1 Replication and Research Synthesis 4
1.2 Research Synthesis in Context 4
1.2.1 A Definition of the Literature Review 4
1.2.2 A Definition of Research Synthesis 6
1.3 A Brief History of Research Synthesis as a Scientific Process 7
1.3.1 Early Developments 7
1.3.2 Research Synthesis Comes of Age 7
1.3.3 Rationale for the Handbook 10
1.4 Stages of Research Synthesis 11
1.4.1 Problem Formulation 11
1.4.2 Literature Search 12
1.4.3 Data Evaluation 12
1.4.4 Data Analysis 13
1.4.5 Interpretation of Results 14
1.4.6 Public Presentation 14
1.5 Chapters in Perspective 15
1.6 References 15


It is necessary, while formulating the problems of which in our advance we are to find
the solutions, to call into council the views of those of our predecessors who have de-
clared an opinion on the subject, in order that we may profit by whatever is sound in
their suggestions and avoid their errors.
Aristotle, De Anima, Book 1, chapter 2

1.1 INTRODUCTION

The moment we are introduced to science we are told it is a cooperative, cumulative enterprise. Like the artisans who construct a building from blueprints, bricks, and mortar, scientists contribute to a common edifice, called knowledge. Theorists provide our blueprints and researchers collect the data that are our bricks.

To extend the analogy further yet, we might say that research synthesists are the bricklayers and hodcarriers of the science guild. It is their job to stack the bricks according to plan and apply the mortar that makes the whole thing stick.

Anyone who has attempted a research synthesis is entitled to a wry smile as the analogy continues. They know that several sets of theory-blueprints often exist, describing structures that vary in form and function, with few a priori criteria for selecting between them. They also know that our data-bricks are not all six-sided and right-angled. They come in a baffling array of sizes and shapes. Making them fit, securing them with mortar, and seeing whether the resulting construction looks anything like the blueprint is a challenge worthy of the most dedicated, inspired artisan.

1.1.1 Replication and Research Synthesis

Scientific literatures are cluttered with repeated studies of the same phenomena. Multiple studies on the same problem or hypothesis arise because investigators are unaware of what others are doing, because they are skeptical about the results of past investigations, or because they wish to extend (that is, generalize or search for influences on) previous findings. Experience has shown that even when considerable effort is made to achieve strict replication, results across studies are rarely identical at any high level of precision, even in the physical sciences (see Hedges 1987). No two bricks are exactly alike.

How should scientists proceed when results differ? First, it is clear how they should not proceed: they should not pretend that there is no problem, or decide that just one study (perhaps the most recent one, or the one they conducted, or a study chosen via some other equally arbitrary criterion) produced the correct finding. If results that are expected to be very similar show variability, the scientific instinct should be to account for the variability by further systematic work.

1.2 RESEARCH SYNTHESIS IN CONTEXT

1.2.1 A Definition of the Literature Review

The American Psychological Association's PsycINFO reference database defines a literature review as "the process of conducting surveys of previously published material" (http://gateway.uk.ovid.com). Common to all definitions of literature reviews is the notion that they are "not based primarily on new facts and findings, but on publications containing such primary information, whereby the latter is digested, sifted, classified, simplified, and synthesized" (Manten 1973, 75).

Table 1.1 presents a taxonomy of literature reviews that captures six distinctions used by literature review authors to describe their own work (Cooper 1988, 2003). The taxonomy was developed and can be applied to literature reviews appearing throughout a broad range of both the behavioral and physical sciences. The six features and their subordinate categories permit a rich level of distinction among works of synthesis.

Table 1.1 A Taxonomy of Literature Reviews

Characteristic   Categories

Focus            research findings; research methods; theories; practices or applications

Goal             integration (generalization, conflict resolution, linguistic bridge-building); criticism; identification of central issues

Perspective      neutral representation; espousal of position

Coverage         exhaustive; exhaustive with selective citation; representative; central or pivotal

Organization     historical; conceptual; methodological

Audience         specialized scholars; general scholars; practitioners or policy makers; general public

source: Cooper 1988. Reprinted with permission from Transaction Publishers.

The first distinction among literature reviews concerns the focus of the review, or the material that is of central interest to the reviewer. Most literature reviews center on one or more of four areas: the findings of individual primary studies; the methods used to carry out research; theories meant to explain the same or related phenomena; and the practices, programs, or treatments being used in an applied context.

The second characteristic of a literature review is its goals. Goals concern what the preparers of the review hope to accomplish. The most frequent goal for a review is to integrate past literature that is believed to relate to a common topic. Integration includes: formulating general statements that characterize multiple specific instances (of research, methods, theories, or practices); resolving conflict between contradictory results, ideas, or statements of fact by proposing a new conception that accounts for the inconsistency; and bridging the gap between concepts or theories by creating a new, common linguistic framework.

Another goal for literature reviews can be to critically analyze the existing literature. Unlike a review that seeks to integrate existing work, a review that involves a critical assessment does not necessarily summate conclusions or compare the covered works one to another. Instead, it holds each work up against a criterion and finds it more or less acceptable. Most often, the criterion will include issues related to the methodological quality if empirical studies are involved, the logical rigor, completeness, or breadth of explanation if theories are involved, or comparison with the ideal treatment when practices, policies, or applications are involved.

A third goal that often motivates literature reviews is to identify issues central to a field. These issues may include questions that have given rise to past work, questions that should stimulate future work, and methodological problems or problems in logic and conceptualization that have impeded progress within a topic area or field.

Of course, reviews more often than not have multiple goals. So, for example, it is rare to see the integration or critical examination of existing work without also seeing the identification of central issues for future endeavors.

A third characteristic that distinguishes among literature reviews, perspective, relates to whether the reviewers have an initial point of view that might influence the discussion of the literature. The endpoints on the continuum of perspective might be called neutral representation and espousal of a position. In the former, reviewers attempt to present all arguments or evidence for and against various interpretations of the problem. The presentation is meant to be as similar as possible to those that would be provided by the originators of the arguments or evidence. At the opposite extreme of perspective, the viewpoints of reviewers play an active role in how material is presented. The reviewers accumulate and synthesize the literature in the service of demonstrating the value of the particular point of view that they espouse. The reviewers muster arguments and evidence to present their contentions in the most convincing manner.

Of course, reviewers attempting to achieve complete neutrality are likely doomed to failure. Further, reviewers who attempt to present all sides of an argument do not preclude themselves from ultimately taking a strong position based on the cumulative evidence. Similarly, reviewers can be thoughtful and fair while presenting conflicting evidence or opinions and still advocate for a particular interpretation.

The next characteristic, coverage, concerns the extent to which reviewers find and include relevant works in their paper. It is possible to distinguish at least four types of coverage. The first type, exhaustive coverage, suggests that the reviewers hope to be comprehensive in the presentation of the relevant work. An effort is made to include the entire literature and to base conclusions and discussions on this comprehensive information base. The second type of coverage also bases conclusions on entire literatures, but only a selection of works is actually described in the literature review. The authors choose a purposive sample of works to cite but claim that the inferences drawn are based on a more extensive literature. Third, some reviewers will present works that are broadly representative of many other works in a field. They hope to describe just a few exemplars that are descriptive of numerous other works. The reviewers discuss the characteristics that make the chosen works paradigmatic of the larger group. In the final coverage strategy, reviewers concentrate on works that were highly original when they appeared and influenced the development of future efforts in the topic area. This may include materials that initiated a line of investigation or thinking, changed how questions were framed, introduced new methods, engendered important debate, or performed a heuristic function for other scholars.

A fifth characteristic of literature reviews concerns how a paper is organized. Reviews may be arranged historically, so that topics are introduced in the chronological order in which they appeared in the literature; conceptually, so that works relating to the same abstract ideas appear together; or methodologically, so that works employing similar methods are grouped together.

Finally, the intended audiences of reviews can vary. Reviews can be written for groups of specialized researchers, general researchers, policy makers, practitioners, or the general public. As reviewers move from addressing specialized researchers to addressing the general public, they employ less technical jargon and detail, while often paying greater attention to the implications of the work being covered.

1.2.2 A Definition of Research Synthesis

The terms research synthesis, research review, and systematic review are often used interchangeably in the social science literature, though they sometimes connote subtly different meanings. Regrettably, there is no consensus about whether these differences are really meaningful. Therefore, we will use the term research synthesis most frequently throughout this book. The reason for this choice is simple. In addition to its use in the context of research synthesis, the term research review is used as well to describe the activities of evaluating the quality of research. For example, a journal editor will obtain research reviews when deciding whether to publish a manuscript. Because research syntheses often include this type of evaluative review of research, using the term research synthesis avoids confusion. The term systematic review is less often used in the context of research evaluation, though the confusion is still there, and also it is a term less familiar to social scientists than is research synthesis.

A research synthesis can be defined as the conjunction of a particular set of literature review characteristics. Most definitional about research syntheses are their primary focus and goal: research syntheses attempt to integrate empirical research for the purpose of creating generalizations. Implicit in this definition is the notion that seeking generalizations also involves seeking the limits of generalizations. Also, research syntheses almost always pay attention to relevant theories, critically analyze the research they cover, try to resolve conflicts in the literature, and attempt to identify central issues for future research. According to Derek Price, research syntheses are intended to replace "those papers that have been lost from sight behind the research front" (1965, 513). Research synthesis is one of a broad array of integrative activities that scientists engage in; as demonstrated in the quote leading this chapter, its intellectual heritage can be traced back at least as far as Aristotle.

Using the taxonomy described, we can make further specifications concerning the type of research syntheses that are the focus of this book. With regard to perspective, readers will note that much of the material is meant to help synthesists produce statements about evidence that are neutral in perspective, that is, less likely to be affected by bias or by their own subjective outlooks. The material on searching the literature for evidence is meant to help synthesists uncover all the evidence, not simply positive studies (that might be overrepresented in published research) or evidence that is easy for them to find (that might be overly sympathetic to their point of view or disciplinary perspective). The material on the reliability of extracting information from research reports and how methodological variations in research should be handled is meant to increase the interjudge reliability of these processes. The methods proposed for the statistical integration of findings are meant to ensure that the same rules about data analysis are consistently applied to all conclusions about what is and is not significant, both statistically and practically. Finally, the material on explicit and exhaustive reporting of the methods used in syntheses is meant to assist users in evaluating if or where subjectivity may have crept into the process and to replicate findings for themselves if they choose to do so.

Finally, the term meta-analysis often is used as a synonym for research synthesis. However, in this volume, it will be used in its more precise and original meaning: to describe the quantitative procedures that a research synthesist may use to statistically combine the results of studies. Gene Glass coined the term to refer to "the statistical analysis of a large collection of analysis results from individual studies for the purpose of integrating the findings" (1976, 3). The authors of the Handbook reserve the term to refer specifically to statistical analysis in research synthesis and not to the entire enterprise of research synthesis. Not all research syntheses are appropriate for meta-analysis.

1.3 A BRIEF HISTORY OF RESEARCH SYNTHESIS AS A SCIENTIFIC ENTERPRISE

1.3.1 Early Developments

In 1971, Kenneth Feldman demonstrated remarkable prescience: "Systematically reviewing and integrating . . . the literature of a field may be considered a type of research in its own right, one using a characteristic set of research techniques and methods" (86). He described four steps in the synthesis process: sampling topics and studies, developing a scheme for indexing and coding material, integrating the studies, and writing the report.

The same year, Richard Light and Paul Smith (1971) presented what they called a cluster approach to research synthesis, meant to redress some of the deficiencies in the existing strategies for integration. They argued that, if treated properly, the variation in outcomes among related studies could be a valuable source of information, rather than merely a source of consternation, as it appeared to be when treated with traditional cumulating methods.

Three years later, Thomas Taveggia struck a complementary theme:

    A methodological principle overlooked by [synthesists] . . . is that research results are probabilistic. What this principle suggests is that, in and of themselves, the findings of any single research are meaningless; they may have occurred simply by chance. It also follows that, if a large enough number of researches has been done on a particular topic, chance alone dictates that studies will exist that report inconsistent and contradictory findings! Thus, what appears to be contradictory may simply be the positive and negative details of a distribution of findings. (1974, 397-98)

Taveggia went on to describe six common problems in research synthesis: selecting research; retrieving, indexing, and coding information from studies; analyzing the comparability of findings; accumulating comparable findings; analyzing distributions of results; and reporting of results.

The development of meta-analytic techniques extends back further in time, but their routine use by research synthesists is also relatively recent. Where Glass gave us the term meta-analysis in 1976, Ingram Olkin pointed out that ways to estimate effect sizes have existed since the turn of the century (1990). For example, Pearson took the average of estimates from five separate samples of the correlation between inoculation for enteric fever and mortality (1904). He used this average to better estimate the typical effect of inoculation and to compare this effect with that of inoculation for other diseases. Early work on the methodology for combination of estimates across studies includes papers in the physical sciences by Raymond Birge (1932) and in statistics by William G. Cochran (1937), Frank Yates and Cochran (1938), and Cochran (1954).

Methods for combining probabilities across studies also have a long history, dating at least from procedures suggested in Leonard Tippett's The Method of Statistics (1931) and Ronald Fisher's Statistical Methods for Research Workers (1932). The most frequently used probability-combining method was described by Frederick Mosteller and Robert Bush in the first edition of the Handbook of Social Psychology more than fifty years ago (1954).

Still, the use of quantitative synthesis techniques in the social sciences was rare before the 1970s. Late in that decade, several applications of meta-analytic techniques captured the imagination of behavioral scientists: in clinical psychology, Mary Smith and Gene Glass's meta-analysis of psychotherapy research (1977); in industrial-organizational psychology, Frank Schmidt and John Hunter's validity generalization of employment tests; in social psychology, Robert Rosenthal and Donald Rubin's integration of interpersonal expectancy effect research (1977); and, in education, Glass and Smith's synthesis of the literature on class size and achievement (1978).

1.3.2 Research Synthesis Comes of Age

Two papers that appeared in the Review of Educational Research in the early 1980s brought the meta-analytic and synthesis-as-research perspectives together. The first proposed six synthesis tasks "analogous to those performed during primary research" (Jackson 1980, 441). Gregg Jackson portrayed meta-analysis as an aid to the task of analyzing primary studies but emphasized its limitations as well as its strengths. Also noteworthy about his paper was his use of a sample of thirty-six articles from prestigious social science periodicals to examine the methods used in integrative empirical syntheses. He reported, for example, that only one of the thirty-six syntheses reported the indexes or retrieval systems used to locate primary studies. He concluded that "relatively little thought has been given to the methods for doing integrative reviews. Such reviews are critical to science and social policy making and yet most are done far less rigorously than is currently possible" (459).

Harris Cooper drew the analogy between research synthesis and primary research to its logical conclusion (1982). He presented a five-stage model of synthesis as a research project. For each stage, he codified the research question asked, its primary function in the synthesis, and the procedural differences that might cause variation in synthesis conclusions. In addition, Cooper applied the notion of threats-to-inferential-validity, which Donald Campbell and Julian Stanley introduced for evaluating the utility of primary research designs, to the conduct of research synthesis (1966; also see Shadish, Cook, and Campbell 2002). Cooper identified ten threats to validity specifically associated with synthesis procedures that might undermine the trustworthiness of a research synthesis's findings. He also suggested that other threats might exist and that the validity of any particular synthesis could be threatened by consistent deficiencies in the set of studies that formed its database. Table 1.2 presents a recent revision of this schema (Cooper 2007) that proposes a six-stage model that separates the original analysis and interpretation stage into two distinct stages.

The first half of the 1980s also witnessed the appearance of four books primarily devoted to meta-analytic methods. In 1981, Gene Glass, Barry McGaw, and Mary Smith presented meta-analysis as a new application of analysis of variance and multiple regression procedures, with effect sizes treated as the dependent variable. In 1982, John Hunter, Frank Schmidt, and Greg Jackson introduced meta-analytic procedures that focused on comparing the observed variation in study outcomes to that expected by chance (the statistical realization of Taveggia's point) and correcting observed effect size estimates and their variance for known sources of bias (for example, sampling error, range restrictions, unreliability of measurements). In 1984, Rosenthal presented a compendium of meta-analytic methods covering, among other topics, the combining of significance levels, effect size estimation, and the analysis of variation in effect sizes. Rosenthal's procedures for testing moderators of variation in effect sizes were not based on traditional inferential statistics, but on a new set of techniques involving assumptions tailored specifically for the analysis of study outcomes. Finally, in 1985, with the publication of Statistical Procedures for Meta-Analysis, Hedges and Olkin helped to elevate the quantitative synthesis of research to an independent specialty within the statistical sciences. This book, summarizing and expanding nearly a decade of programmatic developments by the authors, not only covered the widest array of meta-analytic procedures but also presented rigorous statistical proofs establishing their legitimacy. (A brief illustrative sketch of these combination techniques appears at the end of this section.)

Another text that appeared in 1984 also helped elevate research synthesis to a more rigorous level. Richard Light and David Pillemer focused on the use of research syntheses to help decision-making in the social policy domain. Their approach placed special emphasis on the importance of meshing both numbers and narrative for the effective interpretation and communication of synthesis results.

Several other books appeared on the topic of meta-analysis during the 1980s and early 1990s. Some of these treated the topic generally (for the most recent editions, see, for example, Cooper 2009; Hunter and Schmidt 2004; Lipsey and Wilson 2001; Wolf 1986), some treated it from the perspective of particular research design conceptualizations (for example, Mullen 1989), some were tied to particular software packages (for example, Johnson 1993), and some looked to the future of research synthesis as a scientific endeavor (for example, Wachter and Straf 1990; Cook et al. 1992). The first edition of the Handbook of Research Synthesis appeared in 1994.

Readers interested in a popular history of meta-analysis can consult Morton Hunt's 1997 How Science Takes Stock: The Story of Meta-Analysis. In 2006, Mark Petticrew and Helen Roberts provided a brief unsystematic history of systematic reviews that includes discussion of early attempts to appraise the quality of studies and synthesize results. In 1990, Olkin did the same for meta-analysis.

Literally thousands of research syntheses have been published since the first edition of the Handbook in 1994. Figure 1.1 presents some evidence of the increasing impact of research syntheses on knowledge in the sciences and social sciences. The figure is based on entries in the Science Citation Index Expanded and the Social Sciences Citation Index, according to the Web of Science reference database (2006). It charts the growth in the number of citations to documents including the terms research synthesis, systematic review, research review, and meta-analysis in their title or abstract during the years 1995 to 2005, and this number of citations divided by the total number of documents in the two databases. The figure indicates that both the total number of citations and the number of citations per document in the databases have risen every year without exception. Clearly, the role that research syntheses play in our knowledge claims is large and growing larger.
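To make the combination techniques surveyed in this history concrete, the following sketch shows the two standard forms that later chapters develop in detail: Fisher's combined chi-square test for pooling probabilities, and the inverse-variance weighted mean with its homogeneity statistic Q. This is an illustration only, not an excerpt from the Handbook; the p-values, effect sizes, and sampling variances are invented.

# Illustrative sketch with invented data; not an excerpt from the Handbook.
import math
from scipy.stats import chi2

# --- Combining probabilities (Tippett 1931; Fisher 1932) ---
# Fisher's method: X^2 = -2 * sum(ln p_i) is chi-square distributed with
# 2k degrees of freedom when all k null hypotheses are true.
p_values = [0.04, 0.20, 0.11, 0.38]            # hypothetical one-tailed p values
x2 = -2 * sum(math.log(p) for p in p_values)
combined_p = chi2.sf(x2, df=2 * len(p_values))
print(f"Fisher chi-square = {x2:.2f}, combined p = {combined_p:.4f}")

# --- Combining estimates (Birge 1932; Cochran 1937; Hedges and Olkin 1985) ---
# Fixed-effect weighted mean: each estimate T_i is weighted by the inverse
# of its sampling variance v_i.
effects = [0.30, 0.12, 0.45, 0.26]             # hypothetical effect sizes
variances = [0.010, 0.020, 0.015, 0.008]       # hypothetical sampling variances
weights = [1.0 / v for v in variances]
t_bar = sum(w * t for w, t in zip(weights, effects)) / sum(weights)
se = math.sqrt(1.0 / sum(weights))

# Homogeneity statistic: Q compares the observed variation among the T_i
# to the variation expected from sampling error alone (Taveggia's point,
# and the comparison at the heart of the Hunter-Schmidt procedures).
q = sum(w * (t - t_bar) ** 2 for w, t in zip(weights, effects))
q_p = chi2.sf(q, df=len(effects) - 1)
print(f"weighted mean = {t_bar:.3f} (SE = {se:.3f}), Q = {q:.2f}, p = {q_p:.3f}")

A Q value that is large relative to its chi-square reference distribution signals more between-study variation than sampling error alone would produce; this is the situation that the random-effects models of chapter 16 are designed to address.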

Table 1.2 Research Synthesis Conceptualized as a Research Process

Stage 1. Define the Problem
    Research question: What research evidence will be relevant to the problem or hypothesis of interest in the synthesis?
    Primary function: Define the variables and relationships of interest so that relevant and irrelevant studies can be distinguished.
    Procedural variation: Variation in the conceptual breadth and detail of definitions might lead to differences in the research operations deemed relevant and/or tested as moderating influences.

Stage 2. Collect the Research Evidence
    Research question: What procedures should be used to find relevant research?
    Primary function: Identify sources (e.g., reference databases, journals) and terms used to search for relevant research and extract information from reports.
    Procedural variation: Variation in searched sources and extraction procedures might lead to systematic differences in the retrieved research and what is known about each study.

Stage 3. Evaluate the Correspondence between Methods and Implementation of Studies and the Desired Synthesis Inferences
    Research question: What retrieved research should be included or excluded from the synthesis based on the suitability of the methods for studying the synthesis question or problems in research implementation?
    Primary function: Identify and apply criteria to separate correspondent from incommensurate research results.
    Procedural variation: Variation in criteria for decisions about study inclusion might lead to systematic differences in which studies remain in the synthesis.

Stage 4. Analyze (Integrate) the Evidence from Individual Studies
    Research question: What procedures should be used to summarize and integrate the research results?
    Primary function: Identify and apply procedures for combining results across studies and testing for differences in results between studies.
    Procedural variation: Variation in procedures used to analyze results of individual studies (narrative, vote count, averaged effect sizes) can lead to differences in cumulative results.

Stage 5. Interpret the Cumulative Evidence
    Research question: What conclusions can be drawn about the cumulative state of the research evidence?
    Primary function: Summarize the cumulative research evidence with regard to its strength, generality, and limitations.
    Procedural variation: Variation in criteria for labeling results as important and attention to details of studies might lead to differences in interpretation of findings.

Stage 6. Present the Synthesis Methods and Results
    Research question: What information should be included in the report of the synthesis?
    Primary function: Identify and apply editorial guidelines and judgment to determine the aspects of methods and results readers of the synthesis report need to know.
    Procedural variation: Variation in reporting might lead readers to place more or less trust in synthesis outcomes and influence others' ability to replicate results.

source: Cooper 2007.



[Figure 1.1 here: a combined chart plotting, by year from 1995 to 2005, the number of citations (bars, left axis, 0 to 4,000) and citations per document (lines, right axis, 0.00% to 0.30%).]

Figure 1.1 Citations to Articles, per Document

source: Authors' compilation.
note: Based on entries in the Science Citation Index Expanded and the Social Sciences Citation Index, according to the Web of Science reference database (retrieved June 28, 2006). Bars chart the growth in the number of citations to documents including the terms research synthesis, systematic review, research review, or meta-analysis in their title or abstract during the years following the publication of the first edition of the Handbook of Research Synthesis. Lines chart this number of citations divided by the total number of documents in the two databases.

In the past two decades the use of research synthesis (http://www.campbellcollaboration.org/) was begun with
has spread from psychology and education through many similar objectives for the domain of social policy analy-
disciplines, especially the medical sciences and social sis, focusing initially on policies concerning education,
policy analysis. Indeed, the development of scientific social welfare, and crime and justice.
methods for research synthesis has its own largely inde- Because of the efforts of scholars who chose to apply
pendent history in the medical sciences (see Chalmers, their skills to how research syntheses might be improved,
Hedges, and Cooper 2002). A most notable event in syntheses written since the 1980s have been held to stan-
medicine was the establishment of the U.K. Cochrane dards far more demanding than those applied to their pre-
Centre in 1992. The Centre was meant to facilitate the decessors. The process of elevating the rigor of syntheses
creation of an international network to prepare and main- is certain to continue into the twenty-first century.
tain systematic reviews of the effects of interventions
across the spectrum of health care practices. At the end 1.3.3 Rationale for the Handbook
of 1993, an international network of individuals, called
the Cochrane Collaboration (http://www.cochrane.org/ The Handbook of Research Synthesis and Meta-Analysis
index.htm), emerged from this initiative (Chalmers 1993; is meant to be the definitive volume for behavioral and
Bero and Rennie 1995). By 2006, the Cochrane Collabo- social scientists intent on applying the synthesis craft. It
ration was an internationally renowned initiative with distills the products of thirty years of developments in
11,000 people contributing to its work, in more than how research integrations should be conducted so as to
ninety countries. It is now the leading producer of re- minimize the chances of conclusions that are not truly re-
search syntheses in health care and is considered by flective of the cumulated evidence. Research synthesis in
many to be the gold standard for determining the effec- the 1960s was at best an art, at worst a form of yellow
tiveness of different health care interventions. Its library journalism. Today, the summarization and integration of
of systematic reviews numbers in the thousands. In studies is viewed as a research process in its own right, is
2000, an initiative called the Campbell Collaboration held to the standards of a scientific endeavor, and entails
RESEARCH SYNTHESIS AS A SCIENTIFIC PROCESS 11

the application of data gathering and analyses techniques syntheses require considerably more than just the appli-
developed for its unique purpose. cation of quantitative procedures.
Numerous excellent texts on research synthesis exist. None, however, are as comprehensive as this volume. Some texts focus on statistical methods. These often emphasize different aspects of statistical integration (for example, combining probabilities, regression-analog models, estimating population effects from sampled effects with known biases) and often approach research accumulation from different perspectives. Although these texts are complete within their domains, no single sourcebook describes and integrates all the meta-analytic approaches. This volume incorporates quantitative statistical techniques from all the synthesis traditions. It brings the leading authorities on the various meta-analytic perspectives together in a single volume. As such, it is an explicit statement by its authors that all the statistical approaches share a common assumptive base. The common base is not only statistical but also philosophical. Philosophically, all the approaches rest on the presupposition that research syntheses need to be held to the same standards of rigor, systematicity, and transparency as the research on which they are based. The second and later users of data must be held as accountable for the validity of their methods as were the first.

Several problems arising in the course of conducting a quantitative synthesis have not received adequate treatment in any existing text. These include nonindependence of data sets, synthesis of multivariate data sets, and sensitivity analysis, to name just a few. Every research synthesist faces these problems, and strategies have been developed for dealing with them. Some of their solutions are published in widely scattered journals, others are passed on to colleagues through informal contacts. They have never received complete treatment within the same text. The Handbook brings these topics together for the first time.

Further, texts focusing on the statistical aspects of integration tend to give only passing consideration to other activities of research synthesis. These activities include: the unique characteristics of problem formulation in research synthesis, methods of literature search, coding and evaluation of research reports, and the meaningful interpretation and effective communication of synthesis results. The existing texts that focus on these aspects of research synthesis tend not to be comprehensive in their coverage of statistical issues. Fully half of the chapters in this volume deal with issues that are not statistical, evidencing the authors' collective belief that high quality syntheses require considerably more than just the application of quantitative procedures.

Finally, the handbook is meant for those who carry out research syntheses. Discussions of theory and proof are kept to a minimum in favor of descriptions of the practical mechanics needed to apply well the synthesis craft. The chapters present multiple approaches to problem solving and discuss the strengths and weaknesses of each approach. Readers with a comfortable background in analysis of variance and multiple regression and who have access to a research library should find the chapters accessible. The Handbook authors want to supply working synthesists with the needed expertise to interpret their blueprints, to wield their mortar hoe and trowel.

1.4 STAGES OF RESEARCH SYNTHESIS

The description of the stages of research synthesis presented in table 1.2 provides the conceptual organization of the handbook. In this section, we raise the principal issues associated with each stage. This allows us to briefly introduce the content of each of the chapters that follow.

1.4.1 Problem Formulation

Formulating a problem in research synthesis is constrained by one major factor: primary research on a topic must exist before a synthesis can be conducted. How much research? If the research question is important, it would be interesting to know how much research there is on the problem, even if the answer was none at all. The methods of meta-analysis can be applied to literatures containing as few as two hypothesis tests. Under certain circumstances (for instance, researchers synthesizing a pair of replicate studies from their own lab) the use of meta-analysis in this fashion might be legitimate. Yet, most scientists would argue that the benefits of such a synthesis would be limited (and its chances for publication even more limited).

A more general answer to the "how much research" question is that it varies depending on a number of characteristics of the problem. All else being equal, conceptually broad topics would seem to profit from a synthesis only after the accumulation of a more varied and larger number of studies than would narrowly defined topics. Similarly, literatures that contain diverse types of operations also would seem to require a relatively large number of studies before relatively firm conclusions could be drawn from a synthesis.
Ultimately, the arbiter of whether a synthesis is needed will not be numerical standards, but the fresh insights a synthesis can bring to a field. Indeed, although a meta-analysis cannot be performed without data, many social scientists see value in empty syntheses that point to important gaps in our knowledge. When done properly, empty syntheses should have proceeded through the stages of research synthesis, including careful problem formulation and a thorough literature search.

Once enough literature on a problem has been collected, the challenge, and promise, of research synthesis becomes evident. The problems that constrain primary researchers (small and homogeneous samples, limited time and money for turning constructs of interest into operations) are less severe for synthesists. They can capitalize on the diversity in methods that has occurred naturally across primary studies. The heterogeneity of methods across studies may permit tests of theoretical hypotheses concerning the moderators and mediators of relations that have never been tested in any single primary study. Conclusions about the population and ecological validity of relations uncovered in primary research may also receive more thorough tests in syntheses.

Part II of the handbook focuses on issues in problem formulation. In chapter 2, Harris Cooper discusses in detail the issues mentioned above. In chapter 3, Larry Hedges looks at the implications of different problem definitions for how study results will be statistically modeled. The major issues involve the populations of people and measurements that are the target of a synthesis's inferences; how broadly the key constructs are defined, especially in terms of whether fixed or random effect models are envisioned; and how choices among models influence the precision of estimates and the statistical power of meta-analytic tests.

1.4.2 Literature Search

The literature search is the stage of research synthesis that is most different from primary research. Still, culling through the literature for relevant studies is not unlike gathering a sample of primary data. The target of a literature search that is part of a synthesis attempting exhaustive coverage would be all the research conducted on the topic of interest.

Cooper found that about half of the reviewers he surveyed claimed that they conducted their search with the intention of identifying all or most of the relevant literature (1987). About three-quarters said they did not stop searching until they felt they had accomplished their goal. About one-third said they stopped their search when they felt their understanding and conclusions about the topic would not be affected by additional material. About one-sixth said they stopped searching when retrieval became unacceptably difficult.

In contrast to the relatively well-defined sampling frames available to primary researchers, literature searchers confront the fact that any single source of primary reports will lead them to only a fraction of the relevant studies, and a biased fraction at that. For example, among the most egalitarian sources of literature are the reference databases, such as PsycINFO, ERIC, and Medline. Still, these broad, non-evaluative systems exclude much of the unpublished literature. Conversely, the least equitable literature searching technique involves accessing one's close colleagues and other researchers with an active interest in the topic area. In spite of the obvious biases, there is no better source of unpublished and recent works.

Further complicating the sampling frame problem for synthesists is the fact that the relative utility and biases associated with any single source will vary as a function of characteristics of the research problem, including for example how long the topic has been the focus of study and whether the topic is interdisciplinary in nature.

These problems imply that research synthesists must carefully consider multiple channels for accessing literature and how the channels they choose complement one another. The three chapters in part III are devoted to helping the synthesist consider and carry out this unique task. In chapter 4, Howard White presents an overview of searching issues from the viewpoint of an information scientist. Both chapter 5, by Jeffrey Reed and Pam Baxter, and chapter 6, by Hannah Rothstein and Sally Hopewell, present in detail the practical considerations of how to scour our brickyards and quarries.

1.4.3 Data Evaluation

Once the synthesists have gathered the relevant literature, they must extract from each document those pieces of information that will help answer the questions that impel research in the field. The problems faced during data coding provide a strong test of the synthesist's competence, thoughtfulness, and ingenuity. The solutions found for these problems will have considerable influence on the contribution of the synthesis. Part IV contains four chapters on data coding and evaluation.
The aspect of coding studies that engenders the most debate involves how synthesists should represent differences in the design and implementation of primary studies. What is meant by quality in evaluating research methods? Should studies be weighted differently if they differ in quality? Should studies be excluded if they contain too many flaws? How does one rate the quality of studies described in incomplete research reports? In chapter 7, Jeff Valentine examines the alternative approaches available to synthesists for representing primary research methodology. Beyond quality judgments, synthesists make decisions about the classes of variables that are of potential interest to them. These can relate to variables that predict outcomes, potential mediators of effects, and the differences in how outcomes are conceptualized (and, therefore, measured). If a synthesist chooses not to code a particular feature of studies, then it cannot be considered in the analysis of results. General guidelines for what information should be extracted from primary research reports are difficult to develop, beyond those that are very abstract. Instead, direction will come from the issues that arise in the particular literature, coupled with the synthesists' personal insights into the topic. Still, there are commonalities that emerge in how these decisions are forged. Mark Lipsey, in chapter 8 ("Identifying Variables and Analysis Opportunities"), and David Wilson, in chapter 9 ("Systematic Coding"), present complementing templates for what generally should be included in coding frames.

Once decisions on what to code have been made, synthesists need to consider how to carry out the coding of the literature and how to assess the trustworthiness with which the coding frame is implemented. There are numerous indexes of coder reliability available, each with different strengths and weaknesses. Chapter 10, by Robert Orwin and Jack Vevea, describes strategies for reducing the amount of error that enters a synthesis during the coding of the literature's features. Orwin and Vevea's description of reliability assessment focuses on three major approaches: measuring interrater agreement; obtaining confidence ratings from coders; and conducting sensitivity analysis, or using multiple coding strategies to determine whether a synthesis's conclusions are robust across different coding decision rules.
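To make the first of those approaches concrete, the sketch below computes two widely used agreement indexes, simple percent agreement and Cohen's kappa (which corrects agreement for chance), from two coders' judgments of a single study feature. The coded item and all data are hypothetical; this is a minimal illustration of the general technique, not the specific procedures of chapter 10.

```python
from collections import Counter

def percent_agreement(codes_a, codes_b):
    """Proportion of studies on which two coders assigned the same code."""
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)

def cohens_kappa(codes_a, codes_b):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(codes_a)
    p_o = percent_agreement(codes_a, codes_b)
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    # Expected chance agreement from each coder's marginal code frequencies.
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n)
              for c in set(codes_a) | set(codes_b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes for a "random assignment used?" item across ten studies.
coder_1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes", "no", "yes"]
coder_2 = ["yes", "no",  "no", "yes", "no", "yes", "yes", "yes", "no", "yes"]

print(percent_agreement(coder_1, coder_2))  # 0.8
print(round(cohens_kappa(coder_1, coder_2), 2))  # 0.58
```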
1.4.4 Data Analysis

As our brief history of research synthesis revealed, the analysis of accumulated research outcomes has been a separate area of specialization within statistics. Three decades ago, the actual mechanics of integrating research usually involved intuitive processes taking place inside the heads of the synthesists. Meta-analysis made these processes public and based them on explicit, shared, statistical assumptions (however well these assumptions were met). We would not accept as valid a primary researcher's conclusion if it were substantiated solely by the statement "I looked at the treatment and control scores and I think the treated group did better." We would demand some sort of statistical test (for example, a t-test) to back up the claim. Likewise, we no longer accept "I examined the study outcomes and I think the treatment is effective" as sufficient warrant for the conclusion of a research synthesis.

Part V contains chapters that cover the components of synthesis dealing with combining study results. Brad Bushman and Morgan Wang, in chapter 11, describe procedures based on viewing the outcomes of studies dichotomously, as either supporting or refuting a hypothesis. The next two chapters cover methods for estimating the magnitude of an effect, commonly called the effect size. Cohen defined an effect size as "the degree to which the phenomenon is present in the population, or the degree to which the null hypothesis is false" (1988, 9-10). In chapter 12, Michael Borenstein describes the parametric measures of effect size. In chapter 13, Joseph Fleiss and Jesse Berlin cover the measures of effect size for categorical data, paying special attention to the estimate of effect based on rates and ratios.
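As a hedged illustration of these two effect-size families, the sketch below computes a standardized mean difference (Cohen's d with a pooled standard deviation) for a continuous outcome, and an odds ratio for a fourfold table of frequencies. All numbers are invented, and these are only the most basic estimators of the kinds those chapters treat in detail.

```python
import math

def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference using a pooled standard deviation."""
    pooled_var = ((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / math.sqrt(pooled_var)

def odds_ratio(a, b, c, d):
    """Effect size for a 2 x 2 table:
    a, b = treatment successes/failures; c, d = control successes/failures."""
    return (a * d) / (b * c)

# Hypothetical teacher-expectation study: IQ scores in two groups of 40.
print(round(cohens_d(105.0, 100.0, 15.0, 14.0, 40, 40), 2))  # 0.34

# Hypothetical categorical outcome: 30/10 vs. 20/20 passing a course.
print(odds_ratio(30, 10, 20, 20))  # 3.0
```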
To most research synthesists, the search for influences on study results is the most exciting and rewarding part of the process. The four chapters of part VI all deal with techniques for analyzing whether and why there are differences in the outcomes of studies. As an analog to analysis of variance or multiple regression procedures, effect sizes can be viewed as dependent (or criterion) variables and the features of study designs as independent (or predictor) variables. Because effect-size estimates do not all have the same sampling uncertainty, however, they cannot be combined using traditional inferential statistics. In chapter 14, William Shadish and Keith Haddock describe procedures for averaging effects and generating confidence intervals around these population estimates. In chapter 15, Spyros Konstantopoulos and Larry Hedges discuss analysis strategies when fixed-effects models are used. In chapter 16, Stephen Raudenbush describes the strategies appropriate for random-effects models.
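A minimal sketch of the core combination step, assuming a fixed-effect model: each effect size is weighted by the inverse of its sampling variance, and a confidence interval is placed around the weighted mean. The effect sizes and variances below are fabricated for illustration.

```python
import math

def fixed_effect_summary(effects, variances, z_crit=1.96):
    """Inverse-variance weighted mean and an approximate 95% CI."""
    weights = [1.0 / v for v in variances]
    mean = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))  # standard error of the weighted mean
    return mean, (mean - z_crit * se, mean + z_crit * se)

# Five hypothetical study effect sizes with their sampling variances.
d_values = [0.10, 0.30, 0.35, 0.20, 0.45]
v_values = [0.04, 0.02, 0.05, 0.01, 0.03]

mean, ci = fixed_effect_summary(d_values, v_values)
print(round(mean, 2), tuple(round(x, 2) for x in ci))  # 0.26 (0.13, 0.39)
```

Under a random-effects model, an estimate of the between-study variance component would be added to each sampling variance before the weights are formed; the weighting logic is otherwise the same.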
Further, effect-size estimates may be affected by factors that attenuate their magnitudes. These may include, for example, a lack of reliability in the measurement instruments or restrictions in the range of values in the subject sample. Attenuating biases may be estimated and corrected using the procedures that Frank Schmidt, Huy Le, and In-Sue Oh describe in chapter 17.
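One classic adjustment of this kind, the correction for attenuation due to measurement unreliability, divides an observed correlation by the square root of the product of the two measures' reliabilities. The sketch below uses invented numbers and shows only this single correction; chapter 17 covers the fuller set of artifact adjustments.

```python
import math

def correct_for_attenuation(r_observed, rel_x, rel_y):
    """Estimate the correlation between true scores, given the
    reliabilities of the two observed measures."""
    return r_observed / math.sqrt(rel_x * rel_y)

# Hypothetical observed r = 0.30 with reliabilities of 0.80 and 0.70.
print(round(correct_for_attenuation(0.30, 0.80, 0.70), 2))  # 0.40
```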
Part VII includes three chapters that focus on some special issues in the statistical treatment of research outcomes. In chapter 18, Larry Hedges discusses combining special types of studies, in particular the increasing use of cluster- or group-randomized trials in social research and the implication of this for the definition of effect sizes, their estimation, and particularly the computation of standard errors of the estimates. Chapter 19 tackles the vexing problem of what to do when the same study (or laboratory, or researcher, if one wishes to push the point) provides more than one effect size to a meta-analysis. Because estimates from the same study may not be independent sources of information about the underlying population value, their contributions to any overall estimates must be adjusted accordingly. In chapter 19, Leon Gleser and Ingram Olkin describe both simple and more complex strategies for dealing with dependencies among effect sizes.
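The simplest of those strategies can be sketched as follows: collapse the multiple effect sizes from each study into one average, so that every study contributes a single estimate. The data are hypothetical, and this crude approach deliberately ignores the covariances among within-study estimates that the more complex strategies model explicitly.

```python
def one_effect_per_study(effects_by_study):
    """Average within studies so each study contributes one effect size.
    Ignores the covariance structure among within-study estimates;
    multivariate strategies model that structure directly."""
    return {study: sum(ds) / len(ds) for study, ds in effects_by_study.items()}

# Hypothetical studies reporting one to three effect sizes each.
effects = {"Study A": [0.20, 0.30],
           "Study B": [0.50],
           "Study C": [0.10, 0.25, 0.40]}
print(one_effect_per_study(effects))
# {'Study A': 0.25, 'Study B': 0.5, 'Study C': 0.25}
```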
Sometimes, the focus of a research synthesis will not be a simple bivariate relation but a model involving numerous variables and their interrelations. The model could require analysis of more than one simple relation and/or complex partial or second-order relations. In chapter 20, Betsy Becker looks at what models in meta-analysis are, why they should be used, and how they affect meta-analytic data analysis.
1.4.5 Interpretation of Results

Estimating and averaging effect sizes and searching for moderators of their variability is how the interpretation of cumulative study results begins. However, it must be followed by other procedures that help the synthesists properly interpret their outcomes. Proper interpretation of the results of a research synthesis requires careful use of declarative statements regarding claims about the evidence, specification of what results warrant each claim, and any appropriate qualifications to claims that need to be made. In part VIII, three important issues in data interpretation are examined.

Even the most trustworthy coding frame cannot solve the problem of missing data. Missing data will arise in every research synthesis. Reporting of study features and results, as now practiced, leaves much to be desired. Although the emergence of research synthesis has led to many salutary changes, most primary researchers still do not think about a research report in terms of the needs of the next user of the data. Also, problems with missing data would still occur even if the needs of the synthesist were the paramount consideration in constructing a report. There is simply no way primary researchers can anticipate the subtle nuances in their methods and results that might be identified as critical to piecing together the literature five, ten, or more years later. In chapter 21, Therese Pigott describes how lost information can bear on the results and interpretation of a research synthesis. She shows how the impact of missing data can be assessed by paying attention to some of its discoverable characteristics. She also describes both simple and complex strategies for handling missing data in research synthesis. Joel Greenhouse and Satish Iyengar, in chapter 22, describe sensitivity analysis in detail, including several diagnostic procedures to assess the sensitivity and robustness of the conclusions of a meta-analysis. Alex Sutton, in chapter 23, examines publication bias and describes procedures for identifying whether data censoring has had an impact on the meta-analytic result. He also details some imputation procedures that can help synthesists estimate what summary effect size estimates might have been in the absence of the selective inclusion of studies.
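One elementary diagnostic in this spirit is a leave-one-out analysis: the summary effect is recomputed with each study removed in turn, and conclusions that swing on a single study are flagged for closer scrutiny. The sketch below uses invented data and a fixed-effect weighted mean; it illustrates the general idea rather than the specific procedures of chapters 22 and 23.

```python
import math

def weighted_mean(effects, variances):
    """Inverse-variance weighted mean of a set of effect sizes."""
    weights = [1.0 / v for v in variances]
    return sum(w * d for w, d in zip(weights, effects)) / sum(weights)

def leave_one_out(effects, variances):
    """Summary effect recomputed with each study omitted in turn."""
    results = []
    for i in range(len(effects)):
        es = effects[:i] + effects[i + 1:]
        vs = variances[:i] + variances[i + 1:]
        results.append(weighted_mean(es, vs))
    return results

d_values = [0.10, 0.30, 0.35, 0.20, 0.90]  # hypothetical; note the outlier
v_values = [0.04, 0.02, 0.05, 0.01, 0.03]

print(round(weighted_mean(d_values, v_values), 2))           # with all studies
print([round(m, 2) for m in leave_one_out(d_values, v_values)])
# A large drop when the fifth study is omitted would flag its influence.
```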
The chapters in part IX bring us full circle, returning us to the nature of the problem that led to undertaking the research synthesis in the first place. Collectively, these chapters attempt to answer the question as to what can be learned from research synthesis. In chapter 24, Wendy Wood and Alice Eagly examine how theories can be tested in research synthesis. In chapter 25, David Cordray and Paul Morphy look at research synthesis for policy analysis. These chapters are meant to help synthesists think about the substantive questions for which the readers of their syntheses will be seeking answers.

1.4.6 Public Presentation

Presenting the background, methods, results, and meaning of a research synthesis's findings provides the final challenges to the synthesist's skill and intellect. In chapter 26, the first of part X, Geoffrey Borman and Jeffrey Grigg address the visual presentation of research synthesis results. Charts, graphs, and tables, they explain, should be used to summarize the numbers in a meta-analysis, along with a careful intertwining of narrative explication to contextualize the summaries. Proper use of interpretive devices can make a complex set of results easily accessible to the intelligent reader. Likewise, inadequate use of these aids can render incomprehensible even the simplest conclusions. Chapter 27 describes the reporting format for research synthesis. As with the coding frame, there is no simple reporting scheme that fits all syntheses. However, certain commonalities do exist and Mike Clarke discusses them. Not too surprisingly, the organization that emerges bears considerable resemblance to that of a primary research report although, also obviously, the content differs dramatically.
1.5 CHAPTERS IN PERSPECTIVE

In part XI, the final two chapters of the handbook are meant to bring the detailed drawings of the preceding chapters onto a single blueprint. In chapter 28, Georg Matt and Thomas Cook present an expanded analysis of threats to the validity of research syntheses, provide an overall appraisal of how inferences from syntheses may be restricted or faulty, and bring together many of the concerns expressed throughout the book.

Finally, chapter 29 also attempts to draw this material together. In it, Harris Cooper and Larry Hedges look at some of the potentials and limitations of research synthesis. Special attention is paid to possible future developments in synthesis methodology, the feasibility and expense associated with conducting a sound research synthesis, and a broad-based definition of what makes a research synthesis good or bad.

No secret will be revealed by stating our conclusion in advance. If procedures for the synthesis of research are held to standards of objectivity, systematicity, and rigor, then our knowledge edifice will be made of bricks and mortar. If not, it will be a house of cards.

1.6 REFERENCES

Bero, Lisa, and Drummond Rennie. 1995. "The Cochrane Collaboration: Preparing, Maintaining, and Disseminating Systematic Reviews of the Effects of Health Care." Journal of the American Medical Association 274(24): 1935-38.

Birge, Raymond T. 1932. "The Calculation of Error by the Method of Least Squares." Physical Review 40(2): 207-27.

Campbell, Donald T., and Julian C. Stanley. 1966. Experimental and Quasi-Experimental Designs for Research. Chicago: Rand McNally.

Chalmers, Iain. 1993. "The Cochrane Collaboration: Preparing, Maintaining and Disseminating Systematic Reviews of the Effects of Health Care." Annals of the New York Academy of Sciences 703: 156-63.

Chalmers, Iain, Larry V. Hedges, and Harris Cooper. 2002. "A Brief History of Research Synthesis." Evaluation and the Health Professions 25(1): 12-37.

Cochran, William G. 1937. "Problems Arising in the Analysis of a Series of Similar Experiments." Journal of the Royal Statistical Society 4(Supplement): 102-18.

Cochran, William G. 1954. "The Combination of Estimates from Different Experiments." Biometrics 10(1, March): 101-29.

Cohen, Jacob. 1988. Statistical Power Analysis for the Behavioral Sciences. Hillsdale, N.J.: Lawrence Erlbaum.

Cook, Thomas D., Harris M. Cooper, David S. Cordray, Heidi Hartmann, Larry V. Hedges, Richard J. Light, Thomas Louis, and Frederick Mosteller. 1992. Meta-Analysis for Explanation: A Casebook. New York: Russell Sage Foundation.

Cooper, Harris M. 1982. "Scientific Guidelines for Conducting Integrative Research Reviews." Review of Educational Research 52(2, Summer): 291-302.

Cooper, Harris M. 1987. "Literature Searching Strategies of Integrative Research Reviewers: A First Survey." Knowledge: Creation, Diffusion, Utilization 8: 372-83.

Cooper, Harris M. 1988. "Organizing Knowledge Synthesis: A Taxonomy of Literature Reviews." Knowledge in Society 1: 104-26.

Cooper, Harris M. 2003. "Editorial." Psychological Bulletin 129(1): 3-9.

Cooper, Harris M. 2007. Evaluating and Interpreting Research Syntheses in Adult Learning and Literacy. Boston, Mass.: National College Transition Network, New England Literacy Resource Center/World Education, Inc.

Cooper, Harris M. 2009. Research Synthesis and Meta-Analysis: A Step-by-Step Approach. Thousand Oaks, Calif.: Sage Publications.

Feldman, Kenneth A. 1971. "Using the Work of Others: Some Observations on Reviewing and Integrating." Sociology of Education 4(1, Winter): 86-102.

Fisher, Ronald A. 1932. Statistical Methods for Research Workers, 4th ed. London: Oliver & Boyd.

Glass, Gene V. 1976. "Primary, Secondary, and Meta-Analysis." Educational Researcher 5(10): 3-8.

Glass, Gene V., and Mary L. Smith. 1978. Meta-Analysis of Research on the Relationship of Class Size and Achievement. San Francisco: Far West Laboratory for Educational Research and Development.

Glass, Gene V., Barry McGaw, and Mary L. Smith. 1981. Meta-Analysis in Social Research. Beverly Hills, Calif.: Sage Publications.
Hedges, Larry V. 1987. "How Hard Is Hard Science, How Soft Is Soft Science?" American Psychologist 42(5): 443-55.

Hedges, Larry V., and Ingram Olkin. 1985. Statistical Methods for Meta-Analysis. Orlando, Fla.: Academic Press.

Hunt, Morton. 1997. How Science Takes Stock: The Story of Meta-Analysis. New York: Russell Sage Foundation.

Hunter, John E., and Frank L. Schmidt. 2004. Methods of Meta-Analysis: Correcting Error and Bias in Research Findings, 2nd ed. Thousand Oaks, Calif.: Sage Publications.

Hunter, John E., Frank L. Schmidt, and Greg B. Jackson. 1982. Meta-Analysis: Cumulating Research Findings Across Studies. Beverly Hills, Calif.: Sage Publications.

Jackson, Gregg B. 1980. "Methods for Integrative Reviews." Review of Educational Research 50: 438-60.

Johnson, Blair T. 1993. DSTAT: Software for the Meta-Analytic Review of Research. Hillsdale, N.J.: Lawrence Erlbaum.

Light, Richard J., and David B. Pillemer. 1984. Summing Up: The Science of Reviewing Research. Cambridge, Mass.: Harvard University Press.

Light, Richard J., and Paul V. Smith. 1971. "Accumulating Evidence: Procedures for Resolving Contradictions Among Research Studies." Harvard Educational Review 41(4): 429-71.

Lipsey, Mark W., and David B. Wilson. 2001. Practical Meta-Analysis. Thousand Oaks, Calif.: Sage Publications.

Manten, A. A. 1973. "Scientific Literature Review." Scholarly Publishing 5: 75-89.

Mosteller, Frederick, and Robert Bush. 1954. "Selected Quantitative Techniques." In Handbook of Social Psychology, vol. 1, Theory and Method, edited by Gardner Lindzey. Cambridge, Mass.: Addison-Wesley.

Mullen, Brian. 1989. Advanced BASIC Meta-Analysis. Hillsdale, N.J.: Lawrence Erlbaum.

Olkin, Ingram. 1990. "History and Goals." In The Future of Meta-Analysis, edited by Kenneth W. Wachter and Miron L. Straf. New York: Russell Sage Foundation.

Pearson, Karl. 1904. "Report on Certain Enteric Fever Inoculation Statistics." British Medical Journal 3: 1243-6.

Petticrew, Mark, and Helen Roberts. 2006. Systematic Reviews in the Social Sciences: A Practical Guide. Oxford: Blackwell Publishing.

Price, Derek J. de Solla. 1965. "Networks of Scientific Papers." Science 149(3683): 56-64.

Rosenthal, Robert. 1984. Meta-Analytic Procedures for Social Research. Beverly Hills, Calif.: Sage Publications.

Rosenthal, Robert, and Donald B. Rubin. 1978. "Interpersonal Expectancy Effects: The First 345 Studies." The Behavioral and Brain Sciences 3: 377-86.

Schmidt, Frank L., and John E. Hunter. 1977. "Development of a General Solution to the Problem of Validity Generalization." Journal of Applied Psychology 62: 529-40.

Shadish, William R., Thomas D. Cook, and Donald T. Campbell. 2002. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston, Mass.: Houghton Mifflin.

Smith, Mary L., and Gene V. Glass. 1977. "Meta-Analysis of Psychotherapy Outcome Studies." American Psychologist 32(9): 752-60.

Taveggia, Thomas C. 1974. "Resolving Research Controversy Through Empirical Cumulation: Toward Reliable Sociological Knowledge." Sociological Methods & Research 2(4): 395-407.

Tippett, Leonard Henry Caleb. 1931. The Methods of Statistics. London: Williams & Norgate.

Wachter, Kenneth W., and Miron L. Straf, eds. 1990. The Future of Meta-Analysis. New York: Russell Sage Foundation.

Wolf, Frederic M. 1986. Meta-Analysis: Quantitative Methods for Research Synthesis. Beverly Hills, Calif.: Sage Publications.

Yates, Frank, and William G. Cochran. 1938. "The Analysis of Groups of Experiments." Journal of Agricultural Science 28: 556-80.
2

HYPOTHESES AND PROBLEMS IN RESEARCH SYNTHESIS

HARRIS COOPER
Duke University

CONTENTS

2.1 Introduction
2.2 Defining Variables
  2.2.1 Conceptual and Operational Definitions
    2.2.1.1 Distinctions Between Operational Definitions in Primary Research and Research Synthesis
  2.2.2 The Fit Between Concepts and Operations
    2.2.2.1 Broadening and Narrowing of Concepts
    2.2.2.2 Multiple Operations and Concept to Operation Fit
    2.2.2.3 The Use of Operations Meant to Represent Other Concepts
  2.2.3 Effects of Multiple Operations on Synthesis Outcomes
  2.2.4 Variable Definitions and the Literature Search
2.3 Types of Problems and Hypotheses
  2.3.1 Three Questions About Research Problems and Hypotheses
    2.3.1.1 Descriptions, Associations and Explanations
    2.3.1.2 Examining Change Within or Variation Across Units
  2.3.2 Problems and Hypotheses in Research Synthesis
    2.3.2.1 Synthesis of Descriptive Research
    2.3.2.2 Synthesis of Group-Level Associations and Causal Relationships
    2.3.2.3 Synthesis of Studies of Change in Units Across Time
  2.3.3 Problems and Hypotheses Involving Interactions
2.4 Study-Generated and Synthesis-Generated Evidence
  2.4.1 Study-Generated Evidence, Two-Variable Relationships
  2.4.2 Study-Generated Evidence, Three-Variable Interactions
    2.4.2.1 Integrating Interaction Results Across Studies
  2.4.3 Synthesis-Generated Evidence, Two-Variable Relationships
  2.4.4 Synthesis-Generated Evidence, Three-Variable Interactions
  2.4.5 High and Low Inference Codes
  2.4.6 Value of Synthesis-Generated Evidence
2.5 Conclusion
2.6 Notes
2.7 References

2.1 INTRODUCTION

Texts on research methods often list sources of research ideas (see, for example, Cherulnik 2001). Sometimes ideas for research come from personal experiences, sometimes from pressing social issues. Sometimes a researcher wishes to test a theory meant to help understand the roots of human behavior. Yet other times researchers find topics by reading scholarly research.

Still, sources of ideas for research are so plentiful that the universe of possibilities seems limitless. Perhaps we must be satisfied with Karl Mannheim's (1936) suggestion that the ideas researchers pursue are rooted in their social and material relations, from what the researcher deems important (Coser 1971). Then at least, we have a course of action for studying their determinants.

I will not begin my exploration of problems and hypotheses in research synthesis by examining the determinants of choice. Nor will I describe the existential bounds of the researcher. Instead, I start with a simple statement, that the problems and hypotheses in research syntheses generally are drawn from those that already have had data collected on them in primary research. Then, my task becomes much more manageable. In this chapter I examine the characteristics of research problems and hypotheses that make them empirically testable and are similar and different for primary research and research synthesis.

That syntheses for the most part are in fact tied to only those problems that have previously generated data does not mean that research synthesis is not a creative exercise. The process of gathering and integrating research often requires the creation of explanatory models to help make sense of related studies that produced incommensurate data. These schemes can be novel, having never appeared in previous theorizing or research. The cumulative results of studies are much more complex than the results of any single study. Discovering why two studies that appear to be direct replications of one another have produced conflicting results is a deliberative challenge for any research synthesist.

2.2 DEFINING VARIABLES

In its most basic form, a research problem includes a clear statement of what variables are to be related to one another and an implication of how the relationship can be tested empirically (Kerlinger and Lee 1999). The rationale for a research problem can be that some practical consideration suggests any discovered relation might be important. Or, the problem can contain a prediction about a particular link between the variables, based on theory or previous observation. This type of prediction is called a hypothesis.

Either rationale can be used for undertaking primary research or research synthesis. For example, we might want to test the hypothesis that teachers' expectations concerning how a student's intelligence will change over the course of a school year will cause the student's intelligence to change in the expected direction. This hypothesis might be based on a well-specified model of how interpersonal communications change as a function of the teacher's beliefs. Another example might be that we are interested in how students' perceptions of their instructors in high school and college correlate with achievement in the course. Here, we may know the problem is important, but may have no precise theory that leads to a hypothesis about whether and how perceptions of the instructor are related to achievement in the course. This research problem is more exploratory in nature.

2.2.1 Conceptual and Operational Definitions

The variables involved in social science research must be defined in two ways. First, each variable must be given a conceptual or theoretical definition. This describes qualities of the variable that are independent of time and space but which can be used to distinguish events that are and are not relevant to the concept (Shoemaker, Tankard, and Lasorsa 2004). For instance, a conceptual definition of intelligence might be "a capacity for learning, reasoning, understanding, and similar forms of mental activity" (Random House 2001, 990).
Conceptual definitions can differ in breadth, or in the number of events to which they refer. Thus, if we define achievement as "something accomplished esp. by superior ability, special effort, great courage, etc." (Random House 2001, 15), the concept is broader than if we confine the domain of achievement to academic tasks, or activities related to school performance in cognitive domains. The broader definition of achievement would include goals reached in physical, economic, and social spheres, as well as academic ones. So, if we are interested in the relationship between student ratings of instructors and achievement writ large we would include ratings taken in gym, health, and drama classes. If we are interested in academic achievement only, these types of classes would fall outside our conceptual definition. The need for well-specified conceptual definitions is no different in research synthesis than it is in primary research. Both primary researchers and research synthesists must clearly specify their conceptual definitions.

To relate concepts to concrete events, the variables in empirical research must also be operationally defined. An operational definition is a description of the characteristics of observable events that are used to determine whether the event represents an occurrence of the conceptual variable. So, for example, a person might be deemed physically fit (a conceptual variable) if they can run a quarter-mile in a specified time (the operational definition of the conceptual variable). Put differently, a concept is operationally defined by the procedures used to produce and measure it. Again, both primary researchers and research synthesists must specify the operations included in their conceptual definitions.

An operational definition of the concept intelligence might first focus on scores from standardized tests meant to measure reasoning ability. This definition might be specified further to include only tests that result in Intelligence Quotient (IQ) scores, for example, the Stanford-Binet and the Wechsler tests. Or, the operational definition might be broadened to include the Scholastic Aptitude Test (SAT), the Graduate Record Exam (GRE), or the Miller Analogies Test (MAT). This last group might be seen as broadening the definition of intelligence because these tests are relatively more influenced by the test-taker's knowledge than their native ability to reason and understand. Operations can also include the ingredients of a treatment. For example, an experimenter might devise an intervention for teachers meant to raise their expectations for students. The intervention might include information on how malleable intelligence might be and techniques for recognizing when students have experienced a change in mental acuity.

2.2.1.1 Distinctions Between Operational Definitions in Primary Research and Research Synthesis. The first critical distinction between operational definitions in primary research and research synthesis is that primary researchers cannot start data collection until the variables of interest have been given a precise operational definition, an empirical reality. Otherwise, the researcher does not know what data to collect. A primary researcher studying intelligence must pick the measure or measures of intelligence they wish to study, ones that fit nicely into their conceptual definition, before the study begins.

In contrast, research synthesists need not be quite so conceptually or operationally precise, at least not initially. The literature search for a research synthesis can begin with only a broad, and sometimes fuzzy, conceptual definition and a few known operations that measure the construct. The search for studies then might lead the synthesists not only to studies that define the construct in the manners they specified, but also to research in which the same construct was studied with different operational definitions. Then, the concept and associated operations can grow broader or narrower (and hopefully more precise) as the synthesists grow more familiar with how the construct has been defined and measured in the extant research. Synthesists have the comparative luxury of being able to evaluate the conceptual relevance of different operations as they grow more familiar with the literature. They can even modify the conceptual definition as they encounter related alternative concepts and operations in the literature. This can be both a fascinating and anxiety-arousing task. Does intelligence include what is measured by the SAT? Should achievement include performance in gym class?

I do not want to give the impression that research synthesis permits fuzzy thinking. Typically, research synthesists begin with a clear idea of the concepts of interest and with some a priori specification of related empirical realizations. However, during a literature search it is not unusual to come across definitions that raise issues about concept boundaries they may not have considered or operations that they did not know existed but seem relevant to the construct being studied. Some of these new considerations may lead to a tweaking of the conceptual definition.

In sum, primary researchers need to know exactly what events will constitute the domain to be sampled before beginning data collection. Research synthesists may discover unanticipated elements of the domain along the way.
Another distinction between the two types of inquiry is that primary studies will typically involve only one, and sometimes a few, operational definitions of the same construct. In contrast, research syntheses usually include many empirical realizations. For example, primary researchers may pick a single measure of intelligence, if only because the time and economic constraints of administering multiple measures of intelligence will be prohibitive. A few measures of academic achievement might be available to primary researchers, if they can access archived data (for example, grade point averages, achievement test scores). On the other hand, research synthesists can uncover a wide array of operationalizations of the same construct within the same research area.

2.2.2 The Fit Between Concepts and Operations

The variety of operationalizations that a research synthesist may uncover in the literature can be both a curse and a blessing. The curse concerns the fit between concepts and operations.

2.2.2.1 Broadening and Narrowing Concepts. Synthesists may begin a literature search with broad conceptual definitions. However, they may discover that the operations used in previous relevant research have been confined to a narrower conceptualization. For instance, if we are gathering and integrating studies about how student perceptions of instructors in high school and college correlate with course achievement we might discover that all, or nearly all, past research has measured only liking the instructor, and not the instructor's perceived expertise in the subject matter or ability to communicate. Then, it might not be appropriate for us to label the conceptual variable as student perceptions of instructors. When such a circumstance arises, we need to narrow our conceptual definition to correspond better with the existing operations, such as liking the instructor. Otherwise, our conclusions might appear to apply more generally, to more operations, than warranted by the evidence.

The opposite problem can also confront synthesists; that is, they start with narrow concepts but then find measures in the literature that could support broader definitions. For example, this might occur if we find many studies of the effects of teachers' expectations on SAT scores when we initially intended to confine our operational definition of intelligence to IQ scores. We would then face the choice of either broadening the allowable measures of intelligence, and therefore perhaps the conceptual definition, or excluding many studies that others might deem relevant.

Thus, it is not unusual for a dialogue to take place between research synthesists and the research literature. The dialogue can result in an iterative redefining of the conceptual variables that are the focus of the synthesis as well as of the included empirical realizations. As the literature search proceeds, it is extremely important that synthesists take care to reevaluate the correspondence between the breadth of their concepts and the variation in ideas and operations that primary researchers have used to define them.

Before leaving the issue of concept-to-operation fit, it is important to make one more point. The dialogue between synthesists and research literatures is one that must proceed on the basis of an attempt to provide precise conceptual definitions that lead to clear linkages between concepts and operations in a manner that will allow meaningful interpretation of results by the audience of the synthesis. What this means is that synthesists must never let the results of the primary research dictate what operations will be used to define a concept. For example, it would be inappropriate for us to decide to operationally define intelligence as cognitive abilities measured by IQ tests and leave out results involving the Miller Analogies Test because the IQ tests revealed results consistent with our hypothesis and the MAT did not. The relevance of the MAT must be based on how well its contents correspond to our conceptual definition, and that alone. If this test of intelligence produces results different from other tests, then the reason for the discrepancy should be explored as part of the research synthesis, not used as a rationale for the exclusion of MAT studies.

2.2.2.2 Multiple Operations and Concept to Operation Fit. Multiple operations also create opportunities for research synthesists. Eugene Webb and his colleagues presented strong arguments for the value of multiple operations (2000). They define multiple operationism as the use of many measures that share a conceptual definition but that have different patterns of irrelevant components. Multiple operationism has positive consequences because
    once a proposition has been confirmed by two or more independent measurement processes, the uncertainty of its interpretation is greatly reduced. . . . If a proposition can survive the onslaught of a series of imperfect measures, with all their irrelevant error, confidence should be placed in it. Of course, this confidence is increased by minimizing error in each instrument and by a reasonable belief in the different and divergent effects of the sources of error. (Webb et al. 2000, 35)

So, the existence of a variety of operations in research literatures offers the potential benefit of stronger inferences if it allows the synthesists to rule out irrelevant sources of influence. However, multiple operations do not ensure concept-to-operation correspondence if all or most of the operations lack a minimal correspondence to the concept. For example, studies of teacher expectation effects may manipulate expectations in a variety of ways (by providing teachers with information on students in the form of test scores, or by reports from psychologists) and may measure intelligence in a variety of ways. This means that patterns of irrelevant variation in operations are different for different manipulations of expectations and measures of intelligence. Teachers might find test scores more credible than psychologists' assessments. If results are consistent regardless of the operations used then the finding that expectations influence intelligence has passed a test of robustness. However, if the results differ depending on operations then qualifications to the finding are in order. For example, if we find that teacher expectations influence SAT scores but not IQ scores we might postulate that perhaps expectations affect students' knowledge acquisition but not mental acuity. This inference is less strong because it requires us to develop a post hoc explanation for our findings.

2.2.2.3 The Use of Operations Meant to Represent Other Concepts. Literature searches of reference databases typically begin with keywords that represent the conceptual definitions of the variables of interest. Thus we are much more likely to begin a search by crossing the keyword "teacher expectations" with "intelligence" than we are by crossing "Test of Intellectual Growth Potential" (a made-up test that might have been used to manipulate teacher expectations) with "Stanford-Binet" and "Wechsler." It is the abstractness of the keywords that allows unexpected operationalizations to get caught in the search net.

By extending the keywords even further, searchers may uncover research that has been cast in conceptual frameworks different from their own but which include manipulations or measures relevant to the concepts they have in mind. For instance, several concepts similar to interpersonal expectancy effects appear in the research literature (one is behavior confirmation) but come from different disciplinary traditions. Even though these concepts are labeled differently the operations used in the studies they generate may be very similar, and certainly relevant to one another. When relevant operations associated with different constructs are identified, they should be included in the synthesis. In fact, similar operations generated by different disciplinary traditions often do not share other features of research design (for example, they may draw participants from different populations) and therefore can be used to demonstrate the robustness of results across methodological irrelevancies. Synthesists can improve their chances of finding broadly relevant studies by searching in reference databases for multiple disciplines and using the database thesauri to identify related, broader, and narrower terms.

2.2.3 Effects of Multiple Operations on Synthesis Outcomes

Multiple operations do more than introduce the potential for more robust inferences about relationships between conceptual variables. They are an important source of variability in the conclusions of different syntheses meant to address the same topic. A variety of operations can affect synthesis outcomes in two ways.

First, the operational definitions covered in two research syntheses involving the same conceptual variables can be different from one another. Thus, two syntheses claiming to integrate the research on the effects of teacher expectations on intelligence can differ in the way intelligence is operationally defined. One might include the SAT and GRE along with IQ tests and the other only IQ tests. Each may contain some operations excluded by the other, or one definition may completely contain the other.

Second, multiple operations affect outcomes of syntheses by leading to variation in the way study operations are treated after the relevant literature has been identified. Some synthesists pay careful attention to study operations. They identify meticulously the operational distinctions among retrieved studies and test for whether results are robust across these variations. Other synthesists pay less attention to these details.

2.2.4 Variable Definitions and the Literature Search
I have mentioned one way in which the choice of keywords in a reference database search can influence the studies that are uncovered. I pointed out that entering databases with abstract terms as keywords will permit the serendipitous discovery of aligned areas of research, more so than will entering with operational descriptions. Now I want to extend this recommendation further: synthesists should begin the literature search with the broadest conceptual definition in mind. While reading study abstracts, they should use the tag "potentially relevant" as liberally as possible. At later stages, it is okay to exclude particular operations on the basis of their lack of correspondence with the precise conceptual definition. However, in the early stages, when conceptual and operational boundaries may still be a bit fuzzy, the synthesist should err on the overly inclusive side, just as primary researchers collect some data that might not later be used in analysis.

This strategy initially creates more work for synthesists, but it has several long-term benefits. First, it will keep within easy reach operations that on first consideration may be seen as marginal but later jump the boundary from irrelevant to relevant. Reconstituting the search because relevant operations have been missed consumes far more resources than first putting them in the "in" bin but excluding them later.

Second, a good conceptual definition speaks not only to which operations are considered relevant but also to which are irrelevant. By beginning a search with broad terms, the synthesists are forced to struggle with operations that are at the margins of their conceptual definitions. Ultimately, this results in conceptual definitions with more precise boundaries. For example, if we begin a search for studies of teacher expectation effects with the keyword "intelligence" rather than "IQ," we may decide to include research using the MAT, thus broadening the intelligence construct beyond IQ, but exclude the SAT and GRE, because they are too influenced by knowledge, rather than mental acuity. But, by explicitly stating these tests were excluded, readers get a better idea of where the boundary of the definition lies, and can argue otherwise if they choose. Had we started the search with "IQ," we might never have known that others considered the SAT and GRE tests of intelligence in teacher expectation research.

Third, a broad conceptual search allows the synthesis to be carried out with greater operational detail. For example, searching for and including IQ tests only permits us to examine variations in operations such as whether the test was group versus individually administered and timed versus untimed. Searching for, and ultimately including, a larger range of operations permits the synthesists to examine broader conceptual issues when they cluster findings according to operations and look for variation in results. Do teacher expectations produce different effects on IQ tests and the SAT? If so, are there plausible explanations for this? What does it mean for the definition of intelligence? Often, these analyses produce the most interesting results in a research synthesis.

2.3 TYPES OF PROBLEMS AND HYPOTHESES

Once researchers have defined their concepts and operations, they next must define the problem or hypothesis of interest. Does the problem relate simply to the prevalence or level of a phenomenon in a population? Does it suggest that the variables of interest relate as a simple association or as a causal connection? Does the problem or hypothesis refer to a process that operates within an individual unit¹ (how it changes over time) or to general tendencies between groups of units?

The distinctions embodied in these questions have critical implications for the choice of appropriate research designs. I assume the reader is familiar with research designs and their implications for drawing inferences. Still, a brief discussion of the critical features of research problems as they relate to research design is needed in order to understand the role of design variations in research synthesis.

2.3.1 Three Questions About Research Problems and Hypotheses

Researchers need to answer three questions about their research problems or hypotheses to be able to determine the appropriateness of different research designs for addressing them (for a fuller treatment of these questions, see Cooper 2006):

- Should the results of the research be expressed in numbers or narrative?
- Is the problem or hypothesis seeking to uncover a description of an event, an association between events, or an explanation of an event?
- Does the problem or hypothesis seek to understand how a process unfolds within an individual unit over time, or what is associated with or explains variation between units or groups of units?

Because we are focusing on quantitative research synthesis, I will assume the first question was answered "numbers" and provide further comment on the latter two questions only. However, the gathering and integration of narrative, or qualitative, research is an important area of methodology and scholarship. The reader is referred to works by George Noblit and Dwight Hare (1988) and Barbara Paterson and her colleagues (2001) for detailed examinations of approaches to synthesizing qualitative research.
2.3.1.1 Descriptions, Associations, and Explanations. First, a research problem might largely be descriptive. Such problems typically ask what is happening. As in survey research, the desired description could focus on a few specific characteristics of the event or events of interest and broadly sample those few aspects across multiple event occurrences (Fowler 2002). Thus, we might ask a sample of undergraduates, perhaps drawn randomly across campuses and in different subject areas, how much they like their instructors. From this, we might draw an inference about how well-liked instructors generally are on college campuses.

A second type of problem might be what events happen together. Here, researchers ask whether characteristics of events or phenomena co-occur with one another. A correlation coefficient might be calculated to measure the degree of association. For example, our survey of undergraduates' liking for their instructors might also ask the students how well they did in their classes. Then, the two variables could be correlated to answer the question of whether students' liking of instructors is associated with their class grades.
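For readers who want the arithmetic behind such an association made explicit, the sketch below computes a Pearson correlation for paired liking ratings and grades. The data are fabricated solely for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two paired lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: liking of instructor (1-7 scale) and course grade (0-4).
liking = [2, 4, 5, 3, 6, 7, 4, 5]
grades = [2.0, 2.7, 3.3, 2.3, 3.7, 4.0, 3.0, 3.0]
print(round(pearson_r(liking, grades), 2))  # 0.98 for these invented numbers
```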
The third research problem seeks an explanation for an event. In this case, a study is conducted to isolate and draw a direct productive link between one event (the cause) and another (the effect). In our example, we might ask whether increasing the liking students have for their instructors causes them to get higher grades in class.

Three classes of research designs are used most often to help make these causal inferences. I call the first modeling research. It takes simple correlational research a step further by examining co-occurrence in a multivariate framework. For example, if we wish to know whether liking an instructor causes students to get higher grades, we might construct a multiple regression equation or structural equation model that attempts to provide an exhaustive description of a network of relational linkages (Kline 1998). This model would attempt to account for, or rule out, all other co-occurring phenomena that might explain away the relationship of interest. Likely, the model will be incomplete or imperfectly specified, so any causal inferences from modeling research will be tentative, at best. Studies that compare groups of participants to one another (such as men and women or different ethnic groups) and control for other co-occurring phenomena are probably best categorized as modeling studies.

The second class involves quasi-experimental research (Shadish, Cook, and Campbell 2002). Here, the researchers (or some other external agent) control the introduction of an event, often called an intervention in applied research or a manipulation in basic research, but they do not control precisely who may be exposed to it. Instead, the researchers use some statistical control in an attempt to equate the groups receiving and not receiving the intervention. It is difficult to tell how successful the attempt at equating groups has been. For example, we might be able to train college instructors to teach classes using either a friendly or a distant instructional approach. However, we might not be able to assign students to classes so we might match students in each class on prior grades and ignore students who did not have a good match.

Finally, in experimental research both the introduction of the event (for example, instructional approach) and who is exposed to it (the students in each class) are controlled by the researcher (or other external agent). The researcher uses a random procedure to assign students to conditions, leaving the assignment to chance (Boruch 1997). In our example, because the random assignment procedure minimizes average existing differences between classes with friendly and distant instructors we can be most confident that any differences between the classes in their final grades are caused by the instructional approach of the teacher rather than other potential explanations. Of course, there are numerous other aspects of the design that must be attended to for this strong inference to be made, for example, standardizing testing conditions across instructors and ensuring that instructors are unaware of the experimental hypothesis. For our purposes, however, focusing on the characteristics of researcher control over the conditions of the experiment and assignment of participants to conditions captures what is needed to proceed with the discussion.

2.3.1.2 Examining Change Within or Across Units. Some research problems make reference to changes that occur within a unit over time. Other research problems relate to the average differences in and variability of a characteristic between groups of units. This latter problem is best addressed using the designs just discussed. The former problem, the problem of change within a unit, would best be studied using the various forms of time series designs, a research design in which units are tested at different times, typically equal intervals, during the course of a study. For example, we might ask a student in a class how much he or she liked one of his or her instructors after each weekly class meeting.
a class how much he or she liked one of his or her instructors after each weekly class meeting. We might also collect the student's grade on that week's homework assignment. A concomitant time series design would then reveal the association between change in liking and homework grades.

If the research hypothesis seeks to test for a causal relationship between liking and grades, it would call for charting a student's change in grades, perhaps on homework assignments or in-class quizzes, before and after an experimental manipulation changed the way the instructor behaved, perhaps from being distant to being friendly. Or, the change from distant to friendly instruction might involve having the instructor add a different friendly teaching technique with each class session, for example, leaving time for questions during session 2, adding revealing personal information during session 3, adding telling jokes during session 4. Then, we would chart how each additional friendly teaching technique affected a particular student's grades on each session's homework assignment.

Of course, for this type of design to allow strong causal inferences other experimental controls would be needed, for example, the random withdrawal and reintroduction of teaching techniques. For our purposes, however, the key point is that it is possible that how each individual student's grades change as a function of the different teaching techniques could look very different from the moving group average of class-level homework grades. For example, the best description of how instructor friendliness affected grades based on how individual students react might be that, say, each student's performance improved precipitously after the introduction of a specific friendly teaching technique that seemed to appeal to that student while other techniques produced little or no change. However, because different students might have found different approaches appealing, averaged across the students in the class, the friendliness manipulations might be causing gradual improvements in the average homework grade. At the class level of analysis the best description of the results might be that gradual introduction of additional friendly teaching techniques led to a gradual improvement in the class's average homework grades. So, the group-averaged data can be used to explain improvement in grades only at the group level; it is not descriptive of the process happening at the individual level, nor vice versa.

It is important to note that the designation of whether an individual or group effect is of interest depends on the research question being asked. So, the group performance in this example is also a description of how one classroom might perform. Thus, the key to proper inference is that the unit of analysis at which the data from the study are analyzed corresponds to the unit of interest in the problem or hypothesis motivating the primary research or research synthesis. If not, an erroneous conclusion can be drawn.

2.3.2 Problems and Hypotheses in Research Synthesis

2.3.2.1 Synthesis of Descriptive Research When a description of the quantitative frequency or level of an event is the focus of a research synthesis, it is possible to integrate evidence across studies by conducting what Robert Rosenthal called "aggregate analysis" (1991). Floyd Fowler suggests that two types of descriptive questions are most appropriately answered using quantitative approaches (2002). The first involves estimating the frequency of an event's occurrence. For example, we might want to know how many college students receive A grades in their classes. The second descriptive question involves collecting information about attitudes, opinions, or perceptions. For example, our synthesis concerning liking of instructors might be used to answer how well on average college students like their instructors.

Gathering and integrating research on the question involving frequency of grades would lead us to collect from each relevant study the number of students receiving A grades and the total number of students in each study. The question regarding opinions would lead us to collect the average response of students on a question about liking their instructor.

The primary objective of the studies from which these data might be collected might not have been to answer our descriptive questions. This evidence could have been reported as part of a study examining the two-variable association between liking of the instructor and grades. However, it is not unusual for such studies to report, along with the correlation of interest, the distribution of grades in the classes under study and the average liking of the instructors. This might be done because the primary researchers wanted to examine the data for restrictions in range (Was there sufficient variation in grades and liking to permit a strong test of their association?) or to describe their sample for purposes of assessing the generality of their findings (How were the students in this study doing in school compared to the more general population of interest?).
The difficulty of aggregating descriptive statistics across studies depends on whether frequencies or attitudes, opinions, and perceptions are at issue. With regard to frequencies, as long as the event of interest is defined similarly across studies, it is simple to collect the frequencies, add them together, and, typically, report them as a proportion of all events. But there are still two complications.
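Before turning to those complications, the pooling itself is only bookkeeping. A minimal Python sketch, with invented per-study counts, might look like this:

```python
# Hypothetical per-study counts: (students receiving an A, total students).
studies = [(12, 50), (30, 90), (8, 40), (22, 60)]

a_count = sum(a for a, n in studies)
total = sum(n for a, n in studies)

# The pooled frequency, reported as a proportion of all events.
print(f"{a_count} of {total} students received an A ({a_count / total:.1%})")
```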
First, by using the proviso "defined similarly" I am making the assumption that, for example, it is the act of getting a grade of A that interests us, not that getting an A means the same thing in every class used in every study. If instructors use different grading schemes, getting an A might mean very different things in different classes and different studies. Of course, if different studies report different frequencies of A grades, we can explore whether other features of studies (for example, the year in which the study was conducted) covary with the grading curve used in their participating classrooms.

Second, it might be that the aggregation of frequencies across studies is undertaken to estimate the frequency of an event in a population. So, our motivating question might be how frequently the grade of A is assigned in American college classes. The value of our aggregate estimate from studies will then depend on how well the studies represented American college classes. It would be rare for classes chosen because they are convenient to be fortuitously representative of the nation as a whole. But, it might be possible to apply some sampling weights that would improve our approximation of a population value.

Aggregating attitudes, opinions, and perceptions across studies is even more challenging. This is because social scientists often use different scales to measure these variables, even when the variables have the same conceptual definition. Suppose we wanted to aggregate across studies the reported levels of liking of the instructor. It would not be unusual to find that some studies simply asked, "Do you like your instructor, yes or no?" while others asked, "How much do you like your instructor?" but used different scales to measure liking. Some might have used ten-point scales and others five-point or seven-point scales. Scales with the same numeric gradations might also differ in the anchors they used for the liking dimension. For example, some might have anchored the positive end of the dimension with "a lot," others with "very much," and still others with "my best instructor ever." In this case, even if all the highest numerical ratings are identical, the meaning of the response is different. Clearly, we cannot simply average these measures.

One solution would be to aggregate results only across studies using identical scales and then report results separately for each scale type. Another solution would be to report simply the percentage of respondents using one side of the scale or the other. For example, our cumulative statement might be "Across all studies, 85 percent of students reported positive liking for their instructor." This percentage is simply the number of students whose ratings are on the positive side of the scale, no matter what the scale, divided by the total number of raters. The number can be derived from the raw rating frequencies, if they are given, or from the mean and variance of the ratings (for an example, see Cooper et al. 2003).
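Here is a minimal sketch of the percent-positive solution, assuming hypothetical raw rating frequencies are available from each study. Each study's ratings are split at its own scale midpoint, and midpoint responses count as neither positive nor negative, a choice a real synthesis would need to state:

```python
# Hypothetical raw rating frequencies, one dict per study.
# Keys are scale points; values are numbers of respondents.
study_ratings = [
    {1: 4, 2: 6, 3: 10, 4: 30, 5: 50},                # five-point scale
    {1: 2, 2: 3, 3: 5, 4: 10, 5: 20, 6: 25, 7: 35},   # seven-point scale
]

positive = raters = 0
for freqs in study_ratings:
    midpoint = (min(freqs) + max(freqs)) / 2
    # Count ratings above the scale midpoint as "positive."
    positive += sum(n for point, n in freqs.items() if point > midpoint)
    raters += sum(freqs.values())

print(f"Across all studies, {positive / raters:.0%} of ratings were positive.")
```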
It is rare to see research syntheses in social science that seek to aggregate evidence across studies to answer descriptive questions quantitatively. When this occurs, however, it is critical that the synthesists pay careful attention to whether the measures being aggregated are commensurate. Combining incommensurate measures will result in gibberish.

2.3.2.2 Synthesis of Group-Level Associations and Causal Relationships The accumulation and integration of group-level research for purposes of testing associations and causal relationships has been by far the most frequent objective of research syntheses in social science. In primary research, a researcher selects the most appropriate research design for investigating the problem or hypothesis. It is rare for a single study to implement more than one design to answer the same research question, although there may be multiple operationalizations of the constructs of interest. More typical would be instances in which primary researchers intended to carry out an optimal design for investigating their problem but, due to unforeseen circumstances, ended up implementing a less-than-optimal design. For example, if we are interested in asking, "Does increasing the liking students have for their instructors cause students to do better in class?" we might begin by intending to carry out an experiment in which high school students are randomly assigned to friendly and distant instructors (who are also randomly assigned to teaching style). However, as the semester proceeds, some students move out of the school district, others change sections because of conflicts with other courses, and others change sections because they do not like the instructor. By the end of the semester, it is clear that the students remaining in the sections were not equivalent, on average, when the study began. To rescue the effort, we might use post hoc matching or statistical controls to
re-approximate equivalent groups. Thus, the experiment has become a quasi-experiment. If students in the two conditions at the end of the study are so different on pretests that procedures to produce post hoc equivalence seem inappropriate (for example, with small groups of students sampled from different tails of the class distributions), we might simply correlate the students' ratings of liking (originally meant to be used as a manipulation check in the experiment) with grades. So, a study searching for a causal mechanism has become a study of association. The legitimate inferences allowed by the study began as strong causal inference, degraded to weak causal inference, and ended as simple association.

In contrast, research synthesists are likely to come across a wide variety of research designs that relate the concepts of interest to one another. A search for studies using the term "instructor likeability" and examining them for measures of achievement should identify studies that resulted in simple correlations, multiple regressions, perhaps a few structural equation models, some quasi-experiments, a few experiments, and maybe even some time series involving particular students or class averages.

What should the synthesists do with this variety of research designs? Certainly, our treatment of them depends first and foremost on the nature of the research problem or hypothesis. If our problem is, "Is there an association between the liking students have for their instructors and grades in the course?" then it seems that any and all of the research designs address the issue. The regressions, structural equation models, quasi-experiments, and experiments ask a more specific question about association (these seek a causal connection or one with some other explanations ruled out) but they are tests of an association nonetheless. Thus, it seems that we would be well advised to include all the research designs in our synthesis of evidence on association.

The issue is more complex when our research question deals with causality: Does increasing the liking students have for their instructors cause higher student grades? Here, the different designs produce evidence with different capabilities for drawing strong inferences about our problem. Again, correlational evidence addresses a necessary but not sufficient condition for causal inference. As such, if this were the only research design found in the literature, it would be appropriate for us to assert that the question remained untested. When an association is found, multiple regressions statistically control for some alternative explanations for the relationship, but probably not all. Structural equation models relate to the plausibility of causal networks but do not address causality in the generative sense, that is, whether manipulating one variable will produce a change in the other variable. Well-conducted quasi-experiments may permit weak causal inferences, made stronger through multiple and varied replications. Experiments using random assignment of participants to groups permit the strongest inferences about causality.

How should synthesists interested in problems of causality treat the various designs? At one extreme, they can discard all studies but those using true experimental designs. This approach applies the logic that these are the only studies that directly test the question of interest. All other designs either address association only or do not permit strong inferences. The other extreme would be to include all the research evidence while carefully qualifying inferences as the evidence for causality moves farther from the ideal. A less extreme approach would be to include some but perhaps not all designs while again carefully qualifying inferences.

There are arguments for each of these approaches. The first obligation of synthesists is to clearly state the approach they have employed and the rationale for it. In research areas where strong experimental designs are relatively easy to conduct and plentiful (for example, research on the impact of drug treatments on the social adjustment of children with attention disorders; Ottenbacher and Cooper 1984), I have been persuaded that including only designs that permit relatively strong causal inferences was an appropriate approach to the evidence.

In other instances, experiments may be difficult to conduct and rare, for example, the impact of homework on achievement (Cooper et al. 2006). Here, the synthesists are forced to make a choice between two philosophically different approaches to evidence. If the synthesists believe that discretion is the better part of valor, then they might opt for including only the few experimental studies or stating simply that no credible evidence on the causal relationship exists. Alternatively, if they believe that any evidence is better than no evidence at all, then they might proceed to summarize the less-than-optimal studies, with the appropriate precautions, of course.

Generally speaking, when experimental evidence on causal questions is lacking or sparse I think a more inclusive approach is called for, assuming that the synthesists pay careful and continuous attention to the impact of research design on the conclusions that they draw. In fact, the inclusive approach can provide some interesting benefits to inferences. Returning to our example of instructor
liking, we might find a small set of studies in which the likeability of the instructor has been manipulated and students assigned randomly to conditions. However, in order to accomplish the manipulation, these studies might have been conducted in courses that involved a series of guest lecturers, used to manipulate instructors' likeability. Grades were operationalized as scores on homework assignments turned in after each lecture. These studies might have demonstrated that more likeable guest lecturers produced higher student grades on homework. Thus, in order to carry out the manipulation it was necessary to study likeability in short-term instructor-student interactions. Could it be that over time (more similar to the real-world relationship that develops between instructors and students) likeability becomes less important and perceived competence becomes more important? Could it be that the effect of likeability is short-lived: it appears on homework grades immediately after a class but does not affect how hard a student studies for exams and therefore has much less, if any, effect on test scores and final grades?

These issues, related to construct and external validity, go unaddressed if only the experimental evidence is permitted into our synthesis of instructor likeability and homework grades. Instead, we might use the nonexperimental evidence to help us gain tentative, first approximations about how the likeability of instructors plays out over time and with broader constructions of achievement. The quasi-experiments found in the literature might use end-of-term class grades as outcome measures. The structural equation models might use large, nationally representative samples of students and relate ratings of liking of instructors in general to broad measures of achievement, such as cumulative GPAs and SAT scores. By employing these results to form a web of evidence, we can come to more or less confident interpretations of the experimental findings. If the nonexperimental evidence reveals relationships consistent with the experiments, we can be more comfortable in suggesting that the experimental results generalize beyond the specific operations used in the experiments. If the different types of evidence are inconsistent, it should be viewed as a caution to generalization.

In sum then, it is critical in both primary research and research synthesis that the type of relationship between the variables of interest be clearly specified. This specification dictates whether primary researchers have used the appropriate research designs. Designs appropriate to gather data on one type of problem may or may not provide information relevant for investigating another type of problem. Typically, in primary research only a single research design can be employed in each study. However, in research synthesis, a variety of research designs relating the variables of interest are likely to be found. When the relation of interest concerns an association between variables, designs that seek to rule out alternative explanations or establish causal links are still relevant. When the problem of interest concerns establishing a causal link between variables, designs that seek associations or seek to rule out a specified set of alternative explanations, but not all alternative explanations, do not match the question at hand. Still, if the constraints of conducting experiments place limits on these studies' ability to establish construct and external validity, a synthesis of the nonexperimental work can give a tentative, first approximation of the robustness, and limits, of inferences from the experimental work.

2.3.2.3 Synthesis of Studies of Change in Units Across Time The study of how individual units change over time, and what causes this change, has a long history in the social sciences. Single-case research has developed its own array of research designs and data analysis strategies (for an exposition of single-case research designs, see Barlow and Hersen 1985; for a discussion of data analysis issues, Thomas Kratochwill and Joel Levin 1992). The issues I have discussed regarding descriptive, associational, and causal inferences and how they relate to different research designs play themselves out in the single-case arena in a manner similar to that in group-level research. Different time series designs are associated with different types of research problems and hypotheses. For example, the question of how a student's liking for the instructor changes over the course of a semester would be most appropriately described using a simple time series design. The question of whether a student's liking for the instructor is associated with the student's grades on homework as the semester progresses would be studied using a concomitant time series design. The question of whether a change in a student's liking for the instructor causes a change in the student's homework grade would be studied using any of a number of interrupted time series designs. Much like variations in group designs, variations in these interrupted time series would result in stronger or weaker inferences about the causal relationship of interest.

The mechanics of quantitative synthesis of single-case designs requires its own unique toolbox (for examples of some approaches, see Busk and Serlin 1992; Faith, Allison,
and Gorman 1997; Scruggs, Mastropieri, and Casto 1987; Van den Noortgate and Onghena 2003; White et al. 1989). However, the logic of synthesizing single-case research is identical to that of synthesizing group-level research. It would not be unusual for synthesists to uncover a variety of time series designs relating the variables of interest to one another. If we are interested in whether a change in a student's liking of the instructor caused a change in the student's course grade we may very well find in the literature some concomitant time series, some interrupted time series with a single liking intervention, and perhaps some designs in which liking is enhanced and then lessened as the semester progresses. Appropriate inferences drawn from each of these designs correspond more or less closely with our causal question. The synthesists may choose to focus on the most correspondent designs, or to include designs that are less correspondent but that do provide some information. The decision about which of these courses of action is most appropriate again should be influenced by the number and nature of studies available. And, of course, regardless of the decision rule adopted concerning the inclusion of research designs, synthesists are obligated to carefully delimit their inferences based on the strengths and weaknesses of the included designs.
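Those toolboxes are beyond the scope of this chapter, but a flavor of the raw material can be given in a few lines of Python. The weekly ratings below are invented, and the baseline-scaled mean difference is only one of several phase-contrast metrics proposed in the sources cited above:

```python
from statistics import mean, stdev

# Hypothetical weekly liking ratings for one student: a baseline phase (A)
# before, and an intervention phase (B) after, the instructor turns friendly.
phase_a = [3.0, 3.2, 2.8, 3.1, 2.9]
phase_b = [3.8, 4.1, 4.3, 4.2, 4.5]

# Baseline-scaled mean difference: one simple phase-contrast metric.
d = (mean(phase_b) - mean(phase_a)) / stdev(phase_a)
print(f"phase-contrast effect size: d = {d:.2f}")
```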
2.3.3 Problems and Hypotheses Involving Interactions

At their broadest level, the problems and hypotheses that motivate most research syntheses involve main effects, relations between two variables. Does liking of the instructor cause higher class grades? Do teachers' expectations influence intelligence test scores? This is due to the need to establish such fundamental relationships before investigating the effects of third variables on them.

Of course, the preponderance in practice of main effect questions does not mean that a focus on an interaction cannot or should not motivate a research synthesis. For example, it may be the case that a bivariate relationship is so well established that an interaction hypothesis has become the focus of attention in a field. The fact that liking the instructor causes higher grades in high school might be a finding little in dispute, perhaps because a previous research synthesis has shown it to be so. Now, the issue is whether the causal impact is equally strong across classes dealing with language or mathematics. So, we undertake a new synthesis to answer the interaction question of whether the effect of liking on grades is equally strong across different subject areas.

Also, undertaking a synthesis to investigate the existence of a main effect relationship should in no way diminish the attention paid to interactions within the same project. Indeed, in my earlier example the fact that the previous research synthesis on liking of the instructor causing higher grades in high school did not look at the moderating influence of subject matter might be considered a shortcoming of that work. Typically, when main effect relationships are found to be moderated by third variables, these findings are given inferential priority and are viewed as a step toward understanding the processes involved in the causal chain of events.

Examining interactions in research syntheses presents some unique opportunities and unique problems. To fully understand both, I must introduce yet another critical distinction in the types of evidence that can be explored when research is gathered and integrated.

2.4 STUDY-GENERATED AND SYNTHESIS-GENERATED EVIDENCE

Research synthesis can contain two sources of evidence about the research problem or hypothesis. The first type is called study-generated evidence. Study-generated evidence is present when a single study contains results that directly test the relation being considered. Research syntheses also contain evidence that does not come from individual studies but rather from the variations in procedures across studies. Such evidence, called synthesis-generated, is present when the results of studies using different procedures to test the same hypothesis are compared to one another.

Any research problem or hypothesis (a description, simple association, or causal link) can be examined through either study-generated or synthesis-generated evidence. However, only study-generated evidence based on experimental research allows us to make statements concerning causality.

2.4.1 Study-Generated Evidence, Two-Variable Relationships

Suppose we were interested in gathering and integrating the research on whether teacher expectations influence IQ. We conduct a literature search and discover twenty studies that randomly assigned students to one of two conditions. Teachers for one group of students were given information indicating they could expect unusual intellectual growth during the school year from the students in
their class. The other teachers were given no out-of-the-ordinary information about any students. These twenty studies each provide study-generated evidence regarding a simple two-variable relationship, or main effect. The cumulative results of these studies comparing the end-of-year IQ scores of students in the two groups of classes could then be interpreted as supporting or not supporting the hypothesis that teacher expectations influence IQ.

2.4.2 Study-Generated Evidence, Three-Variable Interactions

Next, assume we are interested as well in whether teacher expectations manipulated through the presentation of a bogus Test of Intellectual Growth Potential have more of an impact on student IQs than the same information manipulated by telling teachers simply that students' past teachers predicted unusual intellectual growth based on subjective impressions. We discover ten of the twenty studies used both types of expectation manipulation and randomly assigned teachers to each. These studies provide study-generated evidence regarding an interaction. Here, if we found that test scores produced a stronger expectation-IQ link than past teachers' impressions, we could conclude that the mode of expectation induction caused the difference. Of course, the same logic would apply to attempts to study higher order interactions involving more than one interacting variable, although these are still rare in research syntheses.

[Figure 2.1. Effects of Teacher Expectations on IQ. Two panels, study 1 and study 2, each plot IQ change (vertical axis, -6 to 14) against class size (horizontal axis, 0 to 28) for a high-expectations group and a control group. Source: author's compilation.]
2.4.2.1 Integrating Interaction Results Across Studies Often, the integration of interaction results in research synthesis is not as simple as combining significance levels or calculating and averaging effect sizes from each study. Figure 2.1 illustrates the problem by presenting the results of two hypothetical studies comparing the effects of manipulated (high) teacher expectations on student IQ that also examined whether the effect was moderated by the number of students in the class. In study 1, the change in student IQ was compared across a set of classrooms that ranged in size from ten to twenty-eight. In study 2 the range of class sizes was from ten to twenty. Study 1 might have produced a significant main effect and a significant interaction involving class size but study 2 might have reported a significant main effect only.

If we uncovered these two studies we might be tempted to conclude that they produced inconsistent results. However, a closer examination of the two figures illustrates why this might not be an appropriate interpretation. The results of study 2 probably would have closely approximated those of study 1 had the class sizes in study 2 represented the same range of values as those in study 1. Note that the slopes for the high expectations groups in study 1 and study 2 are nearly identical, as are those for the control groups.

This example demonstrates that synthesists should not assume that differing strengths of interaction uncovered by different studies necessarily imply inconsistent results. Synthesists need to examine the differing ranges of values of the variables employed in different studies, be they measured or manipulated. If possible, they should chart results taking the different levels into account, just as I did in figure 2.1. In this manner, one of the benefits of research synthesis is realized. Although one study might conclude that the effect of expectations dissipates as class size grows larger and a second study might not, the research synthesists can discover that the two results are in fact perfectly commensurate.
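The charting idea can be mimicked in a few lines of Python. The intercepts, slopes, and class-size ranges below are invented to echo the hypothetical pattern in figure 2.1:

```python
# Hypothetical within-study regressions of IQ change on class size,
# given as (intercept, slope), plus each study's observed range of sizes.
studies = {
    "study 1": {"high": (16.0, -0.5), "control": (2.0, 0.0), "range": (10, 28)},
    "study 2": {"high": (16.0, -0.5), "control": (2.0, 0.0), "range": (10, 20)},
}

for name, s in studies.items():
    for size in s["range"]:
        # Expectation effect = predicted gain (high) minus predicted gain (control).
        effect = (s["high"][0] + s["high"][1] * size) - \
                 (s["control"][0] + s["control"][1] * size)
        print(f"{name}, class size {size:2d}: effect = {effect:+.1f} IQ points")
```

Over the range the two studies share, the charted effects coincide exactly; study 2 simply never sampled the larger classes where the effect fades.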
This benefit of gathering and integrating research also highlights the importance of primary researchers presenting detailed information concerning the levels of variables used in their studies. Research synthesists cannot conduct an across-study analysis similar to this example without the needed information. If the primary researchers in study 1 and study 2 neglected to specify their range of class sizes, perhaps because they simply said they compared smaller classes with larger classes, the commensurability of the results would have been impossible to demonstrate.

Variations in ranges of values for variables also can produce discrepancies in results involving two-variable, or main effect, relationships. This is the circumstance under which the problem is least likely to be recognized and is most difficult to remedy when it is discovered.

2.4.3 Synthesis-Generated Evidence, Two-Variable Relationships

Earlier I provided an example of a two-variable relationship studied with synthesis-generated evidence when I suggested we might relate the average ratings of liking students gave their instructors to the year in which the study was conducted. We might do this to obtain an indication of whether liking of instructors was increasing or decreasing over time. Here, we have taken two descriptive statistics from the study reports and related them to one another. This is synthesis-generated evidence for a two-variable relationship, or main effect.

It should be clear that only associations can be studied in this way, since we have not randomly assigned studies to liking conditions. It should also be clear that looking for such relationships between study-level characteristics opens up the possibility of examining a multitude of problems and hypotheses that might never have been the focus of individual primary studies. For example, even if no previous study has examined whether liking for instructors has gone up or down over the years, we can look at this relation in our synthesis.
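As a toy illustration of how little machinery this requires, the sketch below correlates two invented study-level statistics; the years, the means, and the implied decline are all hypothetical:

```python
from statistics import correlation  # available in Python 3.10 and later

# Hypothetical study-level descriptive statistics: year of the study and
# mean instructor-liking rating (all studies rescaled to a 1-to-5 metric).
years = [1978, 1984, 1991, 1997, 2003, 2008]
mean_liking = [3.9, 3.8, 3.6, 3.5, 3.3, 3.2]

print(f"r = {correlation(years, mean_liking):+.2f}")
```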
The use of synthesis-generated evidence to study main effect relationships is rarely the primary focus of research synthesis, although there is no pragmatic reason why this should be the case. Study characteristics, that is, evidence at the synthesis level, most often come into play as variables that influence the magnitude of two-variable relationships examined within studies. Conceptually, the difference is simply that in this case, the interaction case, one of the two study characteristics being examined already is some expression of a two-variable relationship.

2.4.4 Synthesis-Generated Evidence, Three-Variable Interactions

Let's return to our teacher expectations and IQ example. Suppose we find no studies that manipulated the mode of expectation induction but we do discover that ten of the studies that experimentally manipulated expectations did so using test scores and that ten used the impressions of previous teachers. When we compare the magnitude of the expectation effect on IQ between the two sets of studies, we discover that the link is stronger in studies using test manipulations. We could then infer that an association exists between mode of expectation induction and IQ but we could not infer a causal relation between the two. This is synthesis-generated evidence for an interaction.
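In practice such a comparison is usually made on weighted mean effect sizes rather than raw means. A minimal fixed-effect subgroup sketch, with invented standardized mean differences (d) and sampling variances:

```python
# Hypothetical effect sizes and sampling variances, (d, v),
# grouped by mode of expectation induction.
subgroups = {
    "test scores": [(0.42, 0.02), (0.55, 0.03), (0.38, 0.02)],
    "teacher impressions": [(0.15, 0.02), (0.22, 0.04), (0.10, 0.03)],
}

for mode, effects in subgroups.items():
    weights = [1.0 / v for _, v in effects]        # inverse-variance weights
    mean_d = sum(w * d for (d, _), w in zip(effects, weights)) / sum(weights)
    print(f"{mode}: weighted mean d = {mean_d:.2f}")
```

A between-groups homogeneity (Q) statistic would formalize the contrast, but however it is tested, the comparison remains an association between a study characteristic and study outcomes, for the reasons developed next.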
When groups of effect sizes are compared, regardless of whether they come from simple correlational analyses or controlled experiments using random assignment, the synthesists can only establish an association between a moderator variable (a characteristic of the studies) and the outcomes of studies. They cannot establish a causal connection. Synthesis-generated evidence is restricted to making claims only about associations and not about causal relationships because it is the ability to employ random assignment of participants that allows primary researchers to assume third variables are represented equally in the experimental conditions.

The possibility of unequal representation of third variables across study characteristics cannot be eliminated because we did not randomly assign experiments to modes of expectation manipulation. For example, it might be that the set of studies using tests to manipulate expectations were also conducted in schools with smaller class sizes. Now, we do not know whether it was the use of tests or the size of the classes that caused the difference in the relationship between teacher expectations and IQ. The synthesists cannot discern which characteristic of the studies, or perhaps some unknown other variable related to both, produced the stronger link. Thus, when study characteristics are found associated with study outcomes the synthesists should report the finding as just that, an association, regardless of whether the included studies tested the causal effects of a manipulated variable or estimated the size of an association.

Synthesis-generated evidence cannot legitimately be used to rule out as possible true causes other variables that are confounded with the study characteristic of interest. Thus, when synthesis-generated evidence reveals a relationship that would be of special interest if it were causal, the report should include a recommendation that future research examine this factor using a more systematically controlled design, so that its causal impact can be appraised. In our example, we would call for a primary study to experimentally manipulate both the mode of expectation induction and the size of classes.

2.4.5 High and Low Inference Codes

The examples of study characteristics I have used thus far might all be thought of as low inference codes. Variables such as class size or the type of IQ measure require the coders only to locate the needed information in the research report and transfer it to the database. In some circumstances, synthesists might want to make more inferential judgments about study operations. These high inference codes typically involve attempting to infer how a contextual aspect of the studies might have been interpreted by participants.

For example, I used the mode of manipulation as a study-level variable that might have been examined as a potential moderator of the link between teacher expectations and change in student IQ. I used bogus tests and previous teachers' impressions as the two modes of manipulation. Most manipulations of these types could be classified into the two categories with relatively little inference on the part of the coders. However, suppose that we found in the literature a dozen different types of expectation manipulations, some involving long tests taken in class, others short tests taken after school, some the impressions of past teachers, and others the clinical judgments of child psychologists. It might still be relatively simple to categorize these manipulations at an operational level but the use of the multiple categories for testing meaningful hypotheses at the synthesis level becomes more problematic. Now, we might reason that the underlying conceptual difference between study operations that interests us is the credibility of the manipulation. So, in order to ask the question, "Does the magnitude of the teacher expectation-student IQ link vary with the credibility of the expectation manipulation?" we might want to examine each study report and score the manipulations on this dimension. This high inference code then becomes a study characteristic we relate to study outcomes. Here, we are dimensionalizing the manipulations by making an inference about how much credibility the participating teachers might have placed in the information given to them.

High inference codes create a special set of problems. First, careful attention must be paid to the reliability of inference judgments. So, it is important to show that these judgments are being made consistently both across and within the inferers making the judgments. Also, inferers are being asked to play the role of a research participant (in our example, the teachers being given expectation information) and the validity of role-playing methodologies has been the source of much controversy (Greenberg and Folger 1988). However, Norman Miller, Ju-Young Lee, and Michael Carlson empirically demonstrated that high inference codes can lead to valid judgments (they did this by comparing inferences to manipulation checks) and can add a new dimension to synthesists' ability to interpret literatures and resolve controversies (1991). If high inference information can be validly extracted from articles and the benefit of doing so is clear, then this can be an important technique for exploring problems and hypotheses in research synthesis.
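One routine reliability check on such judgments is inter-coder agreement. The sketch below computes percent agreement and Cohen's kappa for two hypothetical coders rating manipulation credibility; the codes themselves are invented:

```python
from collections import Counter

# Hypothetical credibility codes assigned by two independent coders.
coder1 = ["high", "high", "low", "medium", "high", "low", "medium", "medium"]
coder2 = ["high", "medium", "low", "medium", "high", "low", "low", "medium"]

n = len(coder1)
observed = sum(a == b for a, b in zip(coder1, coder2)) / n

# Chance agreement expected from each coder's marginal distribution.
c1, c2 = Counter(coder1), Counter(coder2)
expected = sum(c1[k] * c2[k] for k in c1) / n**2

kappa = (observed - expected) / (1 - expected)
print(f"agreement = {observed:.2f}, kappa = {kappa:.2f}")
```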
2.4.6 Value of Synthesis-Generated Evidence

In sum then, it is critical that synthesists keep in mind the distinction between study-generated and synthesis-generated evidence. Only evidence coming from experimental manipulations within a single study can support assertions concerning causality. However, I do not want my attention to the fact that synthesis-generated evidence cannot support causal inferences to suggest that this source of evidence should be ignored. As our examples suggest, the use of synthesis-generated evidence allows us to test relations that may never have been examined by primary researchers. In fact, it is typically the case that synthesists can examine many more potential moderators of study outcomes than have appeared as interacting variables in primary research. Often, these study-level variables are of theoretical importance. For example, it is easy to see how our mode-of-expectation-induction variable relates to the credibility of the information given to teachers and how class size relates to the teachers' opportunities to communicate their expectations to students. By searching across studies for variations in the operationalizations of theoretically relevant constructs, synthesists can produce the first evidence on these potentially critical moderating variables. Even though this evidence is equivocal, it is a major benefit of research synthesis and a source of potential hypotheses (and motivation) for future primary research.

2.5 CONCLUSION

A good way to summarize what I have tried to convey in this chapter is to recast its major points by framing them in the context of how they impact the validity of the conclusions drawn in research syntheses. The most central decisions made during problem and hypothesis formulation concern the fit between the concepts used in variable definitions and the operational definitions found in the literature and the correspondence between the inferences about relationships permitted by the evidence in the studies at hand and those drawn by the synthesists.

First, research synthesists need to clearly and carefully define the conceptual variables of interest. Unlike primary researchers, synthesists need not complete this task before the search of the literature begins. Some flexibility permits them to discover operations unknown to them before the literature search began. It also permits them to expand or contract their conceptual definitions so that, in the end, the conceptual definitions appropriately encompass operations present in the literature. If relatively broad concepts accompany claims about the generality of findings not warranted by the operationalizations in hand, then an inferential error may be made. Likewise, relatively narrow concepts accompanied by marginally related operations also may lead to inferential errors. Research synthesists need to expand or contract conceptual definitions and/or retain or eliminate operations so that the highest correspondence between them has been achieved.

When some discretion exists regarding whether to expand or contract definitions, the first decision rule should be that the results will be meaningful to the target audience. The second decision rule is pragmatic: Will expanding the definitions make the synthesis so large and unwieldy that the effort will outstrip resources? If both of these concerns are not negligible, sometimes synthesists will opt to use narrow definitions with only a few operations in order to define their concepts so as to ensure consensus about how their concepts are related to observable events. However, there is reason to favor using broader constructs with multiple realizations, as numerous rival interpretations for the findings may be tested and at least tentatively ruled out if the multiple operations produce similar results. Also, narrow concepts provide little information about the generality or robustness of the results. Therefore, the greater the conceptual breadth of the definitions, the greater is its potential to produce conclusions that are more general than when narrow definitions are used.

The word "potential" was highlighted because if synthesists only cursorily detail study operations, their conclusions may mask important moderators of results. An erroneous conclusion, that research results indicate negligible differences in outcomes across studies, can occur if different results across studies are masked in the use of very broad categories.

So, holding constant the needs of the audience, and assuming adequate resources are available to the synthesists, the availability of ample studies to support a fit between either narrow or broad conceptual definitions suggests it is most desirable for syntheses to employ the broadest possible conceptual definition. To test this possibility, they should begin their literature search with a few central operations but remain open to the possibility that other relevant operations will be discovered in the literature. When operations of questionable relevance are encountered, the synthesist should err toward making overly inclusive decisions, at least in the early stages of their project. However, to complement this conceptual broadness, synthesists should be thorough in their attention to the distinctions in study characteristics. Any suggestion that a difference in study results is associated with a distinction in study characteristics should be tested using moderator analyses.

The second set of validity issues introduced during problem and hypothesis formulation concerns the nature of the relationship between the variables under study. Only study-generated evidence can be used to support causal relationships. When the relationship at issue is a causal one, synthesists must take care to categorize research designs according to their strength of causal inference. They might then exclude all but the most correspondent study-generated evidence (that is to say, experiments using random assignment). Or, they might separately examine less correspondent study-generated evidence and use this to obtain first approximations of the construct and external validity of a finding.

Also, synthesis-generated evidence cannot be used to make causal claims. Thus, synthesists should be careful to distinguish study-generated evidence from synthesis-generated evidence. Of course, this does not mean that synthesis-generated evidence is of little or no use. In fact, synthesis-generated evidence is new evidence, often exploring problems and hypotheses unexplored in primary research. As such, it is the first, among many, unique contributions that research synthesis makes to the advance of knowledge.
2.6 NOTES

1. I use the term unit here to refer to individual people, groups, classrooms, or whatever entity is the focus of study.

2.7 REFERENCES

Barlow, David H., and Michel Hersen. 1985. Single Case Research: Design and Methods. Columbus, Oh.: Merrill.
Boruch, Robert F. 1997. Randomized Experiments for Planning and Evaluation. Thousand Oaks, Calif.: Sage Publications.
Busk, Patricia L., and Ronald C. Serlin. 1992. "Meta-Analysis for Single-Case Research." In Single Case Research Design and Analysis: New Directions for Psychology and Education, edited by Thomas R. Kratochwill and Joel R. Levin. Hillsdale, N.J.: Lawrence Erlbaum.
Cherulnik, Paul D. 2001. Methods for Behavioral Research: A Systematic Approach. Thousand Oaks, Calif.: Sage Publications.
Cooper, Harris. 2006. "Research Questions and Research Designs." In Handbook of Research in Educational Psychology, 2nd ed., edited by Patricia A. Alexander, Philip H. Winne, and Gary Phye. Mahwah, N.J.: Lawrence Erlbaum.
Cooper, Harris, Jeffrey C. Valentine, Kelly Charlton, and April Barnett. 2003. "The Effects of Modified School Calendars on Student Achievement and School Community Attitudes: A Research Synthesis." Review of Educational Research 73: 1-52.
Coser, Lewis. 1971. Masters of Sociological Thought. New York: Harcourt Brace.
Faith, Myles S., David B. Allison, and Bernard S. Gorman. 1997. "Meta-Analysis of Single-Case Research." In Design and Analysis of Single-Case Research, edited by Ronald D. Franklin, David B. Allison, and Bernard S. Gorman. Mahwah, N.J.: Lawrence Erlbaum.
Fowler, Floyd J., Jr. 2002. Survey Research Methods, 3rd ed. Thousand Oaks, Calif.: Sage Publications.
Greenberg, Jerald, and Robert Folger. 1988. Controversial Issues in Social Research Methods. New York: Springer-Verlag.
Kerlinger, Fred N., and Howard B. Lee. 1999. Foundations of Behavioral Research, 4th ed. New York: Holt, Rinehart & Winston.
Kline, Rex B. 1998. Principles and Practices of Structural Equation Modeling. New York: Guilford Press.
Kratochwill, Thomas R., and Joel R. Levin. 1992. Single Case Research Design and Analysis: New Directions for Psychology and Education. Hillsdale, N.J.: Lawrence Erlbaum.
Mannheim, Karl. 1936. Ideology and Utopia. London: Routledge & Kegan Paul.
Miller, Norman, Ju-Young Lee, and Michael Carlson. 1991. "The Validity of Inferential Judgments When Used in Theory-Testing Meta-Analysis." Personality and Social Psychology Bulletin 17: 335-43.
Noblit, George W., and R. Dwight Hare. 1988. Meta-Ethnography: Synthesizing Qualitative Studies. Newbury Park, Calif.: Sage Publications.
Ottenbacher, Kenneth, and Harris Cooper. 1984. "The Effect of Class Placement on the Social Adjustment of Mentally Retarded Children." Journal of Research and Development in Education 17(2): 1-14.
Paterson, Barbara L., Sally E. Thorne, Connie Canam, and Carol Jillings. 2001. Meta-Study of Qualitative Health Research. Thousand Oaks, Calif.: Sage Publications.
Random House. 2001. The Random House Dictionary of the English Language, the Unabridged Edition. New York: Random House.
Rosenthal, Robert. 1991. Meta-Analytic Procedures for Social Research. Newbury Park, Calif.: Sage Publications.
Scruggs, Thomas E., Margo A. Mastropieri, and Glendon Casto. 1987. "The Quantitative Synthesis of Single-Subject Research: Methodology and Validation." Remedial and Special Education 8: 24-33.
Shadish, William R., Thomas D. Cook, and Donald T. Campbell. 2002. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston, Mass.: Houghton Mifflin.
Shoemaker, Pamela J., James William Tankard Jr., and Dominic L. Lasorsa. 2004. How to Build Social Science Theories. Thousand Oaks, Calif.: Sage Publications.
Van den Noortgate, Wim, and Patrick Onghena. 2003. "Combining Single-Case Experimental Data Using Hierarchical Linear Models." School Psychology Quarterly 18: 325-46.
Webb, Eugene J., Donald T. Campbell, Richard D. Schwartz, and Lee Sechrest. 2000. Unobtrusive Measures. Thousand Oaks, Calif.: Sage Publications.
White, David M., Frank R. Rusch, Alan E. Kazdin, and Donald P. Hartmann. 1989. "Application of Meta-Analysis in Individual-Subject Research." Behavioral Assessment 11: 281-96.
3

STATISTICAL CONSIDERATIONS

LARRY V. HEDGES
Northwestern University

CONTENTS

3.1 Introduction
3.2 Problem Formulation
  3.2.1 Universe of Generalization
    3.2.1.1 Conditional Inference Model
    3.2.1.2 Unconditional Inference Model
    3.2.1.3 Fixed Versus Random Effects
    3.2.1.4 Using Empirical Heterogeneity
  3.2.2 Theoretical Versus Operational Parameters
  3.2.3 Number and Source of Hypotheses
3.3 Data Collection
  3.3.1 Representativeness
  3.3.2 Dependence
  3.3.3 Publication Selection
3.4 Data Analysis
  3.4.1 Heterogeneity
  3.4.2 Unity of Statistical Methods
  3.4.3 Large Sample Approximations
3.5 Conclusion
3.6 References
3.1 INTRODUCTION

Research synthesis is an empirical process. As with any empirical research, statistical considerations have an influence at many points in the process. Some of these, such as how to estimate a particular effect parameter or establish its sampling uncertainty, are narrowly matters of statistical practice. They are considered in detail in subsequent chapters of this handbook. Other issues are more conceptual and might best be considered statistical considerations that impinge on general matters of research strategy or interpretation. This chapter addresses selected issues related to interpretation.

3.2 PROBLEM FORMULATION

The formulation of the research synthesis problem has important implications for the statistical methods that may be appropriate and for the interpretation of results. Careful consideration of the questions to be addressed in the synthesis will also have implications for data collection, data evaluation, and presentation of results. In this section I discuss two broad considerations in problem formulation: the universe to which generalizations are referred and the number and source of hypotheses addressed.

3.2.1 Universe of Generalization

A central aspect of statistical inference is making a generalization from an observed sample to a larger population or universe of generalization. Statistical methods are chosen to facilitate valid inferences to the universe of generalization. There has been a great deal of confusion about the choice between fixed- and random-effects statistical methods in meta-analysis. Although it is not always conceived in terms of inference models, the universe of generalization is the central conceptual feature that determines the difference between these methods.

The choice between fixed- and random-effects procedures has sometimes been framed as entirely a question of homogeneity of the effect-size parameters. That is, if all of the studies estimate a common effect-size parameter, then fixed-effects analyses are appropriate. However, if there is evidence of heterogeneity among the population effects estimated by the various studies, or heterogeneity that remains after conditioning on covariates, then random-effects procedures should be used. Although fixed- and random-effects analyses give similar answers when there is in fact a common population effect size, the underlying inference models remain distinct.

I argue that the most important issue in determining statistical procedure should be the nature of the inference desired, in particular the universe to which one wishes to generalize. If the analyst wishes to make inferences only about the effect-size parameters in the set of studies that are observed (or to a set of studies identical to the observed studies except for uncertainty associated with the sampling of subjects into those studies), this is what I will call a conditional inference. One might say that conditional inferences about the observed effect sizes are intended to be robust to the consequences of sampling error associated with sampling of subjects (from the same populations) into studies. Strictly, conditional inferences apply to this collection of studies and say nothing about other studies that may be done later, could have been done earlier, or may have already been done but are not included among the observed studies. Fixed-effects statistical procedures can be appropriate for making conditional inferences.

In contrast, the analyst may wish to make a different kind of inference, one that embodies an explicit generalization beyond the observed studies. In this case the studies that are observed are not the only studies that might be of interest. Indeed, one might say that the studies observed are of interest only because they reveal something about a putative population of studies that is the real object of inference. If the analyst wishes to make inferences about the parameters of a population of studies that is larger than the set of observed studies and that may not be strictly identical to them, I call this an unconditional inference. Random-effects (or mixed-effects) analysis procedures are designed to facilitate unconditional inferences.
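To make the contrast concrete, the sketch below computes both estimates from invented effect sizes and variances: an inverse-variance fixed-effect mean, and a random-effects mean using the DerSimonian-Laird moment estimate of the between-study variance (tau^2):

```python
d = [0.10, 0.30, 0.35, 0.60, 0.45]   # hypothetical study effect sizes
v = [0.04, 0.03, 0.05, 0.02, 0.04]   # within-study sampling variances

w = [1.0 / vi for vi in v]           # fixed-effect (inverse-variance) weights
fixed = sum(wi * di for wi, di in zip(w, d)) / sum(w)

# DerSimonian-Laird moment estimate of the between-study variance tau^2.
q = sum(wi * (di - fixed) ** 2 for wi, di in zip(w, d))
c = sum(w) - sum(wi * wi for wi in w) / sum(w)
tau2 = max(0.0, (q - (len(d) - 1)) / c)

w_star = [1.0 / (vi + tau2) for vi in v]   # random-effects weights
unconditional = sum(wi * di for wi, di in zip(w_star, d)) / sum(w_star)

print(f"fixed-effect mean: {fixed:.3f}")
print(f"tau^2: {tau2:.3f}, random-effects mean: {unconditional:.3f}")
```

The arithmetic differs only in the weights; the substantive difference lies in the universe each mean speaks to. The fixed-effect mean describes these studies, whereas the random-effects mean treats them as a sample from a larger universe of studies.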
3.2.1.1 Conditional Inference Model In the conditional inference model, the universe to which generalizations are made consists of ensembles of studies identical to those in the study sample except for the particular people (or primary sampling units) that appear in the studies. Thus the studies in the universe differ from those in the study sample only as a result of the sampling of people into the groups of the studies. The only source of sampling error or uncertainty is therefore the variation resulting from the sampling of people into studies.

In a strict sense, the universe is structured: it is a collection of identical ensembles of studies, each study in an ensemble corresponding to a particular study in each of the other ensembles. Each of the corresponding studies would have exactly the same effect-size parameter (population
In a strict sense, the universe is structured: it is a collection of identical ensembles of studies, each study in an ensemble corresponding to a particular study in each of the other ensembles. Each of the corresponding studies would have exactly the same effect-size parameter (population effect size). In fact, part of the definition of identical (in the requirement that corresponding studies in different ensembles of this universe be identical) is that they have the same effect-size parameter.

The model is called conditional because it can be conceived as a model that holds fixed, or conditions on, the characteristics of studies that might be related to the effect-size parameter. The conditional model in research synthesis is in the same spirit as the usual regression model and fixed-effects analysis of variance in primary research. In the case of regression, the fixed effect refers to the fact that the values of the predictor variables are taken to be fixed (not randomly sampled). The only source of variation that enters into the uncertainty of estimates of the regression coefficients or tests of hypotheses is that due to the sampling of individuals (in particular, their outcome variable scores) with a given ensemble of predictor variable scores. Uncertainty due to the sampling of predictor variable scores themselves is not taken into account. To put it another way, the regression model is conditional on the particular ensemble of values of the predictor variables in the sample.

The situation is similar in the fixed-effects analysis of variance. Here the term fixed effects refers to the fact that the treatment levels in the experiment are considered fixed, and the only source of uncertainty in the tests for treatment effects is a consequence of sampling of a group of individuals (or their outcome scores) within a given ensemble of treatment levels. Uncertainty attributable to the sampling of treatment levels from a collection of possible treatment levels is not taken into account.

Inference to Other Cases. In conditional models, including regression and analysis of variance (ANOVA), inferences are, in the strictest sense, limited to cases in which the ensemble of values of the predictor variables is represented in the sample. Of course, conditional models are widely used in primary research, and the generalizations supported typically are not constrained to predictor values in the sample. For example, generalizations about treatment effects in fixed-effects ANOVA are usually not constrained to apply only to the precise levels of treatment found in the experiment, but are viewed as applying to similar treatments as well, even if they were not explicitly part of the experiment. How are such inferences justified? Typically they are justified on the basis of an a priori (extra-empirical) decision that other levels (other treatments or ensembles of predictor values) are sufficiently like those in the sample that their behavior will be identical. There are two variations of the argument. One is that a level not in the sample is sufficiently like a level or levels in the sample that it is essentially judged identical to them (for example, this instance of behavior modification treatment is essentially the same as the treatment labeled behavior modification in the sample, or this treatment of twenty weeks is essentially identical to the treatment of twenty weeks in the sample). The other variation of the argument is that a level that is not in the sample lies between values in the sample on some implicit or explicit dimension, and thus it is safe to interpolate between the results obtained for levels in the sample. For example, if we have the predictor values ten and twenty in the sample, then a new case with predictor value fifteen might reasonably have an outcome halfway between that for the two sampled values, or a new treatment might be judged between two others in intensity and therefore its outcome might reasonably be assumed to be between that of the other two. This interpolation between realized values is sometimes formalized as a modeling assumption (for example, in linear regression, where the assumption of a linear equation justifies the interpolation, or even extrapolation to other data, provided they are sufficiently similar to fit the same linear model). In either case the generalization to levels not present in the sample requires an assumption that the levels are similar to those in the sample, one not justified by a formal sampling argument.

Inference to studies not identical to those in the sample can be justified in meta-analysis by the same intellectual devices used to justify the corresponding inferences in primary research. Specifically, inferences may be justified if the studies are judged a priori to be similar enough to those in the study sample. It is important to recognize, however, that the inference process has two distinct parts. One part is the generalization from the study sample to a universe of identical studies, which is supported by a sampling theory rationale. The second part is the generalization from the universe of studies that are identical to the sample to a universe of similar enough (but not identical) studies. This second part is not strictly supported by a sampling argument but by an extra-statistical one.

3.2.1.2 Unconditional Inference Model  In the unconditional model, the study sample is presumed to be literally a sample from a hypothetical collection (or population) of studies. The universe to which generalizations are made consists of a population of studies from which the study sample is drawn. Studies in this universe differ from those in the study sample along two dimensions.
First, the studies differ from one another in study characteristics and in effect-size parameter. The generalization is not, as it was in the fixed-effects case, to a universe consisting of ensembles of studies with corresponding members of the ensembles having identical characteristics and effect-size parameters. Instead, the studies in the study sample (and their effect-size parameters) differ from those in the universe by as much as might be expected as a consequence of drawing a sample from a population. Second, in addition to differences in study characteristics and effect-size parameters, the studies in the study sample also differ from those in the universe as a consequence of the sampling of people into the groups of the study. This results in variation of observed effect sizes about their respective effect-size parameters.

I can conceive of the two dimensions indicated above as introducing two sources of variability into the observed (sample) effect sizes in the universe. One is due to variation in observed (or potentially observable) study effect sizes about their effect-size parameters. This source of variability is a result of the sampling of people into studies and is the only source of variability conceived as random in the conditional model. The other is due to variation in effect-size parameters across studies.
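To make the two sources of variability concrete, here is a minimal sketch in Python (a language chosen purely for illustration; the effect sizes and variances are invented, not from the chapter). It contrasts a fixed-effects summary, whose weights reflect only the conditional within-study variances, with a random-effects summary, whose weights add an estimate of the between-study variance. The between-study variance is estimated here with the common method-of-moments (DerSimonian-Laird) formula; that is one standard choice, not the only one.

```python
import math

# Hypothetical standardized mean differences and their
# conditional (within-study) sampling variances v_i.
t = [0.10, 0.45, 0.30, 0.70, 0.20]   # effect size estimates T_i
v = [0.04, 0.02, 0.05, 0.03, 0.06]   # within-study variances

# Fixed-effects (conditional) summary: the only random component
# is the sampling of people into studies, so weights are 1/v_i.
w_fe = [1.0 / vi for vi in v]
mean_fe = sum(wi * ti for wi, ti in zip(w_fe, t)) / sum(w_fe)

# Method-of-moments (DerSimonian-Laird) estimate of tau^2, the
# variance of the effect-size parameters across studies.
q = sum(wi * (ti - mean_fe) ** 2 for wi, ti in zip(w_fe, t))
c = sum(w_fe) - sum(wi ** 2 for wi in w_fe) / sum(w_fe)
tau2 = max(0.0, (q - (len(t) - 1)) / c)

# Random-effects (unconditional) summary: both sources of
# variability enter each study's weight, 1 / (v_i + tau^2).
w_re = [1.0 / (vi + tau2) for vi in v]
mean_re = sum(wi * ti for wi, ti in zip(w_re, t)) / sum(w_re)

for label, m, w in [("fixed", mean_fe, w_fe), ("random", mean_re, w_re)]:
    se = math.sqrt(1.0 / sum(w))
    print(f"{label:6s} mean = {m:.3f}, SE = {se:.3f}")
print(f"estimated tau^2 = {tau2:.4f}")
```

Note that the random-effects standard error is larger whenever the estimated between-study variance is positive, which is the technical expression of the wider universe of generalization discussed below.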
This model is called the unconditional model because, unlike the conditional model, it does not condition on (or hold fixed) characteristics of studies that might be related to the effect-size parameter. The random-effects model in research synthesis is in the same spirit as the correlation model or the random-effects analysis of variance in primary research. In the correlation model, both the values of the predictor variable and those of the dependent variable are considered to be sampled from a population, in this case one with a joint distribution. In the random-effects analysis of variance, the levels of the treatment factor are sampled from a universe of possible treatment levels (and consequently, the corresponding treatment effect parameters are sampled from a universe of treatment effect parameters). There are two sources of uncertainty in estimates and tests in random-effects analyses: one attributable to the sampling of the treatment effect parameters themselves, the other to the sampling of individuals (in particular, their outcome scores) into each treatment.

Inference to Other Cases. In the unconditional model, inferences are not limited to cases with predictor variables represented in the sample. Instead the inferences, for example, about the mean or variance of an effect-size parameter, apply to the universe of studies from which the study sample was obtained. In effect, the warrant for generalization to other studies is via a classical sampling argument. Because the universe contains studies that differ in their characteristics, and those differences find their way into the study sample by the process of random selection, generalizations to the universe pertain to studies that are not identical to those in the study sample.

By using a sampling model of generalization, the random-effects model seems to avoid the subjective difficulties that plagued the fixed-effects model in generalizations to studies not identical to the study sample. That is, we do not have to ask, How similar is similar enough? Instead we substitute another question: Is this new study part of the universe from which the study sample was obtained? If study samples were obtained from well-defined sampling frames via overtly specified sampling schemes, this might be an easy question to answer. This, however, is virtually never the case in meta-analysis (and is unusual in other applications of random-effects models). The universe is usually rather ambiguously specified, and consequently the ambiguity in generalization based on random-effects models is that it is difficult to know precisely what the universe is. In contrast, the universe is clear in fixed-effects models, but the ambiguity arises in deciding if a new study might be similar enough to the studies already contained in the study sample.

The random-effects model does provide the technical means to address an important problem that is not handled in the fixed-effects model, namely the additional uncertainty introduced by the inference to studies that are not identical (except for the sample of people involved) to those in the study sample. Inference to (nonsampled) studies in the fixed-effects model occurs outside of the technical framework, and hence any uncertainty it contributes cannot be evaluated by technical means within the model. In contrast, the random-effects model does incorporate between-study variation into the sampling uncertainty used to compute tests and estimates.

Although the random-effects model has the advantage of incorporating inferences to a universe of studies exhibiting variation in their characteristics, the definition of the universe may be ambiguous. A tautological universe definition could be derived by using the sample of studies to define the universe as one from which the study sample is representative. Such a population definition remains ambiguous; moreover, it may not be the universe definition desired for the use of the information produced by the synthesis. For example, if the study sample includes many studies of short-duration, high-intensity treatments, but the likely practical applications usually involve low-intensity, long-duration treatments, the universe defined implicitly by the study sample may not be the universe most relevant to applications.
One potential solution to this problem might be to explicitly define a structured universe in terms of study characteristics, and to consider the study sample as a stratified sample from this universe. Estimates of parameters describing this universe could be obtained by weighting each stratum appropriately. For example, if one-half of the studies in the universe are long-duration studies, but only one-third of the study sample have this characteristic, the results of each long-duration study must be weighted twice as much as those of the short-duration studies.
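A small sketch of this reweighting logic, with hypothetical shares matching the example in the text: each stratum receives weight proportional to the ratio of its universe share to its sample share, so long-duration studies end up counting twice as much as short-duration ones.

```python
# Universe and sample composition by stratum (hypothetical values
# matching the example in the text).
universe_share = {"long": 0.5, "short": 0.5}
sample_share = {"long": 1 / 3, "short": 2 / 3}

# Weight each stratum so the weighted sample mirrors the universe.
weight = {s: universe_share[s] / sample_share[s] for s in universe_share}

print(weight)                             # {'long': 1.5, 'short': 0.75}
print(weight["long"] / weight["short"])   # 2.0: twice the weight
```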
3.2.1.3 Fixed Versus Random Effects  The choice between fixed (conditional) and random (unconditional) modeling strategies arises in many settings in statistics and has caused lengthy debates, because it involves subtleties of how to formulate questions in scientific research and what data are relevant to answering them. For example, the debate between R. A. Fisher and Yates versus Pearson on whether to condition on the marginal frequencies in the analysis of 2 × 2 tables is about precisely this issue (see Camilli 1990), as is the debate concerning whether word stimuli should be treated as fixed or random effects in psycholinguistics (see Clark 1973).

Those who advocate the fixed-effects position argue that the basis for scientific inference should be only the studies that have actually been conducted and are observed in the study sample (for example, Peto 1987). Statistical methods should be employed only to determine the chance consequences of sampling of people into these (the observed) studies. Thus they would emphasize estimation and hypothesis testing for (or conditional on) this collection of studies. If we must generalize to other studies, they would argue that this is best done by subjective or extra-statistical procedures.

Those who advocate the random-effects perspective argue that the particular studies we observe are, to some extent, an accident of chance. The important inference question is not what is true about these studies, but what is true about studies like these that could have been done. They would emphasize the generalization to other studies or other situations that could have been studied, and they would argue that these generalizations should be handled by formal statistical methods. In many situations where research is used to inform public policy by providing information about the likely effects of treatments in situations that have not been explicitly studied, this argument seems persuasive.

3.2.1.4 Using Empirical Heterogeneity  Although most statisticians would argue that the choice of analysis procedure should be driven by the inference model, some researchers make that choice based on the outcome of a statistical test of heterogeneity of effect sizes. Larry Hedges and Jack Vevea, who studied the properties of such tests for both conditional and unconditional inferences, called this a conditionally random-effects analysis (1998). They found that the type I error rates of conditionally random-effects analyses were in between those of fixed- and random-effects tests, being slightly inferior to fixed-effects tests for conditional inferences (but better than random-effects tests) and slightly inferior to random-effects tests for unconditional inferences (but better than fixed-effects tests). Whatever the technical performance of conditionally random-effects tests, they have the disadvantage that they permit the user to avoid a clear choice of inference population, or, worse, to allow the data to (implicitly) make that determination depending on the outcome of a statistical test of heterogeneity.
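The following sketch shows what such a data-driven choice amounts to in practice; it is offered to make the procedure concrete, not to endorse it. The heterogeneity statistic Q is compared to a chi-square critical value with k − 1 degrees of freedom (the effect sizes are invented, and scipy is assumed to be available for the quantile).

```python
from scipy.stats import chi2

def conditionally_random(t, v, alpha=0.05):
    """Choose fixed- vs random-effects weights from a heterogeneity
    test: the 'conditionally random-effects' analysis described
    (and cautioned against) in the text."""
    w = [1.0 / vi for vi in v]
    mean_fe = sum(wi * ti for wi, ti in zip(w, t)) / sum(w)
    q = sum(wi * (ti - mean_fe) ** 2 for wi, ti in zip(w, t))
    k = len(t)
    if q <= chi2.ppf(1 - alpha, k - 1):     # no detected heterogeneity
        return "fixed", mean_fe
    # Otherwise switch to DerSimonian-Laird random-effects weights.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)
    w_re = [1.0 / (vi + tau2) for vi in v]
    return "random", sum(wi * ti for wi, ti in zip(w_re, t)) / sum(w_re)

print(conditionally_random([0.1, 0.45, 0.3, 0.7, 0.2],
                           [0.04, 0.02, 0.05, 0.03, 0.06]))
```

Whether the branch taken is "fixed" or "random" here depends on the data, which is exactly the abdication of the inference-model choice that the preceding paragraph warns about.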
3.2.2 Theoretical Versus Operational Parameters

Another fundamental issue in problem formulation concerns the nature of the effect-size parameter to be estimated. The issue can best be described in terms of population parameters (though each parameter has a corresponding sample estimate). Consider an actual study in which the effect-size parameter represents the true or population relationship between variables measured in the study. This effect-size parameter may be systematically affected by artifactual sources of bias such as restriction of range or measurement error in the dependent variable. Corresponding to this actual study, one can imagine a hypothetical study in which the biases due to artifacts were controlled or eliminated. The effect-size parameter in this hypothetical study would differ from that of the actual study because the biases from artifacts of design would not be present. A key distinction is between a theoretical effect size (reflecting a relationship between variables in a hypothetical study) and an operational effect size (the parameter that describes the population relationship between variables in an actual study) (see chapter 18, this volume). Theoretical effect sizes are often conceived as those corrected for bias or for some aspect of experimental procedure (such as a restrictive sampling plan or use of an unreliable outcome measure) that can systematically influence effect size. Operational effect-size parameters, in contrast, are often conceived as affected by whatever biases or aspects of procedure happen to be present in a particular study.
Perhaps the most prominent example of a theoretical effect size is the population correlation coefficient corrected for attenuation due to measurement error and restriction of range. One can also conceive of this corrected coefficient as the population correlation between true scores in a population where neither variable is subject to restriction of range. The operational effect size is the correlation parameter between observed scores in the population in which variables have restricted ranges. Because the relation between the attenuated (operational) correlation and the disattenuated (theoretical) correlation is known, it is possible to convert operational effect sizes into theoretical effect sizes.
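As a sketch of such a conversion (the reliabilities and the degree of range restriction below are invented, and no claim is made that this reproduces the full artifact-correction procedure of Hunter and Schmidt 2004, which also prescribes the order in which corrections are applied): disattenuation divides the observed correlation by the square root of the product of the two reliabilities, and the classical direct range-restriction correction uses the ratio u of unrestricted to restricted standard deviations.

```python
import math

def disattenuate(r_obs, rel_x, rel_y):
    """Correct a correlation for measurement error in both variables."""
    return r_obs / math.sqrt(rel_x * rel_y)

def correct_range_restriction(r_obs, u):
    """Correct a correlation for direct range restriction, where
    u = SD(unrestricted) / SD(restricted) > 1."""
    return u * r_obs / math.sqrt(1 + (u * u - 1) * r_obs * r_obs)

r = 0.25                                  # operational (observed) correlation
r = correct_range_restriction(r, u=1.5)   # hypothetical restriction ratio
r = disattenuate(r, rel_x=0.85, rel_y=0.70)
print(f"theoretical correlation approx {r:.3f}")
```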
Most research syntheses make use of operational effect sizes. However, there are two reasons why theoretical effect sizes are sometimes used. One is to enhance the comparability, and hence combinability, of estimates from studies whose operational effect sizes would otherwise be influenced quite substantially (and differently) by biases or incidental features of study design or procedure. This has sometimes been characterized as putting all of the effect sizes on the same metric (Glass, McGaw, and Smith 1981, 116). For example, in research on personnel selection, virtually all studies involve restriction of range, which attenuates correlations (see Hunter and Schmidt 2004). Moreover, the amount of restriction of range typically varies substantially across studies. Hence correction for restriction of range ensures that each study provides an estimate of the same kind of correlation: the correlation in a population having an unrestricted distribution of test scores. Because restriction of range and many other consequences of design are incidental features of the studies, disattenuation to remove their effects is sometimes called artifact correction.

A more controversial reason that theoretical effect sizes are sometimes used is that they are considered more scientifically relevant. For example, to estimate the benefit of scientific personnel selection using cognitive tests versus selection on an effectively random basis, we would need to compare the performance of applicants with the full range of test scores (those selected at random) with that of applicants selected via the test, applicants who would have a restricted range of test scores. Although a study of the validity of a selection test would compute the correlation between test score and job performance based on the restricted sample, the correlation that reflects the effectiveness of the test in predicting job performance is the correlation that would have been obtained with the full range of test scores: a theoretical correlation. Another example might be the estimation of the standardized mean difference of a treatment intended for a general population of people but typically investigated in studies using more restricted groups of people. Since the scores in the individual studies have a smaller standard deviation than the general population, the effect sizes will be artifactually large; that is, if the treatment produced the same change in raw score units, dividing by the smaller standard deviations in the study samples would make the standardized difference look artifactually large. Hence a theoretical effect size might be chosen: the effect size that would have been obtained if the outcome scores in each study had the same variation as the general population. This would lead to corrections for the restricted variability of scores. Corrections of this sort are discussed in chapter 17.

3.2.3 Number and Source of Hypotheses

Richard Light and David Pillemer distinguished between two types of questions that might be asked in a research synthesis (1984). One kind of question concerns a hypothesis that is specified precisely in advance, for example, whether on average the treatment works. The other type of question is specified only vaguely (for example, under what conditions does the treatment work best). This distinction in problem specification is similar to that between planned and post hoc comparisons in the analysis of variance, which is familiar to many researchers. Although either kind of question is legitimate in research synthesis, the calculation of levels of statistical significance may be affected by whether the question was defined in advance or discovered post hoc by examination of the data.

Although it is useful to distinguish the two cases, sharp distinctions are not possible in practice. Some of the research literature is surely known to the reviewer prior to the synthesis; thus putatively a priori hypotheses are likely to have been influenced by the data. Conversely, hypotheses derived during exploration of the data may have been conceived earlier and proposed for explicit testing because examination of the data suggested that they might be fruitful. Perhaps the greatest ambiguity arises when a very large number of hypotheses are proposed a priori for testing. In this case it is difficult to distinguish between searching the data informally and then proposing hypotheses a posteriori, and simply proposing all possible hypotheses a priori. For this reason it may be sensible to treat large numbers of hypotheses as if they were post hoc.
Despite the difficulty in drawing a sharp distinction, we still believe that the conceptual distinction between cases in which a few hypotheses are specified in advance and those in which there are many hypotheses (or hypotheses not necessarily specified in advance) is useful.

The primary reason for insisting on this distinction is statistical. When testing hypotheses specified in advance, it is appropriate to consider each test in isolation from other tests that might have been carried out. This is often called the use of a testwise error rate in the theory of multiple comparisons, meaning that the appropriate definition of the significance level of the test is the proportion of the time this test, considered in isolation, would yield a type I error.

In contrast, when testing a hypothesis derived after exploring the data, it may not be appropriate to consider the test in isolation from other tests that might have been done. For example, by choosing to test the most promising of a set of study characteristics (that is, the one that appears to be most strongly related to effect size), the reviewer has implicitly used information from tests that could have been done on the other study characteristics. More formally, the sampling distribution of the largest relationship is not the same as that of one relationship selected a priori. In cases where the hypothesis is picked after exploring the data, special post hoc test procedures that take account of the other hypotheses that could have been tested are appropriate (see chapter 15; Hedges and Olkin 1983, 1985; more generally Miller 1981). This is often called the use of an experimentwise error rate, because the appropriate definition of the statistical significance level is the proportion of the time the group of tests would lead to selecting a test that made a type I error.
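A small simulation makes the numerical stakes of this distinction plain. Under independence, testing the single most promising of ten candidate moderators at the testwise .05 level yields an experimentwise error rate near 1 − (1 − .05)^10, or about .40 (the sketch below uses hypothetical independent normal test statistics; real moderator tests are typically correlated, so this is only an illustration).

```python
import random

random.seed(1)
k, alpha, z_crit = 10, 0.05, 1.96   # ten candidate moderators
trials = 20_000

false_positives = 0
for _ in range(trials):
    # Null case: no moderator is truly related to effect size.
    zs = [random.gauss(0.0, 1.0) for _ in range(k)]
    # Testing the single most promising moderator as if chosen a priori:
    if max(abs(z) for z in zs) > z_crit:
        false_positives += 1

print(f"testwise error rate per test: {alpha:.3f}")
print(f"experimentwise error rate:    {false_positives / trials:.3f}")
print(f"theoretical 1-(1-alpha)^k:    {1 - (1 - alpha) ** k:.3f}")
```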
Post hoc test procedures are frequently much less powerful than their a priori counterparts for detecting a particular relationship that is of interest. On the other hand, post hoc procedures can do something that a priori procedures cannot: detect relationships not suspected in advance. The important point is that there is a trade-off between the ability to find relationships that are not suspected and the sensitivity to detect those that are thought to be likely.

3.3 DATA COLLECTION

Data collection in research synthesis is largely a sampling activity, which raises all of the concerns attendant on any other sampling activity. Specifically, the sampling procedure must be designed so as to yield studies that are representative of the intended universe of studies. Ideally the sampling is carried out in a way that reveals aspects of the sampling (like dependence of units in the sample; see chapter 19, this volume) or selection effects (like publication bias; see chapter 23, this volume) that might influence the analytic methods chosen to draw inferences from the sample.

3.3.1 Representativeness

Given that a universe has been chosen, so that we know the kind of studies about which the synthesis is to inform us, a fundamental problem is ensuring that the study sample is selected in such a way that it supports inferences to that universe. Part of the problem is to ensure that search criteria are consistent with the universe definition. Given that they are, an exhaustive sample of studies that meet the criteria is often taken to be a representative sample of studies of the universe. However, there are sometimes reasons to view this proposition skeptically.

The concept of representativeness has always been a somewhat ambiguous idea (see Kruskal and Mosteller 1979a, 1979b, 1979c, 1980), but the concept is useful in helping to illuminate potential problems in drawing inferences from samples.
One reason is that some types of studies in the intended universe may not have been conducted. The act of defining a universe of studies that could be conducted does not even guarantee that it will be nonempty. We hope that the sample of studies will inform us about a universe of studies exhibiting variation in their characteristics, but studies with a full range of characteristics may not have been conducted. The types of studies that have been conducted therefore limit the possibility of generalizations, whatever the search procedures. For example, the universe of studies might be planned to include studies with both behavioral observations and other outcome measures. But if no studies have been conducted using behavioral observations, no possible sample of studies can, strictly speaking, support generalizations to a universe including studies with behavioral observations. The situation need not be as simple or obvious as suggested in this example. Often the limitations on the types of studies conducted arise at the level of the joint frequency of two or more characteristics, such as categories of treatment and outcome. In such situations the limitations of the studies available are evident only in the scarcity of studies with certain joint characteristics and not in the marginal frequency of any type of study.

A second reason that exhaustiveness of sampling may not yield a representative sample of the universe is that, although studies may have been conducted, they may not be reported in the forums accessible to the reviewer. This is the problem of missing data, as discussed in chapter 21 of this volume. Selective reporting (at least in forums accessible to synthesists) can occur in many ways: entire studies may be missing or only certain results from those studies may be missing, and missingness may or may not be correlated with study results. The principal point here is that exhaustive sampling of databases rendered nonrepresentative by publication or reporting bias does not yield a representative sample of the universe intended.

3.3.2 Dependence

Several types of dependence may arise in the sampling of effect size estimates. The simplest form of dependence arises when several effect size estimates are computed from measures from identical or partially overlapping groups of subjects. Methods for handling such dependence are discussed in chapter 19. While this form of dependence most often arises when several effect size estimates are computed from data reported in the same study, it can also arise when several different studies report data on the same sample of subjects. Failure to recognize this form of dependence and to use appropriate analytic strategies to cope with it can result in inaccurate estimates of effects and their standard errors.

A second type of dependence occurs when studies with similar or identical characteristics exhibit less variability in their effect-size parameters than does the entire sample of studies. Such an intraclass correlation of study effects leads to misspecification of random-effects models and hence to erroneous characterizations of between-study variation in effects. This form of dependence would also suggest misspecification in fixed-effects models if the study characteristics involved were not part of the formal explanatory model for between-study variation in effects.

3.3.3 Publication Bias

A particularly pernicious form of missing data occurs when the probability that a result is reported depends on the result obtained (for example, on whether it is statistically significant). Such missing data would be called nonignorable in the Little-Rubin framework and can lead to bias in the results of a meta-analysis. The resulting biases are typically called publication bias, although the more general reporting bias is a more accurate term. Methods for detecting and adjusting for publication bias are discussed at length in chapter 23 of this volume.

3.4 DATA ANALYSIS

Much of this handbook deals with issues of data analysis in research synthesis. It is appropriate here to discuss three issues of broad application to all types of statistical analyses in research synthesis.

3.4.1 Heterogeneity

Heterogeneity of effects in meta-analysis introduces a variety of interpretational problems that do not exist, or are simpler, if effects are homogeneous. For example, if effects are homogeneous, then fixed- and random-effects analyses essentially coincide and problems of generalization are simplified. Homogeneity also simplifies interpretation by making the synthesis basically an exercise in simple triangulation, with the findings from different studies behaving as simple replications.

Unfortunately, heterogeneity is a rather frequent finding in research syntheses. This compels the synthesist to find ways of representing and interpreting the heterogeneity and dealing with it in statistical analyses. Fortunately, there has been considerable progress in procedures for representing heterogeneity in interpretable ways, such as the I² measure, a theoretically sound measure of the proportion of variance in observed effect sizes due to heterogeneity (see chapter 14, this volume). Similarly, there has been great progress in both theory and software for random- and mixed-effects analyses that explicitly include heterogeneity in analyses (see chapter 16, this volume).
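For instance, I² can be computed directly from the usual homogeneity statistic Q, as in the following sketch (the effect sizes and variances are invented; the definition used is the standard I² = max(0, (Q − df)/Q), expressed as a percentage).

```python
# Hypothetical effect size estimates and within-study variances.
t = [0.12, 0.55, 0.31, 0.68, 0.05]
v = [0.03, 0.02, 0.04, 0.03, 0.05]

w = [1.0 / vi for vi in v]
mean_fe = sum(wi * ti for wi, ti in zip(w, t)) / sum(w)

# Cochran's Q and the I-squared heterogeneity measure.
q = sum(wi * (ti - mean_fe) ** 2 for wi, ti in zip(w, t))
df = len(t) - 1
i2 = max(0.0, (q - df) / q) * 100.0

print(f"Q = {q:.2f} on {df} df, I^2 = {i2:.1f}%")
```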
Some major issues arising in connection with heterogeneity are not easily resolvable. For example, heterogeneity is sometimes a function of the choice of effect size index. If the effect sizes of a set of studies can be expressed in more than one metric, they are sometimes more consistent when expressed in one metric than another. Which then (if either) is the appropriate way to characterize the heterogeneity of effects? This question has theoretical implications for statistical analyses (for example, how necessary is homogeneity for combinability; see Cochran 1954; Radhakrishna 1965). However, it is more than a theoretical issue, since there is empirical evidence that meta-analyses using some indexes find more heterogeneity than others (for example, meta-analyses using odds ratios are often more consistent across studies than those using risk differences; see Engels et al. 2000).
3.4.2 Unity of Statistical Methods

Much of the literature on statistical methodology for research synthesis is conceived as statistical methods for the analysis of a particular effect size index. Thus, much of the literature on meta-analysis provides methods for combining estimates of odds ratios, or correlation coefficients, or standardized mean differences. Even though the methods for a particular effect size index might be similar to those for another index, they are presented in the literature as essentially different methods. There is, however, a body of underlying statistical theory (see Cochran 1954; Hedges 1983) that provides a common theoretical justification for analyses of the effect size measures in common use: for example, the standardized mean difference, the correlation coefficient, the log odds ratio, the difference in proportions, and the like.

Essentially all commonly used statistical methods for effect size analyses rely on two facts. The first is that the effect size estimate, or a suitable transformation of it, is normally distributed in large samples with a mean of approximately the effect-size parameter. The second is that the standard error of the effect size estimate is a continuous function of the within-study sample sizes, the effect size, and possibly other parameters that can be estimated consistently from within-study data. Statistical methods for different effect size indexes appear to differ primarily because the formulas for the effect size indexes and their standard errors differ.

I am mindful of the variety of indexes of effect size that have been found useful. Chapter 12 is a detailed treatment of the variety of effect size indexes that can be applied to studies with continuous outcome measures and gives formulas for their standard errors. Chapter 13 provides a similar treatment of the variety of different effect size measures that are used for studies with categorical outcome variables. However, in this handbook we have chosen to stress the conceptual unity of statistical methods for different indexes of effect size by describing most methods in terms of a generic effect size statistic T, its corresponding effect-size parameter θ, and a generic standard error S(T). This permits statistical methods to be applied to a collection of effect size estimates of any type by substituting the correct formulas for the individual estimates and their standard errors. This procedure not only provides a compact presentation of methods for existing indexes of effect size, but provides the basis for generalization to new indexes that are yet to be used.
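The practical payoff of the generic formulation is that one combination routine serves every index. The sketch below is an illustration, not the handbook's own software: a fixed-effects inverse-variance combination that accepts any vector of estimates T and standard errors S(T), whether the estimates are standardized mean differences, log odds ratios, or Fisher z values (the numbers are invented).

```python
import math

def combine(t, s, z=1.96):
    """Generic inverse-variance combination of effect size estimates
    t with standard errors s; returns (mean, lower, upper)."""
    w = [1.0 / (si * si) for si in s]
    mean = sum(wi * ti for wi, ti in zip(w, t)) / sum(w)
    se = math.sqrt(1.0 / sum(w))
    return mean, mean - z * se, mean + z * se

# The same routine applies to any index once the right formulas for
# the estimates and standard errors are substituted, e.g. to
# hypothetical log odds ratios ...
print(combine([0.8, 0.3, 0.5], [0.30, 0.25, 0.40]))
# ... or to hypothetical Fisher z-transformed correlations.
print(combine([0.20, 0.35, 0.28], [0.15, 0.10, 0.12]))
```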

3.4.3 Large Sample Approximations

Virtually all of the statistical methods described in this handbook and used in the analysis of effect sizes in research synthesis are based on so-called large-sample approximations. This means that, unlike some simple statistical methods such as the t-test for the difference between means or the F-test in analyses of the general linear model, the sampling theory invoked to construct hypothesis tests or confidence intervals is not exactly true in very small samples. The use of large sample statistical theory is not limited to meta-analysis; in fact, it is used much more frequently in applied statistics than is exact (or small sample) theory, typically because the exact theory is too difficult to develop. For example, Pearson's chi-square test for simple interactions and log linear procedures for more complex analyses in contingency tables are large sample procedures, as are most multivariate test procedures, procedures using LISREL models, item response models, or even Fisher's z-transform of the correlation coefficient.

That large sample procedures are widely used does not imply that they are always without problems in any particular setting. Indeed, one of the major questions that must be addressed in any application of large sample theory is whether the large sample approximation is accurate enough in samples of the size available to justify its use.

In meta-analysis, large sample theory is primarily used to obtain the sampling distribution of the sample effect size estimates. The statistical properties of combined estimates or tests depend on the accuracy of the approximations to the sampling distributions of these individual effect size estimates. Fortunately, quite a bit is known about the accuracy of these approximations to the distributions of effect size estimates.

In the case of the standardized mean difference, the large sample theory is quite accurate for sample sizes as small as five to ten per group (see Hedges 1981, 1982; Hedges and Olkin 1983).
In the case of the correlation coefficient, the large sample theory is notoriously inaccurate in samples of less than a few hundred, particularly if the population correlation is large in magnitude. However, the large sample theory for the Fisher z-transformed correlation is typically quite accurate when the sample size is twenty or more. For this reason I usually suggest that analyses involving correlation coefficients as the effect size index be performed using the Fisher z-transforms of the correlations.
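A brief sketch of that recommendation, with invented correlations and sample sizes: transform each r to z = arctanh(r), whose large-sample standard error is 1/√(n − 3), combine on the z metric, and transform the combined value (and its confidence limits) back with tanh.

```python
import math

# Hypothetical study correlations and sample sizes.
r = [0.30, 0.45, 0.25, 0.38]
n = [40, 65, 30, 120]

z = [math.atanh(ri) for ri in r]   # Fisher z-transform
w = [ni - 3 for ni in n]           # weight = 1 / SE^2 = n - 3

z_bar = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
se = math.sqrt(1.0 / sum(w))

# Back-transform the combined z (and its CI) to the r metric.
lo, hi = z_bar - 1.96 * se, z_bar + 1.96 * se
print(f"combined r = {math.tanh(z_bar):.3f} "
      f"(95% CI {math.tanh(lo):.3f} to {math.tanh(hi):.3f})")
```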
The situation with effect size indexes for experiments with discrete outcomes is more difficult to characterize. The large sample theory for differences in proportions and for odds ratios usually seems to be reasonably accurate when sample sizes are moderate (for example, greater than fifty), as long as the proportions involved are not too near zero or one. If the proportions are particularly near zero or one, larger sample sizes may be needed to ensure comparable accuracy.

A final technical point about the notion of large sample theory in meta-analysis concerns the dual meaning of the term. The total sample size N may be thought of as the sum of the sample sizes across the k studies included in the synthesis. In studies comparing a treatment group with sample size n_i^E in the ith study and a control group with sample size n_i^C in the ith study, the total sample size is

N = \sum_{i=1}^{k} (n_i^E + n_i^C).

The formal statistical theory underlying most meta-analytic methods is based on large sample approximations that hold when N is large in such a way that all of n_1^E, n_1^C, n_2^E, n_2^C, . . . , n_k^E, n_k^C are also large.

Formally, large sample theory describes the behavior of the limiting distribution as N → ∞ in such a way that n^E/N and n^C/N are fixed as N increases. Much of this theory is not true when N → ∞ by letting k increase while keeping the within-study sample sizes n^E and n^C small (see Neyman and Scott 1948). In most practical situations, this distinction is not important because the within-study sample sizes are large enough to support the assumption that all n^E and n^C are large. In unusual cases where all of the sample sizes are very small, real caution is required.

3.5 CONCLUSION

Statistical thinking is important in every stage of research synthesis, as it is in primary research. Statistical issues are part of the problem definition, play a key role in data collection, and are obviously important in data analysis. The careful consideration of statistical issues throughout a synthesis can help assure its validity.

3.6 REFERENCES

Camilli, Greg. 1990. "The Test of Homogeneity for 2 × 2 Contingency Tables: A Review and Some Personal Opinions on the Controversy." Psychological Bulletin 108: 135–45.
Clark, Herbert H. 1973. "The Language-as-Fixed-Effect Fallacy: A Critique of Language Statistics in Psychological Research." Journal of Verbal Learning and Verbal Behavior 12(4): 335–59.
Cochran, William G. 1954. "The Combination of Estimates from Different Experiments." Biometrics 10(1): 101–29.
Engels, Eric A., Christopher C. Schmid, Norma Terrin, Ingram Olkin, and Joseph Lau. 2000. "Heterogeneity and Statistical Significance in Meta-Analysis: An Empirical Study of 125 Meta-Analyses." Statistics in Medicine 19(13): 1707–728.
Glass, Gene V., Barry McGaw, and Mary L. Smith. 1981. Meta-Analysis in Social Research. Beverly Hills, Calif.: Sage Publications.
Hedges, Larry V. 1981. "Distribution Theory for Glass's Estimator of Effect Size and Related Estimators." Journal of Educational Statistics 6(2): 107–28.
———. 1982. "Estimating Effect Size from a Series of Independent Experiments." Psychological Bulletin 92: 490–99.
———. 1983. "Combining Independent Estimators in Research Synthesis." British Journal of Mathematical and Statistical Psychology 36: 123–31.
Hedges, Larry V., and Ingram Olkin. 1983. "Clustering Estimates of Effect Magnitude." Psychological Bulletin 93: 563–73.
———. 1985. Statistical Methods for Meta-Analysis. New York: Academic Press.
Hedges, Larry V., and Jack L. Vevea. 1998. "Fixed- and Random-Effects Models in Meta-Analysis." Psychological Methods 3(4): 486–504.
Hunter, John E., and Frank L. Schmidt. 2004. Methods of Meta-Analysis. Thousand Oaks, Calif.: Sage Publications.
Kruskal, William, and Fred Mosteller. 1979a. "Representative Sampling, I: Nonscientific Literature." International Statistical Review 47(1): 13–24.
———. 1979b. "Representative Sampling, II: Scientific Literature, Excluding Statistics." International Statistical Review 47(2): 111–27.
———. 1979c. "Representative Sampling, III: The Current Statistical Literature." International Statistical Review 47(3): 245–65.
———. 1980. "Representative Sampling, IV: The History of the Concept in Statistics, 1895–1939." International Statistical Review 48(2): 169–95.
Light, Richard J., and David B. Pillemer. 1984. Summing Up. Cambridge, Mass.: Harvard University Press.
Miller, Rupert G. 1981. Simultaneous Statistical Inference. New York: Springer-Verlag.
Neyman, Jerzy, and Elizabeth L. Scott. 1948. "Consistent Estimates Based on Partially Consistent Observations." Econometrica 16(1): 1–32.
Peto, Richard. 1987. "Why Do We Need Systematic Overviews of Randomized Trials? (with Discussion)." Statistics in Medicine 6(3): 233–44.
Radhakrishna, S. 1965. "Combination of Results from Several 2 × 2 Contingency Tables." Biometrics 21(1): 86–94.
4
SCIENTIFIC COMMUNICATION AND LITERATURE RETRIEVAL

HOWARD D. WHITE
Drexel University

CONTENTS
4.1 Communing with the Literature 52
4.2 The Reviewer's Progress 52
4.3 The Web and Documentary Evidence 53
4.4 The Cochrane Collaboration 53
4.5 The Campbell Collaboration 55
4.6 Recall and Precision 56
4.7 Improving the Yield 58
4.8 Footnote Chasing 59
4.9 Consultation 60
4.10 Searches with Subject Indexing 62
4.10.1 Natural Language Terms 63
4.10.2 Controlled Vocabulary 63
4.10.3 Journal Names 65
4.11 Browsing 65
4.12 Citation Searches 65
4.13 Final Operations 67
4.13.1 Judging Relevance 67
4.13.2 Document Delivery 67
4.14 References 68

4.1 COMMUNING WITH THE LITERATURE

Following analysts such as Russell Ackoff and his colleagues (1976), it is convenient to divide scientific communication into four modes: informal oral, informal written, formal oral, and formal written. The first two, exemplified by telephone conversations and mail among colleagues, are relatively free-form and private. They are undoubtedly important, particularly as ways of sharing news and of receiving preliminary feedback on professional work (Menzel 1968; Garvey and Griffith 1968; Griffith and Miller 1970). Only through the formal modes, however, can scientists achieve their true goal, which is recognition for claims of new knowledge in a cumulative enterprise. The cost to them is the effort needed to prepare claims for public delivery, especially to critical peers. Of all the modes, formal written communication is the most premeditated (even more so than a formal oral presentation, such as a briefing or lecture). Its typical products (papers, articles, monographs, and reports) constitute the primary literatures by which scientists lay open their work to permanent public scrutiny in hopes that their claims to new knowledge will be validated and esteemed. This holds whether the literatures are distributed in printed or digital form. Moreover, the act of reading is no less a form of scientific communication than talking or writing; it might be called communing with the literature.

The research synthesis emerges at a late stage in formal written communication, after scores or even hundreds of primary writings have appeared. Among the synthesist's tasks, three stand out: discovering and retrieving primary works, critically evaluating their claims in light of theory, and synthesizing their essentials so as to conserve the reader's time. All three tasks require the judgment of one trained in the research tradition (the theory or methodology) under study. As Patrick Wilson put it, "The surveyor must be, or be prepared to become, a specialist in the subject matter being surveyed" (1977, 17). But the payoff, he argued, can be rich: "The striking thing about the process of evaluation of a body of work is that, while the intent is not to increase knowledge by the conducting of independent inquiries, the result may be the increase of knowledge, by the drawing of conclusions not made in the literature reviewed but supported by the part of it judged valid. The process of analysis and synthesis can produce new knowledge in just this sense, that the attempt to put down what can be said to be known, on the basis of a given collection of documents, may result in the establishment of things not claimed explicitly in any of the documents surveyed" (11). Greg Myers held that such activity also serves the well-known rhetorical end of persuasion: "The writer of a review shapes the literature of a field into a story in order to enlist the support of readers to continue that story" (1991, 45).

4.2 THE REVIEWER'S PROGRESS

As it happens, not everyone has been glad to do the shaping. In 1994, when this chapter first appeared, the introductory section was subtitled "The Reviewer's Burden" to highlight the oft-perceived difficulty of getting scientists to write literature reviews as part of their communicative duties. The Committee on Scientific and Technical Communication quoted one reviewer as saying, "Digesting the material so that it could be presented on some conceptual basis was plain torture; I spent over 200 hours on that job. I wonder if 200 people spent even one hour reading it" (1969, 181). A decade later, Charles Bernier and Neil Yerkey observed: "Not enough reviews are written, because of the time required to write them and because of the trauma sometimes experienced during the writing of an excellent critical review" (1979, 48–49). Still later, Eugene Garfield noted a further impediment: the view among some scientists that a review article, even one that is highly cited, is a lesser achievement than a piece of original research (1989, 113).

My summary in the original chapter was that the typical review involves tasks that many scientists find irksome (an ambitious literature search, obtaining of documents, extensive reading, reconciliation of conflicting claims, preparation of citations), with the bulk of effort centered on the work of others, as if one had to write chapter 2 of one's dissertation all over again (1994, 42). Even so, it was necessary to add: "Against that backdrop, the present movement in research synthesis is an intriguing development. The new research synthesists, far from avoiding the literature search, encourage consideration of all empirical studies on a subject, not only the published but the unpublished ones, so as to capture in their syntheses the full range of reported statistical effects (Rosenthal 1984; Green and Hall 1984)" (42).

The new research synthesists were of course the meta-analysts, and their tradition is now a well-established alternative to the older narrative tradition (Hunt 1997; Petticrew 2001).
Rather than shunning the rigors of the narrative review, meta-analysts seek to make it more rigorous, with at least the level of statistical sophistication found in primary research. Where narrative reviewers have been mute on how they found the studies under consideration, the research synthesists call for making one's sources and search strategies explicit. Where narrative reviewers accept or reject studies impressionistically, research synthesists want firm editorial criteria. Where narrative reviewers are inconsistent in deciding which aspects of studies to discuss, the research synthesists require consistent coding of attributes across studies. Where narrative reviewers use ad hoc judgments as to the meaning of statistical findings, the meta-analysts insist on formal comparative and summary techniques. Where reviewers in the past may have been able to get by on their methodological and statistical skills without extra reading, they now confront whole new meta-analytic textbooks (Wang and Bushman 1999; Lipsey and Wilson 2001; Hunter and Schmidt 2004). Meta-analysis has even been extended to qualitative studies (Jones 2004; Dixon-Woods et al. 2007) and to nonstatistical integration of findings (Bland, Meurer, and Maldonado 1995). Moreover, while the meta-analytic method is not without critics (Pawson 2006), the years since the early 1990s have seen remarkable improvements for its advocates on several fronts, and it is only fair to sketch them in a new introduction.

4.3 THE WEB AND DOCUMENTARY EVIDENCE

The Internet and the Web have spectacularly advanced science in its major enterprise, the communication of warranted findings (Ziman 1968). They have sped up interpersonal messaging (e-mail, file transfer, listservs), expanded the resources for publication (e-journals, preprint archives, data repositories, pdf files, informational websites), created limitless library space, and, most important, brought swift retrieval to vast stores of documents. The full reaches of the Web and its specialized databases are accessible day or night through search engines, notably Google. In many cases, one can now pass quickly from bibliographic listings to the full texts they represent, closing a centuries-old gulf. Retrievals, moreover, are routinely based on all content-bearing words and phrases in documents and not merely on global descriptions like titles and subject headings.

The new technologies for literature retrieval have made it easier for fact-hungry professionals to retrieve something they especially value: documentary evidence for claims, preferably statistical evidence from controlled experimental trials (Mulrow and Cook 1998; Petitti 2000; Egger, Smith, and Altman 2001). Evidence-based practice has become a movement, especially in medicine and healthcare but also in quantitative psychology, education, social work, and public policy studies. Naturally, this movement has converged and flourished with the somewhat older research synthesis movement (Cooper 2000, appendix), as represented by this handbook. The meta-analysts use special statistical techniques to test evidence from multiple experimental studies and then synthesize and publish the results in relatively short articles. Members of the evidence-based community are their ideal consumers (compare Doig and Simpson 2003). The two movements exemplify symbiotic scientific communication in the Internet era.

4.4 THE COCHRANE COLLABORATION

Both movements, however, were gathering strength before the Internet took off and would exist even if it had not appeared. In their area of scientific communication, a striking organizational improvement also antedates most of the Internet's technological advances. It is the Cochrane Collaboration, headquartered in the United Kingdom but international, which began in 1993. Cooper (2000) gives the background:

In 1979, Archie Cochrane, a British epidemiologist, noted that a serious criticism of his field was that it had not organized critical summaries of relevant randomized controlled trials (RCTs). In 1987, Cochrane found an example in health care of the kind of synthesis he was looking for. He called this systematic review of care during pregnancy and childbirth "a real milestone in the history of randomized trials and in the evaluation of care," and suggested that other specialties should copy the methods (Cochrane 1989). In the same year, the scientific quality of many published syntheses in medicine was shown to leave much to be desired (Mulrow 1987).

The Cochrane Collaboration was developed in response to the call for systematic, up-to-date syntheses of RCTs of healthcare practices. Funds were provided by the United Kingdom's National Health Service to establish the first Cochrane Center. When the center opened at Oxford in 1992, those involved expressed the hope that there would be a collaborative international response to Cochrane's agenda. This idea was outlined at a meeting organized six months later by the New York Academy of Sciences. In October 1993, at what was to become the first in a series of annual Cochrane Colloquia, seventy-seven people from eleven countries co-founded the Cochrane Collaboration.
A central idea of the founders is that, in matters of health, the ability to find studies involving RCTs is too important to be left to the older, relatively haphazard methods. To produce what they call systematic reviews (another term for the integrative review or research synthesis), they have built a sociotechnical infrastructure for assembling relevant materials. Criteria for selecting studies for meta-analysis, and for reporting results, are now explicit. Teams of specialists known as review groups are named to monitor literatures and to contribute their choices to several databases (the Cochrane Library) that are in effect large-scale instruments of quality control in meta-analytic work. The Library includes databases of systematic reviews already done, of methodology reviews, of abstracts of reviews of effects, and of controlled trials available for meta-analysis.

The 265-page Cochrane Handbook for Systematic Reviews of Interventions, which is available online, prescribes a set of procedures for retrieving and analyzing items from these databases (Higgins and Green 2005). Websites such as that of the University of York's Centre for Reviews and Dissemination back up its precepts on literature retrieval. The Centre offers detailed guidance on how to cover many bases in searching, including the hand searching of journals for items that computerized search strategies may miss (2005).

A typical set of injunctions for retrieval of medical literature in the specialty of back and neck pain was prepared by the Cochrane Back Group (van Tulder et al. 2003). They recommend:

1. A computer-aided search of the MEDLINE and EMBASE databases since their beginning. The highly sensitive search strategies for retrieval of reports of controlled trials should be run in conjunction with a specific search for spinal disorders and the intervention at issue.

2. A search of the Cochrane Central Register of Controlled Trials (CENTRAL) that is included in the latest issue of the Cochrane Library.

3. Screening of references given in relevant systematic reviews and identified RCTs.

4. Personal communication with content experts in the field. It is left to the discretion of the synthesists to identify who the experts are on a specific topic.

5. Citation tracking of the identified RCTs (use of Science Citation Index to search forward in time for subsequent articles that have cited the identified RCTs) should be considered. The value of using citation tracking has not yet been established, but it may be especially useful for identifying additional studies on topics that are poorly indexed in MEDLINE and EMBASE.

There was a time when the way to protect a synthesist's literature search from criticism was simply not to reveal what had been done. For example, Gregg Jackson (1980) surveyed thirty-six research syntheses in which only one stated the indexes (such as Psychological Abstracts) used in the search and only three mentioned drawing upon the bibliographies of previous review articles. "The failure of almost all integrative review articles to give information indicating the thoroughness of the search for appropriate primary sources," Jackson wrote, "does suggest that neither the reviewers nor their editors attach a great deal of importance to such thoroughness" (444). The different standards of today are seen even in the abstracts of systematic reviews, many of which mention the literature search underlying them. The protocol-based method section of an abstract by Rob Smeets and his colleagues illustrates:

Method. Systematic literature search in PUBMED, MEDLINE, EMBASE and PsycINFO until December 2004 to identify observational studies regarding deconditioning signs and high quality RCTs regarding the effectiveness of cardiovascular and/or muscle strengthening exercises. Internal validity of the RCTs was assessed by using a checklist of nine methodology criteria in accordance with the Cochrane Collaboration. (2006, 673)

This is a good example of scientific communication interacting with literature retrieval to produce change (but see also Branwell and Williams 1997).

The rigorous Cochrane approach upgrades reviewing in general. The various stakeholders in medical and healthcare research, including funders, are likely to want Cochrane standards applied wherever possible: Gresham's law in reverse. The reviewers themselves seem to accept the higher degree of structure that has been imposed, as demonstrated over the last decade or so by leaping productivity (Lee, Bausell, and Berman 2001). Figure 4.1 presents data from Thomson Scientific's Web of Science (specifically, the combined Science Citation Index, SCI, and Social Sciences Citation Index, SSCI, from 1981 through 2005) that show the number of articles per year with "meta-analy*," "integrative review," or "systematic review" in their bibliographic descriptions. They now total more than 23,000, the great majority of them in journals of medicine and allied fields, though many nonmedical journals are represented as well.
journals of medicine and allied fields, though many nonmedical journals are represented as well.

Figure 4.1 Growth of Meta-Analytic Literature in SCI and SSCI. [Line graph: articles per year, 1981-2005, rising from near zero to roughly 3,500.] source: Author's compilation.

4.5 THE CAMPBELL COLLABORATION

Medicine and its allied fields rest on the idea of beneficial intervention in crucial situations. Given that crucial here can mean life and death, it is easy to see why health professionals want their decisions to reflect the best evidence available and how the Cochrane Collaboration emerged as a response. But professionals outside medicine and healthcare also seek to intervene beneficially in people's lives, and their interventions may have far-reaching social and behavioral consequences even if they are not matters of life and death. The upshot is that, in direct imitation of the Cochrane, the international Campbell Collaboration (C2) was formed in 1999 to facilitate systematic reviews of intervention effects in areas familiar to many of the readers of this handbook. Cooper again:

The inaugural meeting of the Campbell Collaboration was held in Philadelphia, Pennsylvania, on February 24 and 25, 2000. Patterned after the Cochrane Collaboration, and championed by many of the same people, The Campbell Collaboration aims to bring the same quality of systematic evidence to issues of public policy as the Cochrane does to health care. It seeks to help policy makers, practitioners, consumers, and the general public make informed decisions by preparing, maintaining, and promoting access to systematic reviews of studies on the effects of public policies, social interventions, and educational practices. (2000)

The Campbell Collaboration was named after the American psychologist and methodologist Donald Campbell, who drew attention to the need for society to assess more rigorously the effects of social and educational experiments. These experiments take place in education, delinquency and criminal justice, mental health, welfare, housing, and employment, among other areas.

Like its predecessor, the Campbell Collaboration offers specialized databases to meta-analysts (the C2 Social, Psychological, Education, and Criminological Trials Registry, and the C2 Reviews of Interventions and Policy Evaluations) and has organized what it calls coordinating groups to supervise the preparation of reviews in specialties of these topical areas.

Also like its predecessor, C2 has a high presence on the Web. For example, the Nordic Campbell Center (2006) is strong on explanatory Web pages, such as this definition of a systematic review:

Based on one question, a review aims at the most thorough, objective, and scientifically based survey possible, collected from all available research literature in the field. The reviews collocate the results of several studies, which are singled out and evaluated from specific systematic criteria. In this way, the reviews give a thorough aperçu of a specific question posed and answered by a variety of studies. In some cases, statistical methods (meta analysis) is employed to analyse and summarize the results of the comprised studies. (2006)

This accords with my 1994 account, which viewed systematic reviews as a way of coping with information overload:

What had been merely too many things to read became a population of studies that could be treated like the respondents in survey research. Just as pollsters put questions to people, the new research synthesists could interview existing studies (Weick 1970) and systematically record their attributes. Moreover, the process could become a team effort, with division and specialization of labor (for example, a librarian retrieves abstracts of studies, a project director chooses the studies to be read in full and creates the coding sheet, graduate students do the attribute coding). (42-43)

But whereas I pictured lone teams improvising, the Campbell Collaboration (2006) has now put forward crisply standardized procedures, as in this table of contents
describing what the methods section of a C2 protocol should contain:

- criteria for inclusion and exclusion of studies in the review
- search strategy for identification of relevant studies
- description of methods used in the component studies
- criteria for determination of independent findings
- details of study coding categories
- statistical procedures and conventions
- treatment of qualitative research

Despite occasional resistance and demurrals (such as Wallace et al. 2006), the new prescriptiveness seems to have helped reviewers. Figure 4.1 combined data from both SSCI and the much larger SCI. For readers whose specialties are better defined by the Campbell Collaboration than the Cochrane, figure 4.2 breaks out the almost 8,000 articles that were covered by SSCI and plots their growth from 1981 to 2005. Again, the impression is of great vigor in the years since the first edition of this handbook and since the founding of C2.

Figure 4.2 Growth of Meta-Analytic Literature in SSCI. [Line graph: articles per year, 1981-2005, rising from near zero to roughly 1,000.] source: Author's compilation.

For the same SSCI data, journals producing more than 25 articles during 1981-2005 are rank-ordered in table 4.1. Psychology and psychiatry journals predominate. SSCI selectively covers a fair number of medical journals, and some of those are seen as well, but topics extend beyond medicine and psychology. This is even plainer in the hundreds of journals not shown.

Although the problems of literature retrieval are never entirely solved, the Cochrane and Campbell materials make it likelier they will be addressed in research synthesis from now on. It is not that those materials contain anything new about literature searching; librarians, information scientists, and individual methodologists have said it all before. The first edition of this handbook said it, and so have later textbooks (see, for example, Fink 2005; Petticrew and Roberts 2006). The difference is that power relations have changed. Whereas the earlier writers could advise but not prescribe, their views are gradually being adopted by scientists who, jointly, do have prescriptive power and who also have a real constituency, the evidence-based practitioners, to back them up. Because scientists communicate not only results but norms, this new consensus strikes me as a significant advance.

4.6 RECALL AND PRECISION

In the vocabulary of information science, for which literature retrieval is a main concern, research synthesists are unusually interested in high recall of documents. Recall is a measure used in evaluating literature searches: it expresses (as a percentage) the ratio of relevant documents retrieved to all those in a collection that should be retrieved. The latter value is, of course, a fiction: if we could identify all existing relevant documents in order to count them, we could retrieve them all, and so recall would always be 100 percent. Difficulties of operationalization aside, recall is a useful fiction in analyzing what users want from document retrieval systems.

Another major desideratum in retrieval systems is high precision, where precision expresses (as a percentage) the ratio of documents retrieved and judged relevant to all those actually retrieved. Precision measures how many irrelevant documents (false positives) one must go through to find the true positives or hits.

Precision and recall tend to vary inversely. If one seeks high recall (complete or comprehensive searches), one must be prepared to sift through many irrelevant documents, thereby degrading precision. Conversely, retrievals can be made highly precise, so as to cut down on false positives, but at the cost of missing many relevant documents scattered through the literature (false negatives), thereby degrading recall.
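Because both measures are simple ratios, they can be stated compactly in code. The following minimal sketch in Python, with invented document identifiers, assumes the set of truly relevant documents is known; in real searches that set is exactly the fiction described above, and it is fixed here only for illustration:

def recall_and_precision(retrieved, relevant):
    # Recall: percentage of all relevant documents that were retrieved.
    # Precision: percentage of retrieved documents that are relevant.
    hits = retrieved & relevant                    # true positives
    recall = 100 * len(hits) / len(relevant)
    precision = 100 * len(hits) / len(retrieved)
    return recall, precision

# A broad, high-recall search: three of the four relevant documents
# are found, but five irrelevant ones must be sifted out by hand.
retrieved = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}
relevant = {"d1", "d2", "d3", "d9"}
print(recall_and_precision(retrieved, relevant))   # (75.0, 37.5)

Narrowing the search statement would shrink the retrieved set and raise precision, but at the risk of losing d1, d2, or d3 and so lowering recall: the inverse relation just described.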
Table 4.1 SSCI Journals with More than Twenty-Five Articles Relevant to Meta-Analysis, 1981-2005

Articles  Journal
195  Psychological Bulletin
179  Journal of Applied Psychology
136  Schizophrenia Research
 84  Journal of Advanced Nursing
 82  Personnel Psychology
 76  Journal of Consulting and Clinical Psychology
 68  Review of Educational Research
 67  American Journal of Psychiatry
 66  Addiction
 64  Clinical Psychology Review
 58  British Journal of Psychiatry
 55  British Medical Journal
 54  Educational and Psychological Measurement
 51  Journal of Personality and Social Psychology
 49  Clinical Psychology: Science and Practice
 49  Journal of Organizational Behavior
 45  Acta Psychiatrica Scandinavica
 44  Evaluation & the Health Professions
 44  Journal of Management
 44  Patient Education and Counseling
 42  Psychological Reports
 42  Social Science & Medicine
 42  Value in Health
 41  American Journal of Preventive Medicine
 41  Journal of Affective Disorders
 41  Journal of the American Geriatrics Society
 40  Journal of Applied Social Psychology
 40  Journal of Occupational and Organizational Psychology
 40  Perceptual and Motor Skills
 39  Journal of Clinical Psychiatry
 38  JAMA: Journal of the American Medical Association
 38  Journal of Vocational Behavior
 38  Nursing Research
 35  International Journal of Psychology
 35  Psychological Medicine
 34  American Psychologist
 34  Psychological Methods
 34  Psychology and Aging
 32  Online Journal of Knowledge Synthesis for Nursing
 31  Academy of Management Journal
 31  Health Psychology
 31  Medical Care
 31  Sex Roles
 30  Personality and Social Psychology Bulletin
 29  Human Relations
 29  Journal of Psychosomatic Research
 28  American Journal of Public Health
 28  Australian and New Zealand Journal of Psychiatry
 28  Journal of Parapsychology
 28  Journal of Research in Science Teaching
 28  Personality and Individual Differences
 27  Child Development
 27  Gerontologist
 27  Psychosomatic Medicine
 26  Accident Analysis and Prevention
 26  Journal of Business Research

source: Author's compilation.
Most literature searchers actually want high-precision retrievals, preferring relatively little bibliographic output to scan and relatively few items to read at the end of the judgment process. This is one manifestation of least effort, an economizing behavior often seen in information seekers (Mann 1993, 91-101). The research synthesists are distinctive in wanting (or at least accepting the need for) high recall. As Jackson put it, "because there is no way of ascertaining whether the set of located studies is representative of the full set of existing studies on the topic, the best protection against an unrepresentative set is to locate as many of the existing studies as is possible" (1978, 14). Bert Green and Judith Hall called such attempts "mundane and often tedious" but "of the utmost importance" (1984, 46). Some research synthesists doubt whether comprehensive searches are worth the effort (for example, Laird 1990), but even they seem to have more rigorous standards for uncovering studies than librarians and information specialists typically encounter. The only other group likely to be as driven by a need for exhaustiveness are PhD students in the early stages of their dissertations.

The point is not to track down every paper that is somehow related to the topic. Research synthesists who reject this idea are quite sensible. The point is to avoid missing a useful paper that lies outside one's regular purview, thereby ensuring that one's habitual channels of communication will not bias the results of studies obtained by the search. In professional matters, most researchers find it hard to believe that their habitual channels, such as subscriptions to certain journals and conversations with certain colleagues, can fail to keep them fully informed. But the history of science and scholarship is full of examples of mutually relevant specialties that were unaware of
each other for years because their members construed their own literatures too narrowly and failed to ask what other researchers, perhaps with different technical vocabularies (Grupp and Heider 1975), were doing in a similar vein.

More to the point, it is likely that one has missed writings in one's own specialty. The experience of the research team reported in Joel Greenhouse and his colleagues shows that even the sources immediately at hand may produce an unexpectedly large yield (1990). The further lesson to be drawn is that, just as known sources can yield unknown items, there may also be whole sources of which one is unaware. A case can be made that research synthesists primarily concerned with substantive and methodological issues should get professional advice in literature retrieval, just as they would in statistics or computing if a problem exceeded their expertise (Wade et al. 2006).

4.7 IMPROVING THE YIELD

The most obvious way to improve recall is for researchers to avail themselves more of bibliographic databases (often called abstracting and indexing or A and I services, or reference databases). Harris Cooper described fifteen ways in which fifty-seven authors of empirical research reviews actually conducted literature searches (1985, 1987). Most preferred to trace the references in review papers, books, and nonreview papers already in hand (presumably many were from their own files) and to ask colleagues for recommendations. These are classic least-effort strategies to minimize work and maximize payoff in literature retrievals. For those purposes, they are unexceptionable; that is why academics and other intelligent people use them so often. Nevertheless, they are also almost guaranteed to miss relevant writings, particularly if literatures are large and heterogeneous. Moreover, because they are unsystematic, they cannot be replicated.

How many modes of searching are there? In an independent account from library and information science, Patrick Wilson identified five major ones (1992). Harris Cooper illustrated all of them. In Mann (1993), a Library of Congress reference librarian names eight modes as an efficient way to teach people how to do literature retrieval, but his eight are simply variants on Cooper's fifteen. They, too, illustrate Wilson's five categories, even though Thomas Mann wrote independently of either him or Cooper. Together, the three accounts illustrate Wilson's 1977 claim that writings in combination may yield new knowledge not found in any of them separately.

To show these experts converging, table 4.2 places Cooper's and Mann's modes verbatim under Wilson's headings. Discussing computer searching, Mann combines citation databases with other kinds, and I have put them with his example of citation searches in printed sources; otherwise, his text is unaltered. Even among those who use these techniques, however, it is an open question as to whether all likely sources are being consulted; researchers are often unaware of the full range of bibliographic tools available. Nowadays, the Cochrane and Campbell databases would be primary places to look, but they were not available when Cooper was doing his study, and even now they may not figure in some researchers' plans.

The retrieval modes in table 4.2 pretty well exhaust the possibilities. All ought to be considered, and most tried, by someone striving for high recall of documents (Beahler et al. 2000; Conn et al. 2003; Schlosser et al. 2005). The different modes can be used concurrently, thus saving time. Research synthesists disinclined to go beyond their favorite few can still extend their range by delegating searches to information specialists, including librarians. This would particularly hold for less favored strategies, such as searching reference and citation databases, browsing library collections, and discovering topical bibliographies.

The other obvious way to improve recall is to do what are sometimes called forward searches in the citation indexes of Thomson Scientific, now available in the Web of Science and on Dialog. The footnote chasing entries in table 4.2 reflect citation searches that go backward in time; one moves from a known publication to the earlier items it cites. In the contrasting kind of search in Dialog or the Web of Science, one moves forward in time by looking up a known publication and finding the later items that cite it. Although the citation indexes will not infallibly yield new items for one's synthesis, it is fatuous to ignore them; that would be roughly equivalent to not keeping abreast of appropriate statistical techniques.

The result of accepting higher retrieval standards is that, rather than veiling one's strategies, one can state them candidly in the final report as documentation (Rothstein, Turner, and Lavenberg 2004; Campbell Education Coordinating Group 2005; Wade et al. 2006). Among other benefits, such statements help prevent unnecessary duplication of effort by later searchers, because they can replicate the original searcher's results. Whether these
statements appear in the main text or as an endnote or appendix, Bates argued that they should include

- notes on domain: all the sources used to identify the studies on which a research synthesis is based (including sources that failed to yield items although they initially seemed plausible)
- notes on scope: subject headings or other search terms used in the various sources; geographic, temporal, and language constraints
- notes on selection principles: editorial criteria used in accepting or rejecting studies to be meta-analyzed. (1992a)

Because the thoroughness of the search depends on knowledge of the major modes for retrieving studies, a more detailed look at each of Wilson's categories follows. Cooper's and Mann's examples appear in italics as they are woven into the discussion. The first two modes, footnote chasing and consultation, are understandably attractive to most scholars, but may be affected by personal biases more than the other three, which involve searching relatively impersonal bibliographies or collections; and so the discussion is somewhat weighted in favor of the latter. It is probably best to assume that all five are needed, though different topics may require them in different proportions.

Table 4.2 Five Major Modes of Searching

Footnote Chasing
  Cooper 1985:
    References in review papers written by others
    References in books by others
    References in nonreview papers from journals you subscribe to
    References in nonreview papers you browsed through at the library
    Topical bibliographies compiled by others
  Mann 1993:
    Searches through published bibliographies (including sets of footnotes in relevant subject documents)
    Related Records searches

Consultation
  Cooper 1985:
    Communication with people who typically share information with you
    Informal conversations at conferences or with students
    Formal requests of scholars you knew were active in the field (e.g., solicitation letters)
    Comments from readers/reviewers of past work
    General requests to government agencies
  Mann 1993:
    Searches through people sources (whether by verbal contact, E-mail, electronic bulletin board, letters, etc.)

Searches in Subject Indexes
  Cooper 1985:
    Computer search of abstract data bases (e.g., ERIC, Psychological Abstracts)
    Manual search of abstract data bases
  Mann 1993:
    Controlled-vocabulary searches in manual or printed sources
    Key word searches in manual or printed sources
    Computer searches, which can be done by subject heading, classification number, key word . . .

Browsing
  Cooper 1985:
    Browsing through library shelves
  Mann 1993:
    Systematic browsing

Citation Searches
  Cooper 1985:
    Manual search of a citation index
    Computer search of a citation index (e.g., SSCI)
  Mann 1993:
    Citation searches in printed sources
    Computer searches by citation

source: Adapted from Cooper (1985), Wilson (1992), Mann (1993). The boldfaced headings are from Wilson.

4.8 FOOTNOTE CHASING

This is the adroit use of other authors' footnotes, or, more broadly, their references to the prior literature on a topic. Trisha Greenhalgh and Richard Peacock discussed the use of snowballing: following up the references of documents retrieved in an ongoing literature search to accumulate still more documents (2005). The reason research synthesists like footnote chasing is that it allows them to find usable primary studies almost immediately. Moreover, the footnotes of a substantive work do not come as unevaluated listings (like those in an anonymous bibliography), but as choices by an author whose critical judgment one can assess in the work itself. They are thus more like scholarly intelligence than raw data, especially if one admires the author who provides them.

Footnote chasing is obviously a two-stage process: to follow up on someone else's references, the work in which they occur must be in hand. Some reference-bearing works will be already known; others must be discovered. Generally, there will be a strong correlation
between familiarity of works and their physical distance from the researcher; known items will tend to be nearby (for example, in one's office), and unknown items, farther away (for example, in a local or nonlocal library collection, or undiscovered on the Web).

The first chasing many researchers do is simply to assemble the publications they already know or can readily discover. Indeed, from the standpoint of efficiency, the best way to begin a search is, first, to pick studies from one's own shelves and files and, second, to follow up leads to earlier work from their reference sections. Cooper's example was references in nonreview papers from journals you subscribe to (1985).

According to Green and Hall, meta-analysts doing a literature search should page through the volumes of the best journals for a topic year by year, presumably not only those to which they personally subscribe, but also those in the library (1984, 46). An alerting service like Thomson Scientific's Current Contents can be used as well. Despite its apparent cost in time, this hand searching may actually prove rather effective, given the small and focused set of titles in any particular table of contents. Items found in this fashion may lead as a bonus to earlier studies, through what Cooper called references in nonreview papers you browsed through at the library (1985).

Existing syntheses are probably the type of works most likely to be known to a researcher. If not, the first goal in a literature search should be to discover previous review papers or books by others on a topic, because they are likely to be both substantively important in their own right and a rich source of references to earlier studies. There is a burgeoning literature on the retrieval of systematic reviews (see, for example, Harrison 1997; Boynton et al. 1998; White et al. 2001; Montori et al. 2004; InterTASC Information Specialists Sub-Group 2006).

Another goal should be to discover topical bibliographies compiled by others. The entries in free-standing bibliographies are not footnotes, and they may be harder than footnotes to judge for relevance (especially if abstracts are lacking). Nevertheless, if a bibliography exists, tracking down its preassembled references may greatly diminish the need for further retrievals (compare Wilson 1992; Bates 1992a). Mann has a nice chapter on bibliographies published in book form (1993, 45-56).

A way of searching online for reviews and bibliographies will be given later. A good mindset at the start of a search assumes the existence of at least one that is unknown. It is also important to ascertain the nonexistence of such publications (if that is the case), given that one's strategy of presentation will depend on whether one is the first to survey the literature of a topic.

Mann's Related Records searches are a sophisticated form of footnote chasing that became practical only with computerization. They make it possible to find articles that cite identical references. When viewing the full record of a single article in the Web of Science, one will see the number of references it makes to earlier writings, the number of times it has been cited by later articles, and a button, Find Related Records. Clicking the button will retrieve all articles that refer to at least one of the same earlier writings. They are ranked high to low by the number of references they share with the seed article. (Information scientists call this their bibliographic coupling strength.) According to the help screen,

the assumption behind Related Records searching is that articles that cite the same works have a subject relationship, regardless of whether their titles, abstracts, or keywords contain the same terms. The more cited references two articles share, the closer this subject relationship is.

The shared cited works may themselves be retrievable through the Web of Science interface.
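Bibliographic coupling strength is nothing more than the size of the overlap between two reference lists, and the Related Records ranking is a descending sort on that overlap. A minimal sketch in Python (the article labels and reference lists are invented; Thomson's actual data and tie-breaking rules are not public):

def coupling_strength(refs_a, refs_b):
    # Number of cited references the two articles share.
    return len(refs_a & refs_b)

# Invented reference lists, keyed by article label.
references = {
    "article_x": {"Smith 1970", "Jones 1981", "Brown 1990"},
    "article_y": {"Price 1986", "White 2001"},
    "article_z": {"Brown 1990", "Gray 1999"},
}
seed = {"Smith 1970", "Jones 1981", "Price 1986"}  # the seed article's references

# Rank candidates by shared references, strongest coupling first.
related = sorted(references,
                 key=lambda a: coupling_strength(seed, references[a]),
                 reverse=True)
for article in related:
    strength = coupling_strength(seed, references[article])
    if strength > 0:
        print(article, strength)   # article_x 2, then article_y 1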
Footnote chasing is an inviting method, so inviting that the danger lies in stopping with it, on the pretext that any other method will quickly produce diminishing returns. If one is serious about finding primary studies to consider, one will not assume without checking that they do not exist or that diminishing returns begin outside one's office door. The substantive reason for concern is that one may fail to capture the full range of studies, and reported effects, that exist. Just as one's own footnotes reflect personal biases, so does one's collection of books and journals. The authors whose footnotes one pursues will tend to cite works compatible with biases of their own. In journals, the key publications in science, this tendency produces citation networks that link some journals tightly and others loosely or not at all; the result is specialties and disciplines that fail to communicate despite interests in common (for an example, see Rice 1990). Thus footnote chasing may simply reinforce the homogeneity of one's findings, and other methods are required to learn whether unknown but relevant writings exist. The next strategy is only a partial solution, as we shall see.

4.9 CONSULTATION

Most researchers trust people over bibliographies for answers on what is worth reading. Wilson's (1992)
consultation, finding usable studies by talking and writing to others rather than by literature searching, is illustrated in table 4.2. Cooper gives examples such as informal conversations at conferences or with students, and Mann mentions ways of getting in touch. Actually, one is still searching bibliographies; they are simply inside people's heads. Everyone, including the most learned, does this; there is no more practical or fruitful way of proceeding.

The only danger lies in reliance on a personal network to the exclusion of alternate sources, a danger similar to overreliance on a personal library. Bibliographies, however dull, will help one search a literature more thoroughly than colleagues, however responsive.

A Google search on "meta-analysis" and "experts" (one may add "consult" or "contact") reveals that many papers now add consultation of experts to their search strategies, not as something desirable but as something the authors successfully did. (Related search terms would no doubt add more examples.) Some papers mention finding unpublished work through this means, but there is no ready summary of what expert advice led to. A brief article in support of consultation claimed that, on the basis of a survey of experts for a review of primary medical care, forty of the references selected for review were identified by domain experts (McManus et al. 1998). "Of the 102 unique eligible references," they wrote, "50 (49%) were identified by one of the electronic databases, 40 (39%) by people working in the field, and 31 (30%) by hand searches" (1562). The percentages exceed 100 because some references were identified by more than one means. "Twenty four references," they added, "would have been missed entirely without the input of people working in the field." A librarian complained that their article, billed as a review, was rather a case study that "knocks down a straw man no librarian has ever erected" (Due 1999). R. J. McManus and colleagues countered (following Due's letter) that what is self-evident to librarians is not necessarily so to clinicians. This may be true; one does hear of innocents who think that database or Web searches retrieve everything that exists on a topic.

The quality of response that researchers can get by consulting depends, of course, on how well connected they are, and also perhaps on their energy in seeking new ties. Regarding the importance of personal connections, a noted author on scientific communication, the late Belver Griffith, told his students, "If you have to search the literature before undertaking research, you are not the person to do the research." He was being only slightly ironic. In his view, you may read to get to a research front, but you
cannot stay there by waiting for new publications to appear; you should be in personal communication with the creators of the literature and other key informants before your attempt at new research, or synthesis, begins. Some of these people who typically share information with you may be local, that is, colleagues in your actual workplace with whom face-to-face conversations are possible every day. Others may be geographically dispersed but linked through telephone calls, e-mail, attendance at conferences and seminars, and so on, in your invisible college (Cronin 1982; Price 1986).

Diana Crane described invisible colleges as social circles whose members intercommunicate because they share an intense interest in a set of research problems (1972). Although the exact membership of an invisible college may be hard to define, a nucleus of productive and communicative insiders can usually be identified. If one is not actually a member of this elite, one can still try to elicit their advice on the literature by making formal requests of scholars active in the field (for example, solicitation letters). Scholarly advice in the form of comments from readers/reviewers of past work seems less likely unless one is already well-published. The American Faculty Directory, the Research Centers Directory, and handbooks of professional organizations are useful when one is trying to find addresses or telephone numbers of persons to contact. Web searches are often surprisingly productive, although beset by the problem of homonymic personal names.

While consultation with people may well bring published writings to light, its unique strength lies in revealing unpublished writings (compare Royle and Milne 2003). To appreciate this point, one must be clear on the difference between published and unpublished. Writings such as doctoral dissertations and Educational Resources Information Center (ERIC) reports are often called unpublished. Actually, any document for which copies will be supplied to any requestor has been published in the legal sense (Strong 1990), and this would include reports, conference proceedings, and dissertations. It would also include some machine-readable data files. The confusion occurs because people associate publication with a process of editorial quality control; they write "unpublished" when they should write "not independently edited" or "unrefereed." Documents such as reports, dissertations, and data files are published, in effect, by their authors (or other relatively uncritical groups), but they are published nonetheless. Conference proceedings, over which editorial quality control may vary, have a similar status. Preprints, manuscripts sent to requestors after being accepted by an editor, are simply not published yet.

Many printed bibliographies and bibliographic databases cover only published documents. Yet research synthesists place unusual stress on including effects from unpublished studies in their syntheses, to counteract a potential bias among editors and referees for publishing only effects that are statistically significant (see chapter 6 and chapter 23, this volume; for broader views, see Chalmers et al. 1990; McAuley et al. 2000; Rothstein, Sutton, and Borenstein 2005). Documents of limited distribution (technical reports, dissertations, and other grey literature) may be available through standard bibliographic services on the Web (such as Dissertation Abstracts International and National Technical Information Service). To get documents or data that are truly unpublished, one must receive them as a member of an invisible college (for example, papers in draft from distant colleagues) or conduct extensive solicitation campaigns (for example, general requests to government agencies). There is no bibliographic control over unreleased writings in file drawers except the human memories one hopes to tap (see chapter 6, this volume).

Consultation complements footnote chasing, in that it may lead to studies not referenced because they have never been published; and, like footnote chasing, it produces bibliographic advice that is selective rather than uncritical. But also like footnote chasing, it may introduce bias into one's search. The problem again is too much homogeneity in the studies that consultants recommend. As Cooper put it, "A synthesist gathering research solely by contacting hubs of a traditional invisible college will probably find studies that are more uniformly supportive of the beliefs held by these central researchers than are studies gathered from all sources. Also, because the participants in an invisible college use one another as a reference group, it is likely that the kinds of operations and measurements employed in the members' research will be more homogeneous than those employed by all researchers who might be interested in a given topic" (1998, 47). These observations should be truer still of groups in daily contact, such as colleagues in an academic department or teachers and their students.

The countermeasure is to seek heterogeneity. The national conferences of research associations, for example, are trade fairs for people with new ideas. Their bibliographic counterparts (big, open, and diversified) are the national bibliographies, the disciplinary abstracting and indexing services, the online library catalogs, the classified library stacks. What many researchers would regard as the weakness of these instruments, that they are so editorially inclusive, may be for research synthesists their greatest strength. We turn now to the ways in which one can use large-scale, critically neutral bibliographies to explore existing subject literatures, as a complement to the more selective approaches of footnote chasing and consultation.

4.10 SEARCHES WITH SUBJECT INDEXING

Patrick Wilson called subject searching the strategy of approaching the materials we want indirectly "by using catalogs, bibliographies, indexes: works that are primarily collections of bibliographical descriptions with more or less complete representations of content" (1992, 156). Seemingly straightforward, retrieval of unknown publications by subject has in fact provided design problems to library and information science for over a century. The main problem, of course, is to effect a match between the searcher's expressed interest and the documentary description.

Only a few practical pointers can be given here. The synthesist wanting to improve recall of publications through a computer search of abstract databases (for example, ERIC, Psychological Abstracts) needs to understand the different ways in which topical literatures may be broken out from larger documentary stocks. One gets what one asks for, and so it helps to know that different partitions will result from different ways of asking. What follows is intended to produce greater fluency in the technical vocabulary of document retrieval and to provide a basis for the next chapter on computerized searching (see chapter 5, this volume; see also Cooper 1998; Grayson and Gomersall 2003; Thames Valley Literature Search Standards Group 2006).

Different kinds of indexing bind groups of publications into literatures. These are authors' natural-language terms, indexers' controlled-vocabulary terms, names of journals (or monographic series), and authors' citations. While all are usable separately in retrievals from printed bibliographic tools, it is important to know that they can also be combined, to some degree, in online searching in attempts to improve recall or precision or both. If the research synthesist is working with a professional searcher, all varieties of terms should be discussed in the strategy-planning interview. Moreover, the synthesist would do well to try out various search statements personally, so as to learn how different expressions perform. Mann, the librarian,
brings out a distinction not in Cooper between keyword searches and controlled-vocabulary searches.

4.10.1 Natural-Language Terms

As authors write, they manifest their topics with terms such as aphasia, teacher burnout, or criterion-referenced education in titles, abstracts, and full texts. Because these terms are not assigned by indexers from controlled vocabularies but emerge naturally from authors' vocabularies, librarians call them natural language or keywords. (Keywords is a slippery term. By it, librarians usually mean all substantive words in an author's text. Authors, in contrast, may mean only the five or six phrases with which they index their own writings at the request of editors. Retrieval systems designers often use it to mean all nonstoplisted terms in a database, including the controlled vocabulary that indexers add. It can also mean the search terms that spontaneously occur to online searchers. More on this later.) Retrieval systems such as Dialog are capable of finding any natural-language terms of interest in any or all fields of the record. Thus, insofar as all documents with teacher burnout in their titles or abstracts constitute a literature, that entire literature can be retrieved. The Web also now makes it possible to retrieve documents on the basis of term occurrence in authors' full texts.

Google has so accustomed people to searching with natural language off the top of their heads that they may not realize there is any other way. Other ways include controlled-vocabulary and forward-citation retrievals. The Google search algorithm, though clever, cannot yet distinguish among different senses of the same natural-language query, nor can it expand on that query to pick up different wordings of the same concept, such as gentrification versus renewal versus revitalization, or older people versus senior citizens versus old-age pensioners. The first failure explains why Google retrievals are often both huge and noisy; the second, why Google retrievals may miss so much potentially valuable material. The first is a failure in precision; the second, a failure in recall. Long before Google appeared, a famous study by David Blair and M. E. Maron showed that lawyers who trusted computerized natural-language retrieval to bring them all documents relevant to a major litigation got only about one-fifth of those that were relevant (1985). The lawyers thought they were getting three-fourths or more. Google taps the largest collection of documents in history, a wonderful thing, but it has not solved the problem that the Blair-Maron study addresses.

Research synthesists have a special retrieval need in that they generally want to find empirical writings with measured effects and acceptable research designs (compare Cooper and Ribble 1989). This is a precision problem within the larger problem of achieving high recall. That is, given a particular topic, one does not want to retrieve literally all writings on it, mainly those with a certain empirical content. Therefore, the strategy should be to specify the topic as broadly as possible (with both natural-language and controlled-vocabulary terms), but then to qualify the search by adding terms designed to match very specific natural language in abstracts, such as findings, results, ANOVA, random-, control-, t test, F test, and correlat-. Professional searchers can build up large groupings of such terms (called hedges), with the effect that if any one of them appears in an abstract, the document is retrieved (Bates 1992b). The Cochrane Handbook devotes an appendix to hedges, and the Cochrane review groups add specific ones (Higgins and Green 2005).

This search strategy assumes that abstracts of empirical studies state methods and results, which is not always true. It would thus be most useful in partitioning a large subject literature into empirical and nonempirical components. Of course, it would also exclude empirical studies that lacked the chosen signals in their abstracts, and so would have to be used carefully. Nevertheless, it confers a potentially valuable power.
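In Boolean terms, a hedge is simply a large OR group of methodological signal words ANDed onto the topic specification. A Python sketch of how such a statement might be assembled (the topic terms are invented, the signal words are those listed above, and the trailing hyphens stand for truncation; actual truncation syntax varies by retrieval system):

# Methodological signal words; a trailing hyphen marks a truncated stem.
hedge = ["findings", "results", "ANOVA", "random-", "control-",
         "t test", "F test", "correlat-"]
topic = ["teacher burnout", "teacher stress"]   # invented topic terms

def or_group(terms):
    # Parenthesized OR group: matching any one term suffices.
    return "(" + " OR ".join(terms) + ")"

# Broad topic specification qualified by the hedge: a record is retrieved
# if it matches the topic and at least one methodological signal word.
statement = or_group(topic) + " AND " + or_group(hedge)
print(statement)
# (teacher burnout OR teacher stress) AND (findings OR results OR ...)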
4.10.2 Controlled Vocabulary

These are the terms added to the bibliographic record by the employees of abstracting and indexing services or large research libraries; broadly speaking, indexers. The major reason for controlled vocabulary in bibliographic files is that authors' natural language in titles, abstracts, and full texts may scatter related writings rather than bringing them together. Controlled vocabulary is supposed to counteract scattering by re-expressing the content of documents with standard headings, thereby creating literatures on behalf of searchers who would otherwise have to guess how authors might express a subject. (Blair and Maron's lawyers failed on just this count, and they are not alone.)

Controlled vocabulary consists of such things as classification codes and subject headings for books, and descriptors for articles and reports. Classification codes reflect a hierarchically arranged subject scheme, such as that of the Library of Congress. With occasional
exceptions, they are assigned one to a book, so that each book will have a single position in collections arranged for browsing. On the other hand, catalogers may assign more than one heading per book from the Library of Congress Subject Headings. Usually, however, they assign no more than three, because tables of contents and back-of-the-book indexes are presumed as complements.

Subject headings are generally the most specific terms (or compounds of terms, such as Reference services--Automation--Bibliographies) that match the scope of an entire work. In contrast, descriptors, taken from a list called a thesaurus, name salient concepts in writings rather than characterizing the work as a whole. Unlike compound subject headings, which are combined by indexers before any search occurs, descriptors are combined by the searcher at the time of a computerized search. Because the articles and reports they describe often lack the internal indexes seen in books, they are applied more liberally than subject headings, eight or ten being common.

Descriptors are important because they are used to index the scientific journal and report literatures that synthesists typically want, and a nodding familiarity with such tools as the Thesaurus of ERIC Descriptors, the Medical Subject Headings, or the Thesaurus of Psychological Index Terms will be helpful in talking with librarians and information specialists when searches are to be delegated. Thesauri, created by committees of subject experts, enable one to define research interests in standardized language. They contain definitions of terms (scope notes), cues from nonpreferred terms to preferred equivalents (for criterion-referenced education use competency-based education), and links between preferred terms (aversion therapy is related to behavior modification). Their one drawback is that, because the standardization of vocabulary takes time, they are always a bit behind authors' expressions in natural language. Evidence-based medicine, for example, was not a MeSH (Medical Subject Headings) term until 1997 (Harrison 1997), though it appeared in journals at least five years earlier. Therefore, anyone doing a search should be able to combine both descriptors (or other controlled vocabulary) and natural language to achieve the desired fullness of recall.
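The use and related-term relations just described behave like a small lookup table. A toy Python sketch, seeded only with the two examples from the text (a working thesaurus such as ERIC's holds thousands of editorially maintained entries):

# Nonpreferred terms point to preferred descriptors ("use" references),
# and preferred descriptors list their related terms.
use = {"criterion-referenced education": "competency-based education"}
related = {"aversion therapy": ["behavior modification"]}

def expand(term):
    # Map a searcher's term to its preferred descriptor, then attach
    # related descriptors to broaden recall.
    preferred = use.get(term, term)
    return [preferred] + related.get(preferred, [])

print(expand("criterion-referenced education"))
# ['competency-based education']
print(expand("aversion therapy"))
# ['aversion therapy', 'behavior modification']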
The social sciences present searchers with graver retrieval problems than medicine. According to Lesley Grayson and Alan Gomersall, these include "a more diverse literature; the greater variety and variability of secondary bibliographical tools; the increasing availability of material on the internet; and a less precise terminology" (2003, 2). Abstracts may be missing or inadequate. Some literatures may lack vocabulary control altogether. If thesauri exist, they may be inconsistent in the descriptors and other indexing features they offer, and indexers may apply them inconsistently. (For example, some indexers might apply, and others omit, terms indicating research methods.) An upshot of these failings is that searchers may find it harder to distinguish analyzable studies in the social sciences than in medicine.

Supplements to descriptors include identifiers (specialized terms not included in the thesaurus) and document types (which partition literatures by genres, such as article or book review, rather than content). Being able to qualify an online search by document type allows research synthesists to break out one of their favorite forms, past reviews of research, from a subject literature. For example, Marcia Bates united descriptors and document types in this statement in an ERIC search on Dialog (1992b):

Select Mainstreaming AND (Deafness OR Hearing Impairments OR Partial Hearing) AND (Literature Reviews OR State of the Art Reviews)

One could add Annotated Bibliographies to the expression if that document type was also wanted. In the Web of Science, document types such as reviews can be retrieved by checkbox. As noted, relevant review papers and bibliographies are among the first things one should seek when preparing a new synthesis.

Raya Fidel stated that some online searchers routinely avoid descriptor lookups in thesauri in favor of natural language (1991). That was before the Web and Google; her observation would be more widely true today. Few nonprofessional searchers ever learn to distinguish between natural language and controlled vocabulary; everything is just keywords, a blurring that system designers perpetuate. For example, in the Cochrane Database of Systematic Reviews, the box for entering search terms is labeled Keywords, and so is the field in bibliographic records where reviews are characterized by subject. The latter is actually filled for the most part with MeSH descriptors, which are definitely controlled vocabulary. But one can also retrieve Cochrane records with almost any keyword from natural language. The word steadily, obviously not a MeSH descriptor, retrieves sixty as I write, for example. Contemporary retrieval systems reflect the principle of least effort for the greatest number; they cater to searchers who do not know about controlled vocabulary and probably would not use thesauri even if they did. But that does not mean that most searchers are good searchers
where both high recall and high precision are concerned. To retrieve optimally for a research synthesis, searchers should be able to exploit all the resources available to them.

4.10.3 Journal Names

When editors accept contributions to journals, or to monographic series in the case of books, they make the journal or series name a part of the writing's standard bibliographic record. Writings so tagged form literatures of a kind. Abstracting and indexing services usually publish lists of journals they cover, and many of the journal titles may be read as if they were broad subject headings.

To insiders, however, names such as American Sociological Review or Psychological Bulletin connote not only a subject matter but also a level of authority. The latter is a function of editorial quality control, and it can be used to rank journals in prestige, which implicitly extends to the articles appearing in their pages. Earlier, a manual search through the best journals was mentioned, probably for items thought to be suitably refereed before publication. Actually, if one wants to confine a subject search to certain journals, the computer offers an alternative. Journal and series names may be entered into search statements just like other index terms. Thus one may perform a standard subject or citation search but restrict the final set to items appearing in specific journals or series. For example:

Select Aphasia AND JN = Journal of Verbal Learning and Verbal Behavior

One might still have a fair number of abstracts to browse through, but all except those from the desired journal would be winnowed out.

4.11 BROWSING

Browsing, too, is a form of bibliographic searching. Browsing through library shelves is not one of the majority strategies in table 4.2. The referees for a draft of the present chapter claimed that no one browses in libraries anymore, only on the Web. But wherever one browses, the principle is the same:

- Library shelves: face book spines classed by librarians and then look for items of interest.
- Journals: subscribe to articles bundled by editors and then look for items of interest.
- Web: use search terms to form sets of documents and then look for items of interest.

Library classification codes, journal names, and terms entered into search engines are all ways of assembling documents on grounds of their presumed similarity. People who browse examine these collections in hopes they can recognize valuable items, as opposed to seeking known items by name. In effect, they let collections search them, rather than the other way around. Of course, this kind of openness has always been a minority trait; browsing is not simply one more strategy that research synthesists in general will adopt. Even delegating it to someone such as a librarian or information specialist is problematical, because success depends on recognizing useful books or other writings that were not foreknown, and people differ in what they recognize.

Patrick Wilson called browsing book stacks a good strategy "when one can expect to find a relatively high concentration of things one is looking for in a particular section of a collection; it is not so good if the items being sought are likely to be spread out thinly in a large collection" (1992, 156). The Library of Congress (or Dewey) classification codes assigned to books are supposed to bring subject-related titles together so that, for example, knowledge of one book will lead to knowledge of others like it. Unfortunately, in the system as implemented (diverse classifiers facing the topical complexity of publications over decades), it is common for dissimilar items to be brought together and similar items to be scattered. But Google's algorithms make these mistakes, too, and so do indexers of journal articles; one must be able to tolerate low-precision retrievals at best.

Still, serendipitous finds do take place. It is also true that research synthesists and their support staffs do more than gather data-laden articles for systematic reviews. They need background knowledge of many kinds, and even old-fashioned library browsing may further its pursuit. For example, several textbooks on meta-analysis can be found if one traces the class number of, say, Kenneth Wachter and Miron Straf's book (1990) to the H62 section (Social sciences (general). Study and teaching. Research.) in large libraries using the Library of Congress classification scheme.

4.12 CITATION SEARCHES

The last type of strategy identified in table 4.2 is, in Cooper's language, the manual or computer search of a citation index (for example, SSCI).
Authors citations make substantive or methodological citing articles. (Thomsons citation databases now have
links between studies explicit, and networks of cited and citing publications may thus constitute literatures. In citation indexes, as noted, earlier writings become indexing terms by which to trace the later journal articles that cite them. Whereas the earlier writings may be of any sort (books, articles, reports, government documents, conference papers, dissertations, films, and so on), the later writings retrieved will be drawn solely from the journals (or other serials) covered by Thomson Scientific; no nonjournal items, such as books, will be retrieved. Nevertheless, forward citation searching, still underutilized, has much to recommend it, because it tends to produce hits different from those produced by retrievals with natural language or controlled vocabulary. In other words, it yields writings related to one's topic that have relatively little overlap with those found by other means (Pao and Worthen 1989).

The reasons for lack of overlap are, first, that authors inadvertently hide the relevance of their work to other studies by using different natural language, which has a scattering effect, and, second, that indexers often fail to remedy this when they apply controlled vocabulary. Luckily, the problems of both kinds of terminology are partly corrected by citation linkages, which are vocabulary-independent (as noted in the earlier discussion of Related Records). They are also authors' linkages rather than indexers', and so presumably reflect greater subject expertise. Last, because Thomson's citation databases are multidisciplinary, they may cut across standard disciplinary lines. (In traditional footnote chasing, we saw, one retrieves many writings cited by one author, whereas in forward citation searching one starts with one writing and retrieves the many authors who cited it, which improves the chances for disciplinary diversity.) Hence, forward citation searching should be used with other retrieval procedures by anyone who wants to achieve high recall.
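The practical force of this low overlap is easy to demonstrate. The following sketch is purely illustrative: the record identifiers are invented rather than drawn from any actual database, but the set arithmetic shows how a searcher can gauge how many hits forward citation searching adds beyond a subject search.

```python
# Illustrative only: invented record identifiers, not output from any
# real database. The set arithmetic shows how to gauge the marginal
# yield of a second retrieval channel.

keyword_hits  = {"rec101", "rec102", "rec103", "rec104", "rec105"}
citation_hits = {"rec104", "rec105", "rec106", "rec107"}

overlap            = keyword_hits & citation_hits   # found by both channels
added_by_citations = citation_hits - keyword_hits   # unique to citation search

print(f"overlap: {len(overlap)} records")                              # 2
print(f"added by citation search: {len(added_by_citations)} records")  # 2
print(f"total pool: {len(keyword_hits | citation_hits)} records")      # 7
```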
Several kinds of forward citation searching are possible in Scisearch, Social Scisearch, and Arts and Humanities Search on Dialog. In printed form, these are respectively Science Citation Index, Social Sciences Citation Index, and Arts and Humanities Citation Index. Thomson's own Web of Science makes the same three databases available for joint or separate searching. One needs only modest bibliographic data to express one's interests: retrievable literatures can be defined as all items that cite a known document, a known author, or a known organization. If need be, such data can be discovered by searching initially on natural language from titles and abstracts of articles. (The Thomson databases also supply descriptors, but these are not governed by a publicly available thesaurus.) Title terms may also be intersected with names of cited documents or cited authors to improve precision.

Thomson is now testing a Web Citation Index for Web documents, but a similar tool, Google Scholar, has already won wide acceptance because of its ready availability. Given an entry term such as a subject description, an author's name, or the title of a document, Scholar will retrieve documents on the Web bearing that phrase as well as the documents that cite them. The documents retrieved are ranked high to low by the number of documents citing them, and the latter can themselves be retrieved through a clickable link. Because the Web is so capacious, retrievals are likely to include both journal articles and more informally published documents (such as dissertations or technical reports released in pdf) that have found their way onto the Web. This makes Scholar a wonderful device for extending one's reach in literature retrieval. Precisely because its use is so seductive, however, one should know that its lack of vocabulary control can make for numerous false drops.

Thomson's Web of Science databases also have problems of vocabulary control. For example, where authors are concerned, different persons can have the same name (homonyms), and the same person can have different names (allonyms, a coinage of White 2001). Homonyms affect precision, because works by authors or citees other than the one the searcher intended will be lumped together with it in the retrieval and must be disambiguated. The practice in these databases of using only initials for first and middle names worsens the problem. (For instance, "Lee AJ" unites what "Lee, Alan J." and "Lee, Amos John" would separate.) Allonyms affect recall, because the searcher may be unable to guess all the different ways an author's or citee's name has been copied into the database. The same author usually appears in at least two ways (for example, "Small HG" and "Small H"). Derek J. de Solla Price's tricky name is cited in more than a dozen ways.
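Allonym problems, at least, can be attacked systematically. The sketch below is a minimal illustration, in which the full given names are assumed for the sake of the example; it generates the surname-plus-initials forms under which one author may appear, so that all of them can be searched (typically OR-ed together). Homonyms, by contrast, can only be weeded out by inspecting the retrieved records.

```python
# A minimal sketch of allonym expansion: generating the
# surname-plus-initials forms under which one author may be cited.
# The given names here are assumed for illustration.

def name_variants(surname, given_names):
    initials = [name[0].upper() for name in given_names]
    # Cumulative initials yield "Small H", "Small HG", and so on.
    return {f"{surname} {''.join(initials[:i])}"
            for i in range(1, len(initials) + 1)}

print(sorted(name_variants("Small", ["Henry", "Gordon"])))
# ['Small H', 'Small HG']
```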
One other matter. Thomson Scientific is the former Institute for Scientific Information (ISI) in Philadelphia and continues ISI's policy of including, and excluding, journals in the citation indexes on the basis of their international citedness (Testa 2006). This imperfectly understood policy has often led to accusations of bias toward U.S. or North American or Anglo-American sources, since many thousands of the world's journals do not make the cut. It is true that the ISI/Thomson indexes have an English-language bias, because English is so dominant in the literatures of science and scholarship. Even so, leading journals in other languages are covered. Yet researchers routinely lament the absence of journals, in both English and other languages, that they deem important. Given the economic constraints on ISI/Thomson and its subscribers, many journals must always be omitted. New tools such as Google Scholar, Scopus, and the Chinese Social Sciences Citation Index will be needed to extend researchers' capabilities for retrieval across relatively less-cited journals, languages, and genres.

4.13 FINAL OPERATIONS

The synthesist's ideal in gathering primary studies is to have the best possible pool from which to select those finally analyzed. In practice, most would admit that the selection of studies has a discretionary element. Research syntheses are constructs shaped by individual taste and by expectations about the readership. Different synthesists interested in the same topic may differ on the primary studies to be included. If a particular set of studies is challenged as incomplete, a stock defense would claim that it was not cost-effective to gather a different set. The degree to which the synthesist has actually tried the various avenues open will determine whether such a defense has merit.

4.13.1 Judging Relevance

The comprehensiveness of a search may be in doubt, but it is likely that judging the resulting documents for relevance will be constrained by the fact that research synthesis requires comparable, quantified findings. The need for studies with data sharply delineates what is relevant and what is not: the presence or absence of appropriate analytical tables gives an explicit test. Of course, if nonquantitative studies are included, relevance judgments may be more difficult. Presumably, the judges will be scientists or advanced students with subject expertise, high interest in the literature to be synthesized, and adequate time in which to make decisions. Further, they will be able to decide on the basis of abstracts or even full texts of documents if titles leave doubts. That limits the potential sources of variation in relevance judgments to factors such as how documents are to be used (Cuadra and Katter 1967), order of presentation of documents (Eisenberg and Barry 1988), subject knowledge of judges, and amount of bibliographic information provided (Cooper and Ribble 1989). A major remaining cognitive variable is openness to information, as identified by David Davidson (1977). Although research synthesists as a group seem highly open to information, considerable variation still exists in their ranks: some are more open than others. This probably influences not only their relevance judgments, but their relative appetites for multimodal literature searches and their degree of tolerance for studies of less than exemplary quality. Some want to err on the side of inclusiveness, and others want stricter editorial standards. The tension between the two tendencies is felt in their own symposia, sometimes even within the same writer.

Where service to practitioners is concerned, a study by Ruth Lewis, Christine Urquhart, and Janet Rolinson reports that health professionals doubt librarians' ability to judge the relevance of meta-analyses for evidence-based practice (1998). However, other studies show librarians and information specialists gearing up for their new responsibilities on this front (Cumming and Conway 1998; Tsafrir and Grinberg 1998).

4.13.2 Document Delivery

The final step in literature retrieval is to obtain copies of items judged relevant. The online vendors cooperate with document suppliers so that hard copies of items may be ordered as part of an online search on the Web, often through one's academic library. Frequently the so-called grey literature, such as dissertations, technical reports, and government documents, can also be acquired on the Web or through specialized documentation centers. Interlibrary loan services are generally reliable for books, grey publications, and photocopies of articles when local collections fail. The foundation of international interlibrary loan in North America is the OCLC system, which has bibliographic records and library holdings data for more than 67 million titles, including a growing number of numeric data files in machine-readable form. In the United Kingdom, the British Library at Boston Spa serves an international clientele for interlibrary transactions. Increasingly the transfer of texts, software, and data files is being done between computers. Even so, organizations such as OCLC and the British Library are needed to make resources discoverable in the new electronic wilderness.

The key to availing oneself of these resources is to work closely with librarians and information specialists, whose role in scientific communication has received more emphasis here than is usual, in order to raise their
visibility (Smith 1996; Petticrew 2001). Locating an expert in high recall may take effort; persons with the requisite motivation and knowledge are not at every reference desk. But such experts exist, and it is worth the effort to find them.

4.14 REFERENCES

Ackoff, Russell L., et al. 1976. Designing a National Scientific and Technological Communication System. Philadelphia: University of Pennsylvania Press.
Bates, Marcia J. 1992a. "Rigorous Systematic Bibliography." In For Information Specialists; Interpretations of Reference and Bibliographic Work, by Howard D. White, Marcia J. Bates, and Patrick Wilson. Norwood, N.J.: Ablex.
———. 1992b. "Tactics and Vocabularies in Online Searching." In For Information Specialists; Interpretations of Reference and Bibliographic Work, by Howard D. White, Marcia J. Bates, and Patrick Wilson. Norwood, N.J.: Ablex.
Beahler, C. C., J. J. Sundheim, and N. I. Trapp. 2000. "Information Retrieval in Systematic Reviews: Challenges in the Public Health Arena." American Journal of Preventive Medicine 18(4): 6–10.
Bernier, Charles L., and A. Neil Yerkey. 1979. Cogent Communication; Overcoming Reading Overload. Westport, Conn.: Greenwood Press.
Blair, David C., and M. E. Maron. 1985. "An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System." Communications of the ACM 28(3): 289–99.
Bland, Carole J., L. N. Meurer, and George Maldonado. 1995. "A Systematic Approach to Conducting a Non-Statistical Meta-Analysis of Research Literature." Academic Medicine 70(7): 642–53.
Boynton, Janette, Julie Glanville, David McDaid, and Carol Lefebvre. 1998. "Identifying Systematic Reviews in MEDLINE: Developing an Objective Approach to Search Strategy Design." Journal of Information Science 24(3): 137–54.
Bramwell, V. H., and C. J. Williams. 1997. "Do Authors of Review Articles Use Systematic Methods to Identify, Assess and Synthesize Information?" Annals of Oncology 8(12): 1185–95.
Campbell Collaboration. 2006. "Guidelines." http://www.campbellcollaboration.org/guidelines.html.
Campbell Education Coordinating Group. 2005. "Information Retrieval Methods Group Systematic Review Checklist." http://www.campbellcollaboration.org/ECG/PolicyDocuments.asp.
Centre for Reviews and Dissemination, University of York. 2005. "Finding Studies for Systematic Reviews: A Checklist for Researchers." http://www.york.ac.uk/inst/crd/revs.htm.
Chalmers, T. C., C. S. Frank, and D. Reitman. 1990. "Minimizing the Three Stages of Publication Bias." Journal of the American Medical Association 263(10): 1392–95.
Cochrane, Archie L. 1989. Foreword to Effective Care in Pregnancy and Childbirth, edited by I. Chalmers, M. Enkin, and M. Keirse. Oxford: Oxford University Press.
Committee on Scientific and Technical Communication. 1969. Scientific and Technical Communication: A Pressing National Problem and Recommendations for Its Solution. Washington, D.C.: National Academy of Sciences.
Conn, Vicki S., Sang-arun Isaramalai, Sabyasachi Rath, Peeranuch Jantarakupt, Rohini Wadhawan, and Yashodhara Dash. 2003. "Beyond MEDLINE for Literature Searches." Journal of Nursing Scholarship 35(2): 177–82.
Cooper, Harris M. 1985. "Literature Searching Strategies of Integrative Research Reviewers." American Psychologist 40(11): 1267–69.
———. 1987. "Literature-Searching Strategies of Integrative Research Reviewers: A First Survey." Knowledge: Creation, Diffusion, Utilization 8(2): 372–83.
———. 1998. Synthesizing Research: A Guide for Literature Reviews, 3rd ed. Thousand Oaks, Calif.: Sage Publications.
———. 2000. "Strengthening the Role of Research in Policy Decisions: The Campbell Collaboration and the Promise of Systematic Research Review." In Making Research a Part of the Public Agenda, MASC Report 104, edited by Mabel Rice and Joy Simpson. Lawrence, Kan.: Merrill Advanced Studies Center. http://www.merrill.ku.edu/publications/2000whitepaper/cooper.html.
Cooper, Harris M., and Ronald G. Ribble. 1989. "Influences on the Outcome of Literature Searches for Integrative Research Reviews." Knowledge: Creation, Diffusion, Utilization 10(3): 179–201.
Crane, Diana. 1972. Invisible Colleges; Diffusion of Knowledge in Scientific Communities. Chicago: University of Chicago Press.
Cronin, Blaise. 1982. "Invisible Colleges and Information Transfer; A Review and Commentary with Particular Reference to the Social Sciences." Journal of Documentation 38(3): 212–36.
Cuadra, Carlos A., and Robert V. Katter. 1967. "Opening the Black Box of Relevance." Journal of Documentation 23(4): 291–303.
Cumming, Lorna, and Lynn Conway. 1998. "Providing Comprehensive Information and Searches with Limited Resources." Journal of Information Science 24(3): 183–85.
Davidson, David. 1977. "The Effect of Individual Differences of Cognitive Style on Judgments of Document Relevance." Journal of the American Society for Information Science 28(5): 273–84.
Dixon-Woods, Mary, Andrew Booth, and Alex J. Sutton. 2007. "Synthesizing Qualitative Research: A Review of Published Reports." Qualitative Research 7(3): 375–422.
Doig, Gordon Stuart, and Fiona Simpson. 2003. "Efficient Literature Searching: A Core Skill for the Practice of Evidence-Based Medicine." Intensive Care Medicine 29(12): 2119–27.
Due, Stephen. 1999. "Study Only Proves What Librarians Knew Anyway." British Medical Journal 319: 259.
Egger, Matthias, George Davey-Smith, and Douglas G. Altman, eds. 2001. Systematic Reviews in Health Care: Meta-Analysis in Context. London: BMJ Publishing.
Eisenberg, Michael, and Carol Barry. 1988. "Order Effects: A Study of the Possible Influence of Presentation Order on User Judgments of Document Relevance." Journal of the American Society for Information Science 39(5): 293–300.
Fidel, Raya. 1991. "Searchers' Selection of Search Keys: III. Searching Styles." Journal of the American Society for Information Science 42(7): 515–27.
Fink, Arlene. 2005. Conducting Research Literature Reviews: From the Internet to Paper, 2nd ed. Thousand Oaks, Calif.: Sage Publications.
Garfield, Eugene. 1989. "Reviewing Review Literature." In Essays of an Information Scientist, vol. 10. Philadelphia, Pa.: ISI Press.
Garvey, William D., and Belver C. Griffith. 1968. "Informal Channels of Communication in the Behavioral Sciences: Their Relevance in the Structure of Formal or Bibliographic Communication." In The Foundations of Access to Knowledge; A Symposium, edited by Edward B. Montgomery. Syracuse, N.Y.: Syracuse University Press.
Grayson, Lesley, and Alan Gomersall. 2003. "A Difficult Business: Finding the Evidence for Social Science Reviews." ESRC Working Paper 19. London: ESRC UK Centre for Evidence Based Policy and Practice. http://evidencenetwork.org/Documents/wp19.pdf.
Green, Bert F., and Judith A. Hall. 1984. "Quantitative Methods for Literature Reviews." Annual Review of Psychology 35: 37–53.
Greenhalgh, Trisha, and Richard Peacock. 2005. "Effectiveness and Efficiency of Search Methods in Systematic Reviews of Complex Evidence: Audit of Primary Sources." British Medical Journal 331(7524): 1064–65.
Greenhouse, Joel B., D. Fromm, S. Iyengar, M. A. Dew, A. L. Holland, and R. E. Kass. 1990. "The Making of a Meta-Analysis: A Quantitative Review of the Aphasia Treatment Literature." In The Future of Meta-Analysis, edited by Kenneth W. Wachter and Miron L. Straf. New York: Russell Sage Foundation.
Griffith, Belver C., and A. James Miller. 1970. "Networks of Informal Communication among Scientifically Productive Scientists." In Communication among Scientists and Engineers, edited by Carnot E. Nelson and Donald K. Pollock. Lexington, Mass.: Heath Lexington Books.
Grupp, Gunter, and Mary Heider. 1975. "Non-Overlapping Disciplinary Vocabularies. Communication from the Receiver's Point of View." In Communication of Scientific Information, edited by Stacey B. Day. Basel: S. Karger.
Harrison, Jane. 1997. "Designing a Search Strategy to Identify and Retrieve Articles on Evidence-Based Health Care Using MEDLINE." Health Libraries Review 14: 33–42.
Higgins, Julian P. T., and Sally Green, eds. 2005. Cochrane Handbook for Systematic Reviews of Interventions, vers. 4.2.5. Chichester: John Wiley & Sons. http://www.cochrane.org/resources/handbook/hbook.htm.
Hunt, Morton. 1997. How Science Takes Stock: The Story of Meta-Analysis. New York: Russell Sage Foundation.
Hunter, John E., and Frank L. Schmidt. 2004. Methods of Meta-Analysis: Correcting Error and Bias in Research Findings, 2nd ed. Thousand Oaks, Calif.: Sage Publications.
InterTASC Information Specialists' Sub-Group. 2006. "Search Filter Resource." http://www.york.ac.uk/inst/crd/intertasc/index.htm.
Jackson, Gregg B. 1978. Methods for Reviewing and Integrating Research in the Social Sciences. Springfield, Va.: National Technical Information Service.
———. 1980. "Methods for Integrative Reviews." Review of Educational Research 50(3): 438–60.
Jones, Myfanwy Lloyd. 2004. "Application of Systematic Methods to Qualitative Research: Practical Issues." Journal of Advanced Nursing 48(3): 271–78.
Laird, Nan M. 1990. "A Discussion of the Aphasia Study." In The Future of Meta-Analysis, edited by Kenneth W. Wachter and Miron L. Straf. New York: Russell Sage Foundation.
Lee, Wen-Lin, R. Barker Bausell, and Brian M. Berman. 2001. "The Growth of Health-Related Meta-Analyses Published from 1980 to 2000." Evaluation and the Health Professions 24(3): 327–35.
Lewis, Ruth A., Christine J. Urquhart, and Janet Rolinson. 1998. "Health Professionals' Attitudes towards Evidence-Based Medicine and the Role of the Information Professional in Exploitation of the Research Evidence." Journal of Information Science 24(5): 281–90.
Lipsey, Mark W., and David B. Wilson. 2001. Practical Meta-Analysis. Thousand Oaks, Calif.: Sage Publications.
Mann, Thomas. 1993. Library Research Models: A Guide to Classification, Cataloging, and Computers. New York: Oxford University Press.
McAuley, Laura, Ba Pham, Peter Tugwell, and David Moher. 2000. "Does the Inclusion of Grey Literature Influence Estimates of Intervention Effectiveness Reported in Meta-Analyses?" Lancet 356(9237): 1228–31.
McManus, R. J., S. Wilson, B. C. Delaney, D. A. Fitzmaurice, C. J. Hyde, R. S. Tobias, S. Jowett, and F. D. R. Hobbs. 1998. "Review of the Usefulness of Contacting Other Experts When Conducting a Literature Search for Systematic Reviews." British Medical Journal 317: 1562–63.
Menzel, Herbert. 1968. "Informal Communication in Science: Its Advantages and Its Formal Analogues." In The Foundations of Access to Knowledge; A Symposium, edited by Edward B. Montgomery. Syracuse, N.Y.: Syracuse University Press.
Montori, Victor M., Nancy L. Wilczynski, Douglas Morgan, and R. Brian Haynes. 2004. "Optimal Search Strategies for Retrieving Systematic Reviews from Medline: Analytical Survey." British Medical Journal. http://bmj.bmjjournals.com/cgi/content/short/bmj.38336.804167.47v1.
Mulrow, Cynthia D. 1987. "The Medical Review Article: State of the Science." Annals of Internal Medicine 106(3): 485–88.
Mulrow, Cynthia D., and Deborah Cook. 1998. Systematic Reviews: Synthesis of Best Evidence for Health Care Decisions. Philadelphia, Pa.: American College of Physicians.
Myers, Greg. 1991. "Stories and Styles in Two Molecular Biology Review Articles." In Textual Dynamics of the Professions; Historical and Contemporary Studies of Writing in Professional Communities, edited by Charles Bazerman and James Paradis. Madison: University of Wisconsin Press.
Nordic Campbell Center, Danish National Institute of Social Research. 2006. "What Is a Systematic Review?" http://www.sfi.dk/sw28581.asp.
Pao, Miranda Lee, and Dennis B. Worthen. 1989. "Retrieval Effectiveness by Semantic and Citation Searching." Journal of the American Society for Information Science 40(4): 226–35.
Pawson, Ray. 2006. Evidence-Based Policy: A Realist Perspective. London: Sage Publications.
Petitti, Diana B. 2000. Meta-Analysis, Decision Analysis and Cost-Effectiveness Analysis: Methods for Quantitative Synthesis in Medicine. New York: Oxford University Press.
Petticrew, Mark. 2001. "Systematic Reviews from Astronomy to Zoology: Myths and Misconceptions." British Medical Journal 322(7278): 98–101. http://bmj.bmjjournals.com/cgi/content/full/322/7278/98.
Petticrew, Mark, and Helen Roberts. 2006. Systematic Reviews in the Social Sciences: A Practical Guide. Malden, Mass.: Blackwell.
Price, Derek J. de Solla. 1986. "Invisible Colleges and the Affluent Scientific Commuter." In Little Science, Big Science . . . and Beyond. New York: Columbia University Press.
Rice, Ronald E. 1990. "Hierarchies and Clusters among Communication and Library and Information Science Journals, 1977–1987." In Scholarly Communication and Bibliometrics, edited by Christine L. Borgman. Newbury Park, Calif.: Sage Publications.
Rosenthal, Robert. 1984. Meta-Analytic Procedures for Social Research. Beverly Hills, Calif.: Sage Publications.
Rothstein, Hannah R., Alexander J. Sutton, and Michael Borenstein, eds. 2005. Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments. New York: John Wiley & Sons.
Rothstein, Hannah R., Herbert M. Turner, and Julia G. Lavenberg. 2004. The Campbell Collaboration Information Retrieval Policy Brief, Prepared for the Campbell Collaboration Steering Committee. Philadelphia, Pa.: Campbell Collaboration. http://www.duke.edu/web/c2method/IRMG-PolicyBriefRevised.pdf.
Royle, Pamela, and Ruairidh Milne. 2003. "Literature Searching for Randomized Controlled Trials Used in Cochrane Reviews: Rapid versus Exhaustive Searches." International Journal of Technology Assessment in Health Care 19(4): 591–603.
Schlosser, Ralf W., Oliver Wendt, Katie Angemeier, and Manisha Shetty. 2005. "Searching for Evidence in Augmentative and Alternative Communication: Navigating a Scattered Literature." Augmentative and Alternative Communication 21(4): 233–55.
Smeets, Rob J. E. M., Derick Wade, Alita Hidding, Peter Van Leeuwen, Johan Vlaeyen, and Andre Knottnerus. 2006. "The Association of Physical Deconditioning and Chronic Low Back Pain: A Hypothesis-Oriented Systematic Review." Disability and Rehabilitation 28(11): 673–93.
Smith, Jack T. 1996. "Meta-Analysis: The Librarian as a Member of an Interdisciplinary Research Team." Library Trends 45(2): 265–79.
Strong, William S. 1990. The Copyright Book: A Practical Guide, 3rd ed. Cambridge, Mass.: MIT Press.
Testa, James. 2006. "The Thomson Scientific Journal Selection Process." International Microbiology 9(2): 135–38.
Thames Valley Literature Search Standards Group. 2006. "The Literature Search Process: Protocols for Researchers." http://www.Library.nhs.uk/SpecialistLibraries/Download.aspx?resID=101405.
Tsafrir, J., and M. Grinberg. 1998. "Who Needs Evidence-Based Health Care?" Bulletin of the Medical Library Association 86(1): 40–45.
van Tulder, Maurits, Andrea Furlan, Claire Bombardier, Lex Bouter, and the Editorial Board of the Cochrane Collaboration Back Review Group. 2003. "Updated Method Guidelines for Systematic Reviews in the Cochrane Collaboration Back Review Group." Spine 28(12): 1290–99.
Wachter, Kenneth W., and Miron L. Straf, eds. 1990. The Future of Meta-Analysis. New York: Russell Sage Foundation.
Wade, Anne C., Herbert M. Turner, Hannah R. Rothstein, and Julia G. Lavenberg. 2006. "Information Retrieval and the Role of the Information Specialist in Producing High-Quality Systematic Reviews in the Social, Behavioural and Education Sciences." Evidence & Policy: A Journal of Research, Debate and Practice 2(1): 89–108. http://www.ingentaconnect.com/content/tpp/ep/2006/00000002/00000001/art00006.
Wallace, Alison, Karen Croucher, Mark Bevan, Karen Jackson, Lisa O'Malley, and Deborah Quilgars. 2006. "Evidence for Policy Making: Some Reflections on the Application of Systematic Reviews to Housing Research." Housing Studies 21(2): 297–314.
Wang, Morgan C., and Brad J. Bushman. 1999. Integrating Results Through Meta-Analytic Review Using SAS Software. Raleigh, N.C.: SAS Institute.
Weick, Karl E. 1970. "The Twigging of Overload." In People and Information, edited by Harold B. Pepinsky. New York: Pergamon.
White, Howard D. 1994. "Scientific Communication and Literature Retrieval." In Handbook of Research Synthesis, edited by Harris Cooper and Larry V. Hedges. New York: Russell Sage Foundation.
———. 2001. "Authors as Citers over Time." Journal of the American Society for Information Science 52(2): 87–108.
White, V. J., J. M. Glanville, C. Lefebvre, and T. A. Sheldon. 2001. "A Statistical Approach to Designing Search Filters to Find Systematic Reviews: Objectivity Enhances Accuracy." Journal of Information Science 27(6): 357–70.
Wilson, Patrick. 1977. Public Knowledge, Private Ignorance; Toward a Library and Information Policy. Westport, Conn.: Greenwood Press.
———. 1992. "Searching: Strategies and Evaluation." In For Information Specialists; Interpretations of Reference and Bibliographic Work, by Howard D. White, Marcia J. Bates, and Patrick Wilson. Norwood, N.J.: Ablex.
Ziman, John. 1968. Public Knowledge: The Social Dimension of Science. Cambridge: Cambridge University Press.
5

USING REFERENCE DATABASES

JEFFREY G. REED
Marian University

PAM M. BAXTER
Cornell University
CONTENTS
5.1 Introduction 74
5.2 Database Selection 76
5.2.1 Subject Coverage 77
5.2.2 Years of Coverage 77
5.2.3 Publication Format 78
5.2.4 Database Availability 79
5.2.5 Vendor Differences 79
5.3 Structuring and Conducting the Search 80
5.3.1 Search Strategy 80
5.3.2 Steps in the Search Process 80
5.3.2.1 Topic Definition 80
5.3.2.2 Search Terms 81
5.3.2.3 Search Profile 84
5.3.2.4 Main Searches 86
5.3.2.5 Document Evaluation 89
5.3.2.6 Final Searches 89
5.3.3 Author Search 89
5.3.4 Citation Searching 89
5.4 Issues and Problems 90
5.4.1 Need for Searching Multiple Databases or Indexes 90
5.4.2 Need for Multiple Searches 90
5.4.3 Manual Searching Versus Electronic Databases 91
5.4.4 Limits of Vocabulary Terms 92
5.4.5 Underrepresented Research Literature Formats 92
5.4.5.1 Reports and Papers 92
5.4.5.2 Books and Book Chapters 92
5.4.5.3 Conference Papers and Proceedings 93
5.4.6 Inconsistent Coverage Over Time 93
5.4.7 Non-U.S. Publication 93
5.4.8 Adequate Search Documentation 94
5.4.9 World Wide Web and Search Engines 94
5.4.10 Deciding When to Stop Searching 94
5.5 Appendix: Selected Databases in the Behavioral and Social Sciences 95
5.6 References 100

5.1 INTRODUCTION

The objective of this chapter is to introduce standard literature search tools and their use in research synthesis. Our focus is on research databases relevant to the social, behavioral, and medical sciences. We assume that the reader

• is acquainted with the key literature in his or her own discipline (such as basic journal publications, handbooks, and other reference tools) and the most important sources on the topic of synthesis;

• intends to conduct an exhaustive search of the literature; that is, he or she desires to be thorough and identify a comprehensive list of almost all citations and documents that provide appropriate data relevant to the subject under investigation;

• is familiar with the structure and organization of research libraries and the basic finding tools associated with them, such as catalogs of their holdings and periodical indexes. It is important to note that the mission of a research library goes beyond the basic materials and services designed to support course-related instruction. It maintains printed and electronic materials that provide information for independent study and exploration, staff dedicated to locating resources beyond those available in the local institution, and services to link the individual researcher with those resources;

• has or will familiarize him- or herself with other chapters in this volume, especially those on scientific communication (chapter 4) and grey literature (chapter 6);

• has or will call on the expertise of a librarian skilled in the art of bibliographic searching in his or her discipline.

The reader is also encouraged to refer to a source such as The Oxford Guide to Library Research (Mann 2005) or Library Use: Handbook for Psychology (Reed and Baxter 2003) for an overview of the basics of library use and search processes.

Why offer this chapter? In our experience, researchers may fail to identify important citations to documents relevant to their project, either because they select inappropriate search tools or because their search strategy and process are incomplete. Our goal is to help researchers make the best possible use of bibliographic search tools (in this case, reference databases), so that they can avoid this problem. When conducting a research synthesis on a topic, the goal is, or should be, completeness of coverage. Failure to identify an important piece of research (for example, an article, paper, report, or presentation) that contains data relevant to the topic may mean that the results of the synthesis could be biased and the conclusions reached or generalizations made possibly incorrect.

This chapter is focused on reference databases, also known as bibliographic databases (sources such as PsycINFO and MEDLINE), that contain citations to documents. The citations in the reference databases are surrogates that represent the documents themselves. Citations in some sources are very brief; they provide only bibliographic information that enables one to locate the document (for example, author, title, publication date). Other sources contain more extensive surrogate records, including the citation and bibliographic information as well as an abstract or summary of the source (for example, PsycINFO). Some sources, such as PsycARTICLES, may contain electronic versions of the documents themselves. The documents are those materials that contain the information we are seeking: the results of research on our topic. These documents may be in many forms, for example, books, journals, conference papers, or research reports. These documents may be available in hardcopy form or electronically.

We began our professional work with reference databases in the 1970s, and wrote the first edition of Library
Use: Handbook for Psychology, which was published in 1983. Since that time, changes have taken place in information storage and retrieval that some might characterize as a paradigm shift. In 1965, for example, there was no online PsycINFO that the researcher could access directly; there was only Psychological Abstracts, a monthly printed journal. It contained the citations and abstracts to journal articles in the field of psychology, and both subject and author printed indexes. Success in identifying relevant literature was a function of time, creativity, persistence, vocabulary, and indexer precision: how much time did the user have to search the monthly, annual, and cumulated author and subject indexes; how accurate was the user in identifying terms that describe the topic that were also used by the indexers; how persistent was the user in manually searching different sections of the monthly source that might contain a relevant citation; how complete was the controlled vocabulary of the indexer's thesaurus of subject descriptors; and how precise were the indexers in using subject terms that accurately described the content of each document. Unfortunately, there were often shortfalls in one or more of these areas.

By the late 1970s, publishers were creating electronic databases of citations used as the basis for their published indexes and abstracts; a few even made searching of these databases available for a fee. But these electronic databases were searched remotely by experienced searchers who accessed reference databases through a vendor aggregator (a service provider who bought the database and sold access time to your library). So you took your request to a librarian who used a vendor such as Dialog to search available reference databases. By the early 1980s, some reference databases were available on a CD-ROM that a library acquired from a vendor such as Silver Platter. By the 1990s, electronic versions of many databases became available online, with libraries subscribing to vendor aggregators such as EBSCO, Ovid, or ProQuest.

As time passed, the information retrieval world continued to change. Some databases disappeared; new ones were created. An increasing number of databases became available electronically. Some vendors that provided access to databases merged or went out of business. Publishers changed the appearance of database output, some databases expanded coverage in their field, and the types of documents included in databases changed (for example, by adding citations to books) over time. Some publishers expanded the content of their database by adding fields to provide more information about each citation. Some databases added abstracts to the citations. Early electronic databases only allowed searching by authors or subject terms: those terms that the indexers selected and added to documents to represent their contents. If the indexer made a judgment error in assigning subject terms and did not use a relevant key term, an electronic search would not identify the relevant document. As search engines became more sophisticated and users demanded better access, searching of titles was added. Some databases enabled searching of abstracts or even the entire text of documents if they were included in the database. Free-text searching, also known as natural language searching, became available. Although important terms that reflect the content of a document may appear in the title or abstract of a document, they may not be terms assigned as subjects to a document or terms included in the controlled vocabulary of a reference database.

Today, many libraries do not subscribe to some printed indexes or abstracts, choosing to rely entirely on electronic database access to services. Some services have ceased publication of printed versions; for example, the American Psychological Association announced in December 2006 that it would no longer publish its monthly Psychological Abstracts, but instead rely entirely on the electronic PsycINFO database to access citations to documents in the field of psychology. Indexes and abstracts (reference databases) have changed in appearance, format, content, and what they cover. The primary message is that access to reference databases and their content has changed and will continue to change.

Online user guides and tutorials for particular reference databases are increasingly common. They are becoming more user friendly and are updated more frequently than print information. They are often available, like a help system, from a selection within the reference database. Consult these tools to understand better the scope and mechanics of the reference databases you use.

Why is this important? Specific details have changed over the past forty years, but general approaches to searching reference databases have remained relatively constant. Rather than providing elaborate illustrations of services, the appearance of which the publisher or database programmer may change tomorrow, we will present generic examples. Even this handbook, unfortunately, will become a bit dated by the time it is printed. We focus this chapter on general principles and techniques related to the use of reference databases, and offer examples that we hope will stand the test of time.
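One such general principle, the gap between authors' natural language and indexers' controlled vocabulary, can be illustrated in miniature. The mini-records below are invented rather than drawn from any real database; the sketch only shows why a free-text search on a title word can miss documents that a controlled descriptor would unify.

```python
# Invented mini-records illustrating the "scattering" of natural
# language: authors phrase one concept differently, while the
# indexer's controlled descriptor gathers the documents together.

records = [
    {"title": "Eyewitness identification under stress",
     "descriptors": {"witnesses", "legal testimony"}},
    {"title": "Lineup accuracy and bystander reports",
     "descriptors": {"witnesses", "legal testimony"}},
    {"title": "Memory for staged crimes",
     "descriptors": {"witnesses", "memory"}},
]

free_text  = [r for r in records if "eyewitness" in r["title"].lower()]
controlled = [r for r in records if "witnesses" in r["descriptors"]]

print(len(free_text), "hit(s) on title word 'eyewitness'")   # 1
print(len(controlled), "hit(s) on descriptor 'witnesses'")   # 3
```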
5.2 DATABASE SELECTION

The function, content, and structure of bibliographic databases vary. A few are merely indexes, containing only the most basic bibliographic information on publications. More are abstracting services, in which bibliographic information is accompanied by nonevaluative summaries of content. An increasing number of databases either contain the entire text of a referenced item or link to the complete text of the item. For example, the indexes to proceedings and conference papers described in the appendix generally provide a bibliographic citation and a limited number of access points such as subject codes. The International Bibliography of the Social Sciences, which indexes almost 3,000 publications worldwide, contains bibliographic information about each item, a nonevaluative abstract, as well as indexing terms to aid searching (for example, to limit a search by language of publication, subject headings, publication year, or format). On the other end of the spectrum, the PsycINFO database supplies abstracts, many additional indexing options, the ability to search by cited references within indexed articles for many recent years, and (depending on the database vendor) links to the local holdings records of an individual library or the full text of an article. Its companion database, PsycARTICLES, permits searching for terms within the complete text of the articles themselves: a full-text search.

It is important to note, however, that the availability of enhanced search features (notably cited-reference and full-text search ability and links from a bibliographic citation to the full text) is a reasonably recent enhancement to database production. In addition, availability of features varies within a database itself or between versions of a database distributed by different vendors. As a result, one must carefully examine user documentation (for example, a tutorial) that accompanies a database to determine precisely what search features are available and how comprehensively they are applied.

A more recent publishing phenomenon, open source electronic publishing, complicates the landscape of published literature and requires some explanation here. Spearheaded by scholarly societies, universities, and academic libraries, these journals can provide all the characteristics of traditional commercially produced products at a lower cost and with greater flexibility (Getz 2004). However, exploiting their contents can represent a challenge to researchers. Acceptance of open-access publications as items worthy of indexing in traditional reference databases tends to vary within the social sciences. For example, EconLIT, a literature indexing tool in economics, cites a large number of working paper series, an important communication format among economists. These series are frequently hosted by universities and other research institutions and freely available over the Internet. A subset of this phenomenon is self-archiving of scholarly material by academic departments or even individual researchers. This can be as simple as making an individual paper, body of original data, or computing program accessible on a public web site. Much of this material may never be published in the sense that it is peer reviewed, published in a recognized scholarly publication, and therefore accessible via a reference database (for discussion of such materials, known as grey literature, see chapter 6, this volume).

Awareness of the scope of relevant databases helps the researcher to formulate a plan of action. Conversely, there are lacunae in database subject and publication format coverage among databases. As a general rule, selection of a bibliographic database is dictated by a combination of many factors related to your research question. These include the following:

• Disciplinary Scope. In what disciplines is research on this topic being conducted? Through what reference databases is the literature of the relevant fields accessed? For example, accuracy of eyewitness testimony in a court of law is a topic of interest to both psychologists studying memory and attorneys desiring to win their cases.

• Access. What reference databases are available to you at your institution? Where else can you access an important reference database?

• Date. What period does your topic cover? Is its intellectual foundation a relatively recent development in the discipline, or has it been the subject of research and debate over a period of decades, or even longer?

• Language and country. In many areas, significant research has been conducted outside of the USA and published in languages other than English. Although many are expanding their scope of coverage, some reference databases limit their coverage of journals published in non-Western languages or even of those not published in English. Is there research on your topic published in languages other than English or outside of the United States?

• Unpublished Work. This encompasses a varied and ill-defined body of formats, frequently referred to as grey literature. Is there relevant research in the form of conference papers, unpublished manuscripts, technical reports, or other forms?
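One way to keep these factors in view is to record them in structured form before searching begins. The sketch below is illustrative only; the field names and entries are our own invention rather than any standard, and they use the eyewitness-testimony example from the list above.

```python
# An illustrative, nonstandard way to record database-selection
# factors for one project. Field names and entries are invented
# for the eyewitness-testimony example.

selection_profile = {
    "disciplines":     ["psychology", "law", "criminal justice"],
    "databases":       ["PsycINFO", "MEDLINE",
                        "Criminal Justice Periodical Index"],
    "access":          "campus subscriptions; MEDLINE free on the Internet",
    "date_range":      "no lower limit; topic predates electronic coverage",
    "languages":       ["English", "German"],
    "grey_literature": ["conference papers", "technical reports",
                        "dissertations"],
}

for factor, plan in selection_profile.items():
    print(f"{factor}: {plan}")
```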
5.2.1 Subject Coverage

Each bibliographic database is designed for a particular audience and has a defined scope of subject coverage. The list of selected databases provided in the appendix reveals some obvious subject strengths among reference databases regarding subject coverage. Some, such as PsycINFO and MEDLINE, cover core publications within their disciplines as well as titles from related and interdisciplinary areas of study. Researchers and publications in areas such as psychology and medicine cover such diverse topics that the reference databases offer a broad scope of subject coverage. Other databases, such as Social Work Abstracts, Ageline, and Criminal Justice Periodical Index, have a narrower subject focus. Reference volumes such as the Gale Directory of Databases can be helpful in identifying bibliographic databases of potential interest. In many areas of research, it is not only advantageous but essential to access multiple reference database sources to ensure comprehensive coverage of the published literature. This is especially important for topics that are interdisciplinary (such as stress, communication, conflict, leadership, group behavior, learning, public policy). As Howard White notes in an earlier chapter, it is important to avoid missing relevant material that lies outside one's normal purview.

5.2.2 Years of Coverage

Depending on the nature of a topic, the number of years covered by an indexing source may be important to the literature search process. Many bibliographic databases began production in the late 1960s and early 1970s; thus, it is common that their coverage of the published journals extends for more than thirty years. For example, the ERIC database, initiated in 1966 and covering education and related areas, contained the same information as the printed indexes produced from that database. Many databases, however, were preceded by printed indexes, which were published on a monthly or quarterly basis for many years and cumulated annually before electronic databases were available, or even possible. A complete and retrospective literature search often entailed using both hardcopy index volumes and a bibliographic database.

To some extent, and in some areas, this has changed over the last ten years. Some index publishers have enhanced their online databases by retrospectively adding the contents of their hardcopy predecessors, PsycINFO being an excellent example. The PsycINFO database originally began with a coverage date back to 1967, whereas the printed volumes indexed publications back to 1887. PsycINFO gradually added the contents of Psychological Abstracts so that its coverage is truly retrospective. In recent years, PsycINFO coverage has been extended to include Psychological Abstracts 1927–1966, Psychological Bulletin 1921–1926, American Journal of Psychology 1887–1966, Psychological Index English-language journals 1894–1935, Classic Books in Psychology of the 20th Century, and Harvard List of Books in Psychology, 1840–1971 (APA Online 2006). In fact, some publishers have discontinued their printed products in favor of online versions due to publishing costs and declining subscriptions to the print. APA ceased publication of Psychological Abstracts in favor of PsycINFO in 2006 (Benjamin and VandenBos 2006).

The National Library of Medicine (2000, 2004) discontinued publication of its Cumulated Index Medicus and monthly Index Medicus in favor of MEDLINE access in 2000 and 2004, respectively. Although the MEDLINE database provided comprehensive coverage of the journal literature back to 1966, a variety of earlier reference tools (for example, Quarterly Cumulative Index to Current Literature covering 1916 to 1926 and Bibliographica Medica from 1900 to 1902) extends the reach of medical literature indexing much earlier in time, before production of the National Library of Medicine's Index Medicus and MEDLINE (National Library of Medicine 2006). An additional illustration is Index of Economic Articles in Journals and Collected Volumes, which began publication in 1924 with some coverage extending back as far as 1886, whereas EconLIT coverage extends back to 1969. The American Economic Association (AEA) discontinued Index of Economic Articles in favor of EconLIT in 1996. Examples of printed indexes that precede the coverage provided by reference databases are numerous. Information about them is often included in help or tutorial guides that accompany reference databases. Additional sources of such information include the expertise provided by library reference staff, discipline-specific guides to reference literature, or more general reference guides. One example of a discipline-specific volume that provides
such guidance is Sociology: A Guide to Reference and Information Sources (Aby, Nalen, and Fielding 2005). The broader social sciences are covered by several current guides, including those by Nancy Herron (2002) and David Fisher, Sandra Price, and Terry Hanstock (2002). The inclusive Guide to Reference Books (Balay 1996) is now in its eleventh edition.

Many database producers, however, have limited their retrospective coverage, because the cost is high relative to potential demand for information about older research material. Still others have elected to provide limited retrospective coverage, for example, by adding citations but not summaries or abstracts of older documents. Thus, you must know both the timeframe over which research on your topic extends and the coverage dates of the reference database you are using.

The limits of database coverage can tempt a researcher to restrict search strategies to particular reference databases based on publication dates (for example, 1965 to present). Unless the topic is one that has emerged within the defined time frame and there is no literature published outside that time, this is a poor practice. Date limiting a project can lead to the failure to include important research results in a research synthesis, and thus potentially significantly bias its results.

When considering the criteria for inclusion and coverage of a reference database, carefully note the dates when a particular journal began to be included and whether titles are afforded cover-to-cover or selective indexing. Just to illustrate this point, Acta Psychologica began publication in 1935, though PsycINFO indexes its contents beginning with 1949 (APA Online 2007). At present, Acta Psychologica contents are indexed in their entirety. However, it is difficult to determine in what index, whether printed or electronic, the early issues of Acta Psychologica are indexed, or if PsycINFO began its cover-to-cover indexing of its contents in 1949 or at some later time. Such journal titles might be candidates for hand searching if they are highly relevant to the project at hand and completeness of their indexing cannot be verified.

5.2.3 Publication Format

Some bibliographic databases index only one type of publication, for example, dissertations, books, research reports, or conference papers. Others provide access to a wide variety of formats, although the coverage of every format may not be consistent across all years of a particular database. Suffice it to say that some research formats are not as well represented bibliographically as others. Articles from well-established, refereed journal titles are by far the most heavily represented format in reference databases. The empirical literature, however, is not restricted to one publication type, so finding research in other formats and procuring the documents is critical, although it can be challenging. Although this topic is covered extensively in the context of the grey literature in chapter 6 of this volume, a few generalizations warrant mention here.

Subject access to information on book-length studies is available within several comprehensive, international databases. WorldCat, a service of the Online Computer Library Center (OCLC), allows you to search the catalogs of many libraries throughout the world. It not only contains records for books, manuscripts, and nonprint media held by thousands of libraries, it has also recently incorporated information about websites, datasets, media in emerging formats, and so forth. These are important sources because books may contain empirical evidence relevant to the research synthesis. However, finding information about content within individual chapters remains difficult, a situation exacerbated by inconsistent coverage both among and within disciplines. For example, PsycINFO (and to a more limited extent, APA's PsycBOOKS database) affords detailed coverage of book chapters but is by no means exhaustive in its retrospective coverage of books or publishers. Other reference databases, such as International Bibliography of the Social Sciences, incorporate book chapter coverage. However, no one or two databases provide adequate coverage of this publication format.

Coverage of research and technical reports in the social sciences is uneven. Their unpublished nature dictates limited bibliographic control and distribution. Interim or final reports produced by principal investigators on grants may eventually be indexed in the National Technical Information Service (NTIS) database, an index to more than 2 million reports issued since the 1960s (NTIS 2006). Established report series are most frequently indexed in databases such as NTIS and EconLIT and, to a lesser extent, PsycINFO and ERIC. The launch of APA's PsycEXTRA product in 2004 was intended, at least in part, to address this gap. The large numbers of report series generated by university departments, research institutes, and associations are difficult to identify, let alone index and locate. Their wider availability on Web pages ameliorates this situation to some extent, though much content remains in the "deep Web" beyond the reach of most standard Internet search engines.
Like research reports, papers presented at scholarly conferences have traditionally been difficult to identify and even harder for researchers to acquire. Information about conference proceedings and individual papers remains difficult to locate, despite improved bibliographic access and the posting of papers on individual or institutional websites. A few databases, notably ProceedingsFirst and PapersFirst, index this material exclusively. However, representation of this research communication format remains limited in most databases.

The literature review in many dissertations is extensive, and the quality of the empirical research is often extremely high. Ignoring such documents in a research synthesis is ill-advised. Although some consider this a part of the grey literature, for many years access has been provided through Dissertation Abstracts International (DAI). More recently, access is available through the electronic ProQuest Dissertations & Theses. Dissertations and theses, however, are often not a part of the local library's collections, and must be purchased. Although it is desirable to obtain a dissertation through interlibrary loan (ILL), some libraries do not lend theses and dissertations. In Europe, part of the process in many universities requires that a doctoral dissertation be published for the granting of the degree, making access to these documents a bit easier through searching a national bibliography of published monographs. As a general rule, however, researchers requiring access to a large number of doctoral dissertations should be prepared for a lengthy wait to obtain them through interlibrary loan, significant costs to purchase them, or both.

Locating materials in grey literature formats (such as technical reports, conference papers, grant reports, and other documents) can be difficult. They are often not included in major reference databases. In chapter 6, Rothstein and Hopewell provide valuable guidance on searching the grey literature, and suggest approaches that supplement your use of reference databases.

5.2.4 Database Availability

Ideally, selection of the reference databases is dictated solely by the research inquiry; that is, the subject scope, years of coverage, and so forth. An additional consideration is the accessibility of files to the individual researcher and, by extension, the availability of the publications they index.

Most academic libraries and larger public libraries subscribe to a suite of bibliographic databases, their selection tailored to the needs of their users (faculty, staff, students, and general users). These arrangements allow users of that particular institution to search and print or save citations, abstracts, or even the complete text of a publication to use in their research. Researchers at large, well-funded research institutions serving a wide range of disciplines are more likely to have access to both core (such as PsycINFO, Biological Abstracts) and specialized databases (for example, EI Compendex, Derwent Biotechnology Abstracts, Social Work Abstracts). However, differential pricing makes many of the core databases affordable even for smaller institutions or those with less comprehensive program offerings. For independent scholars, or those unaffiliated with an academic institution, there are other options that permit individual subscriptions. In addition, some major databases, notably ERIC and MEDLINE, are produced on behalf of the federal government and can be used on the Internet at no charge.

The cost of sources such as the electronic Web of Science, the basis for the printed Science Citation Index and Social Science Citation Index, however, has become so high that many institutions have ceased subscribing. The cumulative cost of reference databases to libraries may limit your access to more specialized tools. You may have to find an alternate way to access an important, relevant reference database, such as contracting with a commercial service to conduct a search on your behalf or initiating short-term access as an individual subscriber. Professional associations may provide access to a limited number of databases as a membership benefit or at a reduced cost. For example, the American Economic Association provides discounted limited access to EconLit, and the American Psychological Association has subscription options for individuals, whether or not they are APA members. Your college or university library may be able to provide you, as a scholar researcher, an introduction to a nearby research library that has the resources you need; ask a librarian for assistance.

5.2.5 Vendor Differences

Many databases are available from more than one vendor. As mentioned earlier, ERIC and MEDLINE are examples of two that are freely available for Internet searching. In addition, these files are also accessible from commercial vendors (for example, EBSCO, Ovid). In the case of files accessible via more than one source, one would assume that the same search strategy implemented on the same database would yield the same body of citations.
However, this is not always the case, for reasons that are difficult to identify. Frequency of updates or dates of coverage are obvious sources of differences. Other less obvious differences may be search defaults ("false memory" with quotation marks might yield a larger set than the same terms without the quotation marks) or the differential use of stop words (words that are ignored regardless of how they are used in the search). Just as different Internet search engines will yield differing results to the same search, reference database search engines employing different search algorithms provided by different vendors may result in different search results. Your primary strategies to ensure completeness will be based on the methods described shortly, such as a careful search strategy, multiple searches, and multiple reference databases.
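When the same database is searched through two vendors, the overlapping result sets must eventually be merged and deduplicated. The records below are invented, and the normalization rule is only one possibility, but the sketch shows the idea: build a citation key that ignores trivial formatting differences between vendors.

```python
# Invented records, not actual vendor output. A normalized citation
# key (first author, year, opening title words) lets trivially
# different renderings of the same record collapse into one.

def citation_key(rec):
    first_author = rec["authors"].split(";")[0].strip().lower()
    title_start = " ".join(rec["title"].lower().split()[:5])
    return (first_author, rec["year"], title_start)

vendor_a = [
    {"authors": "Wells, G.; Olson, E.", "year": 2003,
     "title": "Eyewitness testimony"},
]
vendor_b = [
    {"authors": "WELLS, G.; OLSON, E.", "year": 2003,
     "title": "Eyewitness Testimony"},
    {"authors": "Koriat, A.", "year": 2000,
     "title": "Toward a psychology of memory accuracy"},
]

merged = {citation_key(r): r for r in vendor_a + vendor_b}
print(len(merged), "unique records from",
      len(vendor_a) + len(vendor_b), "retrieved")
# 2 unique records from 3 retrieved
```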
5.3 STRUCTURING AND CONDUCTING THE SEARCH

Think of a bibliographic search as a treasure hunt. In the case of research synthesis and meta-analysis, however, there is no one single treasure chest. It is as though many pieces of treasure were dumped overboard and the task is to retrieve as many pieces as possible from the ocean. Unfortunately, there are reefs hiding pieces, currents have pushed pieces about, and since it was strewn about many years ago, much of the treasure is covered by sand and plants. It will take a lot of effort to retrieve each piece of treasure. But because the pieces of treasure complete a puzzle, it is important to find every piece of the puzzle that you possibly can to unlock the secrets of the puzzle.

5.3.1 Search Strategy

Unlocking the secrets of the puzzle requires a plan. Defining a plan for finding relevant research evidence to include in the research synthesis is a key element of the process. This plan is the search strategy. It is a road map that defines how one will pursue the process of seeking information. Developing a solid plan will aid you in proceeding logically, prevent unnecessary retracing of steps already covered, and aid in the quest for completeness. What should be included in the search strategy?

• Research Question: What is the specific topic? How is it constrained?

• Descriptors: What terms are used to describe key aspects of this topic?

• Search Profile: How will the search be structured (that is, how will search terms be combined logically)? A sketch follows this list.

• Sources: What bibliographic sources will be searched: reference (bibliographic) databases, web sources? What time period will be covered? How will electronic sources be supplemented? What sources will be consulted to access the grey or fugitive literature?

• Methods: How will the search be conducted? In what order will elements of the search be pursued?
puzzle. 5.3.2.1 Topic Definition The first step in conducting
a search is to clearly and precisely define the topic. Here
5.3.1 Search Strategy one determines the scope of what aspects of the subject
are included or excluded, the subject population, the
Unlocking the secrets of the puzzle requires a plan. methodology employed, and so forth. Defining the topic
Defining a plan for finding relevant research evidence to requires knowledge of the subject gained through pre-
include in the research synthesis is a key element of the liminary examining and evaluating available literature. It
process. This plan is the search strategy. It is a road map may start with a textbook chapter, a literature review
that defines how one will pursue the process of seeking article, or a conference paper. It should be followed by
information. Developing a solid plan will aid you in pro- more reading and study of related literature. Many re-
ceeding logically, prevent unnecessary retracing of steps searchers would support it with a quick preliminary lit-
already covered, and aid in the quest for completeness. erature search. The goal of this first search is to begin to
What should be included in the search strategy? understand the scope of the topic, where the controver-
sies lie, what research has been conducted, and how large
Research Question: What is the specific topic? How the body of literature is. One should then state the topic in
is it constrained? the form of a sentence or question.
Descriptors: What terms are used to describe key as- In this chapter we will use a sample topic of eyewitness
pects of this topic? testimony. There are many aspects of this topic we could
In this chapter we will use a sample topic of eyewitness testimony. There are many aspects of this topic we could consider. In their research synthesis of studies on the validity of police lineups for the purpose of eyewitness identification of perpetrators, Gary Wells and Elizabeth Olson considered characteristics of the witness, the event that occurred, and the testimony that was given (2003). They also considered system variables such as lineups, pre-lineup instructions given to witnesses, the way in which the lineup is presented, and so forth. Asher Koriat, Morris Goldsmith, and Ainat Pansky, in their synthesis of studies on the accuracy of memory, considered factors such as misattributions, false recall, and the provision of misleading postevent information that can affect eyewitness testimony (2000). Siegfried Sporer and his colleagues focused on confidence and accuracy in eyewitness identification (1995). Elizabeth Loftus differentiated among perceiving events, retaining information, retrieving information, and differences among individuals as they affect accuracy of testimony (1996). One could look at anxiety and stress, arousal, use of leading questions, or focus on children as eyewitnesses (Siegel and Loftus 1978; Johnson 1978; Loftus 1975; Dale, Loftus, and Rathbun 1978). For this chapter, we start by defining a fairly broad sample topic: memory accuracy of adult eyewitness testimony. We have limited our study of the testimony given by eyewitnesses to the accuracy of the report given by adults and a focus on the memories of the eyewitnesses.

5.3.2.2 Search Terms. The second step is to identify terms descriptive of the topic. In a typical bibliographic search, the goal is to use the right set of terms in the right way to accurately describe the topic at the appropriate level of specificity. Classically, we strive to retrieve a maximum of hits (relevant sources), a minimum of false positives (sources identified but irrelevant), and a very small number of false negatives (relevant sources not identified), all to maximize efficiency. In research synthesis, however, the paradigm is different. Because the goal is thoroughness (completeness), the primary need is to reduce the number of false negatives (relevant sources not identified), so the research synthesist must be willing to examine a larger number of false positives. To reduce false negatives, one must expand the search by including all relevant synonyms and appropriate related terms. One must at the same time be concerned about precision; as the net expands with more alternate search terms (for example, synonyms), the number of citations retrieved can be expected to increase. Each must be checked to determine its relevance, a process that increases both the costs of the synthesis and the potential for errors to occur when judging relevance.

What types of terms should be included? In the field of health care, when examining health-care interventions, The Cochrane Handbook suggests that one should include three types of concepts in an electronic reference database search: the health condition under investigation, the intervention being evaluated, and the types of studies to be included (Higgins and Green 2006). This advice may be adapted for interventions in other areas, such as the social sciences, social policy, and education. In each, we are often concerned with a problem to be addressed (psychological, behavioral, developmental, social, or other), an approach being considered to address the problem, and methodological approaches that may provide acceptable data for the research synthesis. Not all research syntheses, however, are focused on interventions. In chapter 2 of this volume, Harris Cooper discusses alternative research approaches that may provide an assessment of descriptive or other nonquantitative research. The research synthesist must understand the topic well enough to cover its important domains.

It is important to note that the literature on many topics is scattered. When researchers in different disciplines conduct research on a topic, their frames of reference and the vocabularies of their disciplines differ. Articles on similar or related topics are published in different journals, cited in different indexes, and read by different audiences. When a new research area is emerging, as different researchers coin their unique terms to describe a phenomenon, lack of consistency of vocabulary is a problem. Yet the research in these different fields may be related.

Because of this problem of inconsistent vocabulary, it is important that the research synthesist carefully construct a set of relevant search terms that represents different approaches to the topic and includes synonyms that may be used in early stages of the development of the area. An example of this may be seen in the 1970s, when lines of related psychological research used the terms chunking versus unitizing. Claudia Cohen and Ebbe Ebbesen used the term unitizing to describe schemas provided to subjects to be used in organizing streams of information to be processed (1979). They found that providing different unitizing schemas resulted in different processing of information and subsequently different assessments of the personality of persons observed. Robert Singer, Gene Korienek, and Susan Ridsdale used different learning strategies (one of which was chunking) to assess retention and retrieval of information as well as transfer of information to other tasks (1980).
Both studies involved cognitive processing and organization of information and its impact on later judgments. It is true that the two groups studied different facets of the psychological phenomenon and pursued differing theoretical propositions. However, though these lines of research are related, the search terms attached to the two studies in the PsycINFO database were totally different, as noted in table 5.1.

Table 5.1 Subject Terms in PsycINFO

- Search: Unitizing. Citation: Cohen, Ebbesen (1979). Subject terms: Classification (cognitive process); Experimental instructions; Impression formation; Memory; Personality traits.
- Search: Chunking. Citation: Singer, Korienek, Ridsdale (1980). Subject terms: Fine motor skill learning; Imagery; Retention; Strategies; Transfer (learning); Verbal communication.

source: Authors' compilation.

The definition of the topic should

- reflect the scope as well as the limits of the research precisely,
- include all important concepts and terms,
- indicate relationships among concepts, and
- provide criteria for including and excluding materials.

Returning to our sample topic on eyewitness testimony, several primary concepts lie at its core: memory, accuracy, adult, eyewitness, and testimony. To expand the list of possible search terms and identify those most frequently used, based on preliminary reading and quick preliminary searches, we started with several documents relevant to our topic. The citations to the Loftus book and the articles noted earlier were found in PsycINFO.

We examined the subject terms attached to these documents in their source citations by the source indexers. We found a wide variety of subject terms, which are presented in table 5.2. It is interesting to note that in five documents, terms such as eyewitness, witness, testimony, and accuracy were not among those subjects assigned by the PsycINFO indexers to the documents. Thus, retrieval of these documents on eyewitness testimony would have been difficult using these subject terms.

Table 5.2 Citations to Documents in a Preliminary Search on Eyewitness Testimony

- Buckhout, Robert. 1974. "Eyewitness Testimony." Scientific American 231(6): 23-31. ISSN: 0036-8733. PsycINFO Accession # 1975-11599-001. Source: PsycINFO. Subject terms: Legal processes; Social perception.
- Johnson, Craig L. 1978. "The Effects of Arousal, Sex of Witness and Scheduling of Interrogation on Eyewitness Testimony." Dissertation Abstracts International 38(9-B): 4427-4428. Doctoral dissertation, Oklahoma State University. ISSN: 0419-4217. PsycINFO Accession # 1979-05850-001. Source: PsycINFO. Subject terms: Crime; Emotional responses; Human sex differences; Observers; Recall (learning).
- Leinfelt, Fredrik H. 2004. "Descriptive Eyewitness Testimony: The Influence of Emotionality, Racial Identification, Question Style, and Selective Perception." Criminal Justice Review 29(2): 317-40. ISSN: 0734-0168. Wilson Web OmniFile Full Text Mega Accession # 200429706580003. Source: Wilson Web OmniFile. Subject terms: Identification; Witnesses-psychology; Eyewitness identification; Race awareness; Recollection (psychology); Witnesses.
- Loftus, Elizabeth F., and Guido Zanni. 1975. "Eyewitness Testimony: The Influence of the Wording of a Question." Bulletin of the Psychonomic Society 5(1): 86-88. ISSN: 0090-5054. PsycINFO Accession # 1975-21258-001. Source: PsycINFO. Subject terms: Grammar; Legal processes; Memory; Sentence structure; Suggestibility.
- Loftus, Elizabeth F., Diane Altman, and Robert Geballe. 1975. "Effects of Questioning Upon a Witness' Later Recollections." Journal of Police Science & Administration 3(2): 162-165. ISSN: 0090-9084. PsycINFO Accession # 1976-00668-001. Source: PsycINFO. Subject terms: Interviewing; Memory; Observers.
- Wells, Gary L. 1978. "Applied Eyewitness-Testimony Research: System Variables and Estimator Variables." Journal of Personality and Social Psychology 36(12): 1546-1557. ISSN: 0022-3514. PsycINFO Accession # 1980-09562-001. Source: PsycINFO. Subject terms: Legal processes; Methodology.

source: Authors' compilation.
note: Examples identified on eyewitness testimony, with associated subject terms assigned by source indexers.

Note that for the purpose of accuracy, we include both the ISSN (International Standard Serial Number) for the serial publication (journal) in which each of these articles was published, and the accession number assigned to the citation within the index (for example, PsycINFO, Wilson Web OmniFile Full Text Mega) in which the citation was identified. The ISSN is a unique eight-digit number assigned to a print or electronic serial publication to distinguish it from all other serial publications. An accession number is a unique identifier assigned to each citation within a reference database that differentiates one citation from all others. (Keeping track of this is especially important if you need to request an Interlibrary Loan (ILL) of a document, to avoid confusion and increase the probability of your receiving exactly what you request. Some ILL departments may request either the ISSN or the name of the reference database and the citation's accession number.)

To expand the list of subject search terms, we consult thesauri used by sources likely to contain relevant citations to documents. Many reference databases, such as PsycINFO and MEDLINE, use a controlled vocabulary of subject terms that describe the topical contents of documents indexed in the database. This controlled vocabulary defines what terms will or will not be used as subject indexing terms within the reference database and how they are defined for use. A thesaurus is an alphabetical listing of the controlled vocabulary (subjects, or descriptors) used within the reference database. A thesaurus may use a hierarchical arrangement to identify broader (more general), narrower (more specific), and related subject terms.
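Where a thesaurus is available in machine-readable form, this lookup step can even be scripted. The short sketch below is our own illustration, not any database's actual interface: it hand-codes a few entries modeled on the PsycINFO thesaurus findings reported in the next section and expands each candidate term into its preferred descriptor plus related terms.

    # Illustrative only: a hand-built fragment of a controlled vocabulary,
    # modeled on the PsycINFO thesaurus entries discussed in this chapter.
    # Real thesauri are consulted through the database vendor's interface.
    THESAURUS = {
        # candidate term: (preferred descriptor or None, related terms)
        "eyewitness":      ("witness", []),
        "testimony":       (None, ["legal testimony", "expert testimony"]),
        "memory accuracy": (None, ["memory", "early memories",
                                   "short-term memory", "visuospatial memory",
                                   "false memory"]),
    }

    def expand(term):
        """Return the preferred descriptor (if any) plus related terms."""
        preferred, related = THESAURUS.get(term.lower(), (None, []))
        return [preferred or term] + related

    for candidate in ("eyewitness", "testimony", "memory accuracy"):
        print(candidate, "->", expand(candidate))

Running the sketch simply prints, for each candidate term, the list of descriptors one would actually carry forward into the search profile.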
We first examined the online thesaurus of PsycINFO for the terms eyewitness, testimony, and memory accuracy. This identified for us the date when each search term was added to the index vocabulary. Before the date it was included, a search term would not have been assigned to any citations to documents and thus would not be useful in retrieving citations in that reference database source. This is especially important if searching is limited to subject field searching. In a scope note, the thesaurus defined the meaning of the term and how it is used. It also identified preferred and related terms. Of interest is the fact that eyewitness is not used in PsycINFO; instead the term used is witness. Expert testimony is used, but we cannot find the term testimony. Memory accuracy is not used, but a variety of related terms are available, for example, memory, early memories, short-term memory, visuospatial memory, and so forth. Examples are provided in table 5.3.

As noted earlier, different sources often use different subject search terms. This can be seen in table 5.4, where we compare terms used in PsycINFO, MEDLINE, Criminal Justice Periodicals Index, and Lexis-Nexis Academic Universe. (Academic Universe provides worldwide, full-text access to major newspapers, magazines, broadcast transcripts, and business and legal publications. It is included here as an example of a general information resource, as compared to the specialized, research-oriented services used by those searching the research literature.) Note that PsycINFO prefers to use the age group field to limit the search as adult (18 years and older) rather than using the subject term adult.
Both PsycINFO and MEDLINE have carefully developed thesauri of subject search terms, allowing us to learn how subject terms are used within the PsycINFO and MEDLINE indexes. Neither Criminal Justice Periodical Index nor Lexis-Nexis, however, has such a source available to the user. When a reference database does not provide a thesaurus, it is especially important to use preliminary searches to identify relevant citations and ascertain from them what subject terms are being used by the database to index documents relevant to your topic.

Table 5.3 Subject Terms in PsycINFO Thesaurus

- Term investigated: Eyewitness. Preferred term: Witness (added 1985). Scope: Persons giving evidence in a court of law or observing traumatic events in a nonlegal context. Also used for analog studies of eyewitness identification performance, perception of witness credibility, and other studies of witness characteristics having legal implications.
- Term investigated: Testimony. This term is NOT included in the PsycINFO Thesaurus.
- Term investigated: Legal testimony. Preferred term: Legal Testimony (added 1982). Scope: Evidence presented by a witness under oath or affirmation (as distinguished from evidence derived from other sources), either orally or written as a deposition or an affidavit.
- Term investigated: Memory accuracy. This term is NOT included in the PsycINFO Thesaurus. Related terms: Memory; Early memories; Short-term memory; Visuospatial memory; Autobiographical memory; Explicit memory; False memory; Iconic memory; Implicit memory; and others.

source: Authors' compilation.

5.3.2.3 Search Profile. The third step is to provide a logical structure for the search. Having identified an initial set of search terms using the variety of methods noted earlier, we next would use boolean operators to link concepts. Boolean algebra was invented by George Boole (1848), refined by Claude Shannon (1940) for use in digital circuit design, and linked with the Venn diagrams developed by John Venn (1880) to provide what is known today as boolean logic. The boolean AND operator is used to link concepts as an intersection: both concepts must be attached to a document for it to be retrieved. For example, with eyewitness AND testimony, both terms must be present for a citation to be retrieved. AND is generally used to restrict and narrow a search. The boolean OR operator is used to link concepts as a union: either concept being found related to the document would retrieve the document. For example, reliability OR accuracy would retrieve any citation that included either of these terms. OR is generally used to expand and broaden a search. The boolean NOT is used to exclude a particular subject term. For example, NOT children would exclude from a search items that included reference to children. The boolean NEAR may be used in some indexes to link terms that may not be adjacent, but must be near each other within the text to be retrieved.
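These operators behave exactly like elementary set operations, which a few lines of code can show. The record IDs below are made up for illustration; a real index maps each term to the set of documents tagged with it.

    # Boolean search operators as set operations. The sets hold hypothetical
    # record IDs for citations indexed under each term.
    eyewitness = {101, 102, 103, 104}
    testimony = {102, 104, 105}
    children = {104}

    print(eyewitness & testimony)               # AND (intersection): {102, 104}
    print(eyewitness | testimony)               # OR (union): {101, 102, 103, 104, 105}
    print((eyewitness & testimony) - children)  # ... NOT children: {102}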
Table 5.4 Subject Search Terms in Different Sources

- PsycINFO (use thesaurus). For memory accuracy: Memory; narrower terms include Autobiographical memory, Early memories, Episodic memory, Explicit memory, False memory, Implicit memory, Memory decay, Memory trace, Reminiscence, Repressed memory, Short-term memory, Verbal memory, Visual memory, and Visuospatial memory; related terms include Cues, Forgetting, Hindsight bias, Procedural knowledge, and Serial recall. For adult: limit the search by Age Group: Adult (18 years & older) AND Population: Human. For eyewitness: Witness. For testimony: Expert testimony.
- MEDLINE (use MeSH, Medical Subject Headings). For eyewitness: eyewitness not used; witness not used. For testimony: testimony not used; expert testimony used.
- Criminal Justice Periodical Index, ProQuest (no thesaurus available). For memory accuracy: Memory. For adult: Adult. For eyewitness: Eyewitness; Witness. For testimony: Testimony; Report; Evidence.
- Lexis-Nexis (no thesaurus available). For memory accuracy: Memory; Reliability; Accuracy. For eyewitness: Eyewitness; Witness.

source: Authors' compilation.

In some reference databases a wild card such as an asterisk (*) may be used to include multiple variants sharing the same truncated root or stem. For example, using adult* would retrieve citations containing any of the following terms: adult, adults, adultery, adulterer. The risk when using wild cards is that irrelevant terms may inadvertently become part of the search strategy, as in the preceding example.
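The truncation risk is easy to demonstrate with a regular expression, which behaves like the wild card just described (the word list is ours, chosen to make the point):

    import re

    # The pattern adult\w* mimics the truncated search term adult*.
    pattern = re.compile(r"adult\w*", re.IGNORECASE)
    candidates = ["adult", "adults", "adultery", "adulterer", "adulthood"]

    matches = [word for word in candidates if pattern.fullmatch(word)]
    print(matches)  # all five variants match, relevant or not

Every variant matches, so every citation containing any of them would be retrieved, and the irrelevant ones must then be screened out by hand.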
In our example, we have identified five important concepts. These are eyewitness, testimony, memory, accuracy, and adult. These would be logically joined by an AND. Additionally, there are alternate terms related to each of these concepts. Because different disciplines, authors, and reference databases use differing terminology, we need to include all of these, as noted conceptually in figure 5.1. These sets of alternative terms would be linked with OR operators within our search. The portion in the middle of the figure, the intersection where all five conceptual areas overlap, represents the set of items that we would expect to be retrieved from our search. Our search may be incomplete when it fails to include one or more key terms in one of the conceptual areas, or when the citation itself includes a term or terms different from our search parameters because the authors or indexers are using a different vocabulary.

[Figure 5.1 Venn Diagram Illustrating the Intersection of Five Concepts and Their Related Terms in the Eyewitness Testimony Search Strategy. The five overlapping sets are: Eyewitness (eyewitness, witness, observer, victim); Testimony (testimony, report, evidence); Memory (memory, recall, recollection, false memory, false recall); Reliability (reliability, accuracy, distortion); and Adult (adult, not child). source: Authors' compilation.]

There may be an implied relationship among concepts in a search. A computer bibliographic search, however, will not generally deal with the relationship among concepts. For the most part, the current generation of bibliographic search tools available in 2008 is limited to boolean relationships among identified concepts. This will limit the precision of a search. For example, suppose the research topic had been the impact of teacher expectations on the academic performance of children in differing racial groups. The technology of the search software generally available today does not allow making a distinction between two possible topics: the influence of knowledge of academic performance scores on the expectations of teachers concerning their students' performance, and the impact of teachers' expectations in influencing students' academic performance. We must examine each document identified to determine whether it is related to the former or the latter topic.

5.3.2.4 Main Searches. The fourth step is to begin searching. The preliminary searches have begun to explore the field, led to some key documents, helped define the topic, explored which reference databases will be most useful, and identified important search terms. Now it is time to begin extensive searching to execute the search strategy developed. Each major discipline that has an interest in the topic has a reference database, and that database must be searched to retrieve the relevant literature of the discipline.

Important terms that reflect the content of a document may appear in the title or abstract of a document, though they may not be terms assigned as subjects to a document or terms included in the controlled vocabulary of a reference database. In the age of the printed index and the early years of electronic reference databases, one was limited to searching by subject search term. Searching the subject index to find terms the indexers used was limited by the controlled vocabulary of the index and the accuracy of indexers. Today the researcher has the ability, with an electronic reference database, to expand the search by looking beyond the subject or author index to scan the title or abstract (if your reference database includes abstracts), and in some cases even the document itself if it is available electronically, for relevant terms, whether they are included in the controlled vocabulary (subject search terms) or not included (free-text or natural language search terms).
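Before running the main searches, it can help to assemble the full profile mechanically, ORing the alternate terms within each concept and ANDing the five concept groups together. The sketch below builds the profile of figure 5.1 as a single query string; the generic AND/OR syntax is illustrative, since every vendor has its own required form.

    # Build the five-concept search profile from figure 5.1.
    # Quoting of multiword phrases and the AND/OR syntax are generic
    # illustrations; adapt them to the target database's query language.
    CONCEPTS = {
        "eyewitness":  ["eyewitness", "witness", "observer", "victim"],
        "testimony":   ["testimony", "report", "evidence"],
        "memory":      ["memory", "recall", "recollection",
                        "false memory", "false recall"],
        "reliability": ["reliability", "accuracy", "distortion"],
        "adult":       ["adult"],
    }

    def quote(term):
        return f'"{term}"' if " " in term else term

    def build_query(concepts):
        groups = ("(" + " OR ".join(quote(t) for t in terms) + ")"
                  for terms in concepts.values())
        return " AND ".join(groups)

    print(build_query(CONCEPTS))

The printed string matches the search parameters used later in tables 5.6 and 5.8, and editing one synonym list regenerates the whole profile consistently across databases.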
In a first pass of the main search phase, we conducted searches of four reference databases: PsycINFO, Criminal Justice Periodical Index, MEDLINE, and ProQuest Dissertations & Theses. Our first search used the strategy "eyewitness testimony memory accuracy." These searches were disappointing. Only seven citations were identified in these databases. Because we knew that there are many other documents available, we also knew that these initial searches were too narrowly focused. We decided to expand our search strategy.

We knew that changing the search strategy would change the search results in a research database. Using the example of eyewitness testimony, we conducted several searches of the same database with slight modifications to the search structure, in order to find additional citations to documents and to illustrate how changing search parameters can affect search results. This is illustrated in table 5.5, reporting the results of several different searches of the PsycINFO database. In each instance we conducted a free-text search, included all dates of publication, and searched all text fields. In the six searches conducted, the number of citations retrieved ranged from 795 citations in the least restrictive search to zero citations in the most restrictive search.

Table 5.5 Comparison of Eyewitness Testimony Search Results in PsycINFO

1. Eyewitness AND Testimony (all text fields, all dates): 795 citations
2. Eyewitness AND Testimony AND Memory (all text fields, all dates): 430 citations
3. Witness AND Testimony AND Memory (all text fields, all dates): 244 citations
4. Eyewitness AND Testimony AND Adult (all text fields, all dates): 142 citations
5. Eyewitness AND Testimony AND Memory accuracy (all text fields, all dates): 8 citations
6. Witness AND Testimony AND Memory accuracy (subject field, all dates): 0 citations

source: Authors' compilation.

By now, given the large number of citations identified, it should be clear that the sample search topic is far too broad and multifaceted to result in a focused synthesis. It would require examining hundreds of documents, some of them not useful. The documents retrieved address multiple aspects of eyewitness testimony. For a credible research synthesis, the topic would probably need to be focused more narrowly. Earlier meta-analyses have focused on issues such as the relationship between confidence and accuracy (Sporer et al. 1995), eyewitness accuracy in police showups and lineups (Steblay, Dysart, and Lindsay 2003), the magnitude of the misinformation effect in eyewitness memory (Payne, Toglia, and Anastasi 1994), the modality of the lineup (Cutler et al. 1994), and the impact of high stress on eyewitness memory (Deffenbacher et al. 2004). We need to identify a unique aspect of the topic not previously investigated.

There is also a need to be aware of earlier research syntheses. As this handbook demonstrates, the field is growing rapidly and the number of people worldwide who have an interest in evidence-based practice is increasing. At the same time that the number of published meta-analyses and research synthesis papers is increasing, collaborative international efforts to support this practice have been initiated. Along with these efforts, unique reference databases have been created by the Cochrane and Campbell Collaborations.

Cochrane and Campbell. Observing the rapid expansion of information in health care, the Cochrane Collaboration was established in 1993 to support improved health care decision making by providing empirical summaries of research. Funded by the British National Health Service, the U.K. Cochrane Centre fostered development of a multinational consortium. Guidelines for conducting and documenting research syntheses were developed. Team-based research syntheses on health care issues were coordinated. A database of clinical trials was developed as a repository of research studies that meet standards for inclusion in meta-analyses. A second database of research synthesis reports sponsored by the Cochrane Collaboration was developed. Today the Cochrane Collaboration is a worldwide collaborative research effort with offices in several countries. Information about the Cochrane Centre and Cochrane Library is available online (http://www.cochrane.org).

The Campbell Collaboration was established in 2000 to focus on education, criminal justice, and social welfare, and to provide summaries of research evidence on what works. The structure of the Campbell Collaboration is similar to Cochrane's, with multinational research synthesis teams and databases of resources (for more, see http://www.campbellcollaboration.org).

As part of the process, consider whether to consult the databases of either the Cochrane or Campbell Collaboration to find a research synthesis that may already have been conducted or to identify studies that are relevant to your project.

Because the focus of this book is research synthesis and meta-analysis, it is expected that the researcher will focus on empirical research for inclusion in his or her meta-analysis. The types of control methods, data, and analysis that will be included in the research synthesis must be considered to define the standards for including and excluding research reports. Based on our experience, many citations identified and their associated documents do not report studies that provide empirical data or meet the criteria of the research synthesist for inclusion in a meta-analysis.
Many citations retrieved in sample searches represent opinion pieces, comments, reviews of literature, case studies, or other nonempirical pieces, not reports of controlled research. The search may need further refinement to identify the types of studies appropriate for a meta-analysis, a further refinement of the search profile discussed earlier in section 5.3.2.3. The Cochrane Collaboration views this as targeting, or developing a highly sensitive search strategy.

The most extensive work on targeting bibliographic searches has been done in the field of health care. The appendix to the Cochrane Handbook for Systematic Reviews of Interventions (Higgins and Green 2006) provides a highly sensitive search strategy for finding randomized controlled trials in MEDLINE. It differentiates between the different vendors (SilverPlatter, Ovid, and PubMed) who provide access to MEDLINE. The Cochrane strategy suggests the addition of certain key terms and phrases within the search to target it to relevant types of studies, such as randomized controlled trials, controlled clinical trials, random allocation, double blind method, and so forth. The terms selected have been validated in studies of the effect of retrieving known citations from the MEDLINE database using the prescribed approaches (Robinson and Dickersin 2002). This approach takes advantage of the subject headings available and assigned to citations in MEDLINE. Julie Glanville and her colleagues have extended this work, validating that the best discriminating term is Clinical Trial (Publication Type) when searching for randomized controlled trials in MEDLINE (2006). Giuseppe Biondi-Zoccai and his colleagues suggest a simple way to improve search accuracy: excluding from the search nonexperimental, noncontrolled reports such as literature reviews, commentaries, practice guidelines, meta-analyses, and editorials (2005).

When one expands beyond health care and MEDLINE, the process of refining the search to focus on sources that would be included in a meta-analysis is more difficult, because methodology descriptors appear not to be consistently applied within reference databases. We used the sample topic on eyewitness testimony and focused the search within PsycINFO (the provider was EBSCO) by using the following limiters: age group adult (over age eighteen), subject population human, and methodology type. Table 5.6 reports the results of three searches specifying different methodologies: treatment outcome clinical trial (no citations retrieved), quantitative study (twenty-nine citations retrieved), and empirical study (153 citations retrieved). Note that this research synthesis topic is not focused on clinical trials, and so this limiter produced no results. By not selecting methodologies such as qualitative study or clinical case study, citations identified by PsycINFO indexers as using these methodological types are excluded from our search. One might find additional citations by using a key limiting term within the search itself instead of as a limiter.

Table 5.6 Targeting a Search of Eyewitness Testimony in PsycINFO

All three searches used the same search terms, (Eyewitness OR Witness OR Observer OR Victim) AND (Testimony OR Report OR Evidence) AND (Memory OR Recall OR Recollection OR False Memory OR False Recall) AND (Reliability OR Accuracy OR Distortion), with the limiters Age Group: Adult (18 years & older) AND Population: Human.

- Methodology: Treatment Outcome / Clinical Trial. Results: 0 citations.
- Methodology: Quantitative Study. Results: 29 citations.
- Methodology: Empirical Study. Results: 153 citations.

source: Authors' compilation.

Examining other research databases, the situation differs. ERIC (EBSCO vendor) and Social Work Abstracts (WebSpirs vendor) do not provide methodology fields as search limiters.
To limit by methodology in the field of education or social work using these reference databases, one would need to specify methodology limiters within the set of search terms used. Lists such as those provided in The Cochrane Handbook (Higgins and Green 2006) provide a starting point for identifying terms that may be useful to qualify the search and increase its precision. The methodology that provides usable information for research synthesis may be described by terms such as experiment, field experiment, or control group. A risk of methodology limits is that the term specified may not have been used by the author or subject indexer, resulting in a missed citation.

One must always be concerned with the trade-off between false positives (poor accuracy resulting in unnecessary work to examine and eliminate irrelevant research reports) and false negatives (relevant reports missed in the attempt to get better precision). At present, given the lack of methodological clarity in many reference databases, gaining increased precision through the use of methodological limiters within the search is a challenge.

5.3.2.5 Document Evaluation. The fifth step is to evaluate the various documents identified in the reference database searches. This will require time and effort. It may require examining many false positives if the search is overly broad. When evaluating search results, it is advisable to begin by examining the citation itself (title and abstract) to determine whether the cited document has some relevance. For example, does it meet the criteria for the synthesis? Is the methodology appropriate? Is the document's subject within the defined scope? If the study appears relevant, one needs to next examine the document itself. This may require an interlibrary loan if the local library does not have the document and it is not available electronically. In examining documents, watch for additional studies that were not identified in the reference database searches. Does the author cite an unpublished conference paper, an unpublished technical report, or the report of research conducted under a grant? Is there information about a body of work being done at a research institute in another part of the world? In both of these situations, one is advised to contact the author of the study. At this point, you may need to conduct hand searches (see chapters 4 and 6, this volume). When examining a chapter in a book, check to see whether there is another chapter in the book that is relevant. We found additional research reports on eyewitness testimony in this way. The presence of any of these hints provides an alert that additional searches may be required of additional databases that have not yet been tapped. Participation in Cochrane or Campbell may provide access to an international network of scholars, enabling you to collaborate with others to tap into that previously invisible research institute in another country.

Some researchers prefer to perform the examination of documents as they proceed (for example, conduct the search of a database and then evaluate the results before proceeding to the next search) rather than batching all of this activity until after all of the searches have been conducted. This ongoing examination enables the researcher to continuously validate the quality of the search strategy itself. Identification of false negatives might suggest a need to modify the strategy.

5.3.2.6 Final Searches. The sixth step, following examination of documents and possibly the analysis of data, is to fill in the gaps with supplemental focused searches. If the process of examining studies and conducting analyses has taken a significant amount of time, it may be important to revisit searches previously conducted to identify and recover recent documents. Here one can probably limit the search to the period of time since the search was last conducted, for example, the most recent year. This update is an important step, given that one probably does not want to publish a research synthesis and find that an important new study, or one with divergent results, has been overlooked.

5.3.3 Author Search

A useful complementary strategy is to conduct author searches in the selected reference databases. Identify several key authors who have published extensively on the topic. In the eyewitness testimony example, several authors are influential and prolific within the psychological literature: to illustrate, Elizabeth Loftus, Steven Penrod, and Gary Wells. Searching the author field of the reference database located additional citations. The drawback of this approach was the retrieval of many false positives. Each of these researchers has published in a number of areas, yielding many retrieved citations that were irrelevant to the topic at hand.

5.3.4 Citation Searching

Citation searching is an important complement to searching the literature by subject. It allows the researcher to avoid reliance on the subjectivity of indexers and avoids the inherent currency lag and biases of controlled vocabulary.
Citation searching relies on three important assumptions: one or several key chronologically early sources exist on a topic, these key sources are well known within the field, and these key sources are cited by others conducting and publishing subsequent research on the topic.

A citation index can identify all articles, reports, or other materials in the body of literature it covers that have cited the key reference. To conduct a citation search, begin by identifying several important, seminal sources (articles or books) on the research topic (see, for example, Loftus 1975; Wells 1978). These works are highly likely to be cited by researchers working on problems related to eyewitness accuracy, and as such are good candidates for a citation search.

Three indexes of interest to social and behavioral scientists allow access to research literature by cited references: Social Sciences Citation Index (SSCI), Science Citation Index (SCI), and PsycINFO. In the case of PsycINFO, the inclusion of cited references in records and the ability to conduct searches on them is a reasonably recent development. Because coverage of cited references from source articles began only in 2001, the remaining two indexes, both published by ISI, will provide more extensive coverage. To conduct a citation search, one would specify an original document, for example, the Loftus article, within the search engine. The system would search its database and produce a list of other citations contained within its database that have made reference to your initial citation.

5.4 ISSUES AND PROBLEMS

5.4.1 Need for Searching Multiple Databases or Indexes

Each database or index is designed for a particular audience and has a particular mission, as noted in the section on databases. Sources that overlap in topic, publication formats covered, or source title coverage often have at least as many differences as they do similarities. On topics that are interdisciplinary (such as stress, communication, conflict, or leadership), it is essential to consult multiple databases. This is equally important when using databases that appear so comprehensive as to be duplicative. One may elect to consult both ERIC and the Wilson Education Index to ensure comprehensive coverage in an education project.

MEDLINE and EMBASE include extensive, international coverage of the life sciences and biomedical literature, indexing thousands of publications (Weill 2008). Although a comprehensive research synthesis demands exploiting the contents of both, Margaret Sampson and her colleagues suggested that researchers might make measured trade-offs if using both databases proves impractical (2003). Failure to do so may result in an incomplete search and missed sources (numerous false negatives). Most databases provide a list of journals and sometimes serials indexed. Consulting such lists allows you to determine the overlap or independence of the indexing sources.

5.4.2 Need for Multiple Searches

In conducting an exhaustive search on a topic, multiple searches are generally needed. Searches also occur in phases.

As noted, searches in the first phase will be preliminary, exploratory searches to understand the topic, identify search terms, and confirm which reference databases are relevant for fuller searches.

The second phase will be much more extensive and thorough, and will require multiple searches of each relevant reference database. It will also require searching the grey literature. Each search on the sample topic returned different results. For example, following examination of PsycINFO (described in table 5.5), we turned to the Criminal Justice Periodical Index provided by ProQuest. Several different searches were conducted using varying search configurations. As in PsycINFO, results were different for each search. Though there was some overlap between these searches, new searches yielded new citations that appeared potentially relevant. Results of six searches of Criminal Justice Periodical Index are noted in table 5.7. Because many journals indexed in this source are not included in PsycINFO, this set of searches yielded additional documents that should be considered for inclusion in our synthesis.

Subsequent searches of ProQuest Dissertations & Theses, Lexis-Nexis Academic Universe, and MEDLINE produced similar results. In consulting each, we conducted several searches, varying the terms included and the logic structure. The result is that as we gradually expanded the fishing net, we gradually expanded the number of citations and documents of potential relevance to our research synthesis. Had we not done so, we would have failed to uncover relevant documents. It was interesting that a number of dissertations and theses have focused on various aspects of eyewitness testimony. The first, a fairly broad search, identified 149 dissertations or theses that address some aspect of eyewitness testimony. These must be examined to ascertain which are relevant to the project.
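With this many searches in play, disciplined record keeping pays off; section 5.4.8 below lists the fields the Cochrane Handbook recommends recording for every search. A minimal sketch of such a log follows; the entries are illustrative (the dates are invented, and the counts echo tables 5.5 and 5.7):

    from dataclasses import dataclass

    @dataclass
    class SearchRecord:
        database: str               # title of database searched
        vendor: str                 # host or vendor
        date: str                   # date the search was run (illustrative)
        years_covered: str
        strategy: str               # complete search strategy, all terms
        summary: str                # descriptive summary of the strategy
        language_restrictions: str
        results: int                # citations retrieved

    log = [
        SearchRecord("PsycINFO", "EBSCO", "2008-03-01", "all years",
                     "eyewitness AND testimony", "broad two-term probe",
                     "none", 795),
        SearchRecord("Criminal Justice Periodical Index", "ProQuest",
                     "2008-03-02", "all years",
                     "eyewitness AND testimony", "same probe, second database",
                     "none", 105),
    ]

    for rec in log:
        print(f"{rec.date}  {rec.database} ({rec.vendor}): "
              f"{rec.strategy!r} -> {rec.results} citations")

Keeping the log as data rather than as loose notes makes it simple to report, to spot holes in coverage, and to rerun an update search later (see section 5.3.2.6).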
Table 5.7 Eyewitness Testimony Search Results in Criminal Justice Periodical Index (ProQuest)

1. Eyewitness AND Testimony (citation and abstract, all dates): 105 citations
2. Eyewitness AND Testimony AND Memory (citation and abstract, all dates): 13 citations
3. Witness AND Psychology (citation and abstract, all dates): 37 citations
4. Eyewitness OR Witness AND Testimony OR Report AND Memory AND Accuracy (citation and abstract, all dates): 6 citations
5. Eyewitness AND Reliability AND Memory (citation and abstract, all dates): 4 citations
6. Victim AND Testimony (citation and abstract, all dates): 205 citations

source: Authors' compilation.

How much overlap is there across reference databases? The answer depends on the topic and field. Searches were conducted of three databases (PsycINFO, Criminal Justice Periodical Index, and ProQuest Dissertations & Theses) using the sample topic as articulated in figure 5.1. Each search was structured as close to the topic definition as possible, with slight variations resulting from the search configuration allowed by the database. Each database contained over one hundred citations to potentially relevant studies. The vast majority of studies (more than 88 percent) found in one reference database source were not found in the other two, as shown in table 5.8. Thus, for completeness, it is essential that the research synthesist investigating eyewitness testimony consult at least these three databases. Then recall that these three databases do not include conference papers, technical reports, and other formats, and thus additional searching is appropriate, if not critical.

Table 5.8 Comparable Eyewitness Testimony Searches from Three Databases

- PsycINFO. Search parameters: (Eyewitness OR Witness OR Observer OR Victim) AND (Testimony OR Report OR Evidence) AND (Memory OR Recall OR Recollection OR False Memory OR False Recall) AND (Reliability OR Accuracy OR Distortion) AND Age Group: Adult (18 years & older) AND Population: Human. Results: 165 citations. Uniqueness: 2 duplicated in Criminal Justice Periodical Index; 13 duplicated in Dissertations & Theses; 150 unduplicated (91 percent).
- Criminal Justice Periodicals. Search parameters: (Eyewitness OR Witness OR Observer OR Victim) AND (Testimony OR Report OR Evidence) AND (Memory OR Recall OR Recollection OR False Memory OR False Recall OR Reliability OR Accuracy OR Distortion). Results: 196 citations. Uniqueness: 2 duplicated in PsycINFO; 0 duplicated in Dissertations & Theses; 194 unduplicated (99 percent).
- ProQuest Dissertations & Theses. Search parameters: (Eyewitness OR Witness OR Observer OR Victim) AND (Testimony OR Report OR Evidence) AND (Memory OR Recall OR Recollection OR False Memory OR False Recall) AND (Reliability OR Accuracy OR Distortion). Results: 110 citations. Uniqueness: 13 duplicated in PsycINFO; 0 duplicated in Criminal Justice Periodical Index; 97 unduplicated (88 percent).

source: Authors' compilation.
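An overlap summary like table 5.8 is straightforward to compute once each database's results are reduced to comparable citation keys (in practice, normalized titles, ISSNs, or accession numbers; the toy sets below are invented for illustration):

    # Compute per-database uniqueness, table 5.8 style. The letters stand in
    # for citation keys such as normalized titles or accession numbers.
    results = {
        "PsycINFO": {"a", "b", "c", "d", "e"},
        "Criminal Justice Index": {"d", "f", "g"},
        "Dissertations & Theses": {"a", "h"},
    }

    for name, cites in results.items():
        others = set().union(*(s for n, s in results.items() if n != name))
        unique = cites - others
        share = 100 * len(unique) / len(cites)
        print(f"{name}: {len(cites)} citations, "
              f"{len(unique)} unduplicated ({share:.0f} percent)")

The same few lines, run over real exports, also yield the deduplicated master list of citations to screen.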
5.4.3 Manual Searching Versus Electronic Databases

We are blessed with powerful search engines and extensive databases. However, these sources are relatively new. Some sources, such as PsycINFO, have added older citations retroactively; however, not all have done so. As a result, you may need to conduct manual searches of the subject and author indexes of reference databases that are only available in hardcopy form.

For example, if your topic predates the initiation of the electronic ERIC reference database in 1966, you may consult paper indexes such as Education Index to identify citations before 1966. Wilson Education Index began as an electronic index in 1983 and added abstracts in 1994, but its hardcopy predecessor was launched in 1929. Wilson has provided a new service, Education Index Retrospective, that covers indexing of educational literature from 1929 to 1983. In this case, one would supplement an ERIC search with a search of the Wilson products.

5.4.4 Limits of Vocabulary Terms

Before beginning a search, it is a good idea to check the controlled vocabulary of a database by referring to the published thesaurus for that database, if one is available. As noted earlier, the thesaurus will identify those terms used by indexers (in PsycINFO, they are called subjects) to describe the subject content of a source when entering citations into the database. In most databases, the record for a citation typically contains a listing of the subjects used to index that source, in addition to an abstract and other information. In many databases (such as PsycINFO), the subject terms are hyperlinked, enabling you to do a secondary search to find articles indexed using those subjects. The thesaurus can be helpful in identifying terms related to your topic that you had not thought of using.

There are limits to the use of controlled vocabulary, however. In an emerging subfield, it may take several years for particular terms to emerge as recognized descriptors of the subject content and to find their way into the controlled vocabulary of the thesaurus. For example, Psychological Abstracts (and PsycINFO) did not include the terms intelligence or intelligence quotient until 1967, and still does not include the term IQ in its controlled vocabulary. Because indexes rarely make retrospective revisions to subject terms in materials already indexed, you must use a variety of terms to describe the subject content of a new field. Human error also limits the use of subject terms. Some indexers, for example, may not be familiar with an emerging field, or may have biases against certain terms, limiting reliance on the controlled vocabulary.

On the other hand, there are limits to free-text searching, especially if one does not consult a thesaurus. The most significant problems are the occurrence of false negatives (failure to find relevant sources that exist) because important subject descriptors were not included in the search structure, and the identification of false positives. The bottom line is that it makes sense to include both thesaurus subject terms and relevant nonthesaurus terms in your search. This makes it even more critical that your early work and planning be thorough and grounded in reading and in preliminary searching forays into the databases.

5.4.5 Underrepresented Research Literature Formats

As we discussed earlier in the context of database selection, some research publication formats are not as well represented in reference databases as others. Locating information about studies not published in the journal literature or published in non-English languages (as well as obtaining such documents) can be challenging. It bears repeating that, after constructing or beginning a literature search strategy, locating such material takes persistence and a search strategy that casts a broad net for relevant literature.

5.4.5.1 Reports and Papers. Some conferences publish proceedings that may be covered by a relevant indexing service (database). Some researchers may take the initiative to submit a conference paper to a service such as ERIC if their research is within the domain of a particular index. However, information on much of this material remains inaccessible as fugitive or grey literature in a discipline (Denda 2002).

5.4.5.2 Books and Book Chapters. Information on book-length studies is not difficult to find, largely because of the extensive, shared library cataloging databases used for many years by libraries. Primary among these is OCLC WorldCat, which contains records for books, manuscripts, and nonprint media held by thousands of libraries. More recently, it has incorporated information about Internet sites, datasets, media in emerging formats, and so forth.

Information on the individual chapters within books remains challenging, however, especially regarding edited volumes in which chapters are written by many individuals. PsycINFO is unique in that it includes citations to English-language books and chapters back to 1987, with selected earlier coverage.
Once identified, most books published by either commercial or noncommercial publishers can be acquired more easily than unpublished material.

5.4.5.3 Conference Papers and Proceedings. In addition to the format-specific databases, you may need to consult conference programs in a relevant field to identify important unpublished documents. Acquiring conference papers often requires time and persistence, especially when it comes to the complete text of a conference presentation (plus any presentation media) not subsequently published in a journal article, in a book chapter, or as part of a conference proceedings book produced by a commercial publisher.

Although these materials may provide valuable information, because they are not published in the typical sense of the word and may not have been through a review process, the results may be preliminary and the quality of the research may be unknown. Thus, such research must be evaluated and used accordingly.

5.4.6 Inconsistent Coverage Over Time

From their inception in the 1960s, bibliographic databases suffered from uneven coverage and, to some extent, this continues as publication formats have evolved and research has become more interdisciplinary. For example, the PsycINFO bibliographic database provides researchers with retrospective coverage of the discipline's research literature, dates of coverage that span many decades, and a highly structured record format with a supporting controlled subject vocabulary. Even so, it does present many idiosyncrasies in coverage that bear consideration. The controlled vocabulary was uniformly applied only to citations indexed after 1967. Books and book chapters have been consistently added since 1987, but only those published in English. The ability to search using cited references enhances the usefulness of the database, but the addition of this feature was relatively recent. Its coverage of open-source e-journals is inconsistent. The case of PsycINFO illustrates a basic caveat to keep in mind when using any bibliographic database: the researcher should examine coverage and usage notes when using a file.

5.4.7 Non-U.S. Publication

Many scholars rely heavily on sources published in the United States and rely on reference databases published in the United States. However, there is much research conducted and published outside the United States and in languages other than English, and studies published in English appear to be associated with larger effects (see chapter 6). In many disciplines, the researcher must find ways to identify documents outside the limited scope of their own language and nationality. Once a document has been identified that is written in German, Italian, Russian, or Chinese, you will have to examine it to determine whether it is relevant to your subject, and then extract the important details on methodology, results, and conclusions. This may require translation. At times, an available English-language abstract may provide an initial clue whether the document is relevant. If a document abstract is not available in a language you read, you will need to find a way to ascertain whether the contents might be relevant. Participation in a collaborative network such as the Cochrane or the Campbell may provide access to peers with related interests who may be willing to collaborate and assist in your task. They may be willing to assist in the search of non-U.S. databases. They may be willing to assist with translations of relevant non-English documents.

Consulting non-English or non-American documents is important in many disciplines and becomes especially important in exploring broad topics. For example, non-Western and non-U.S. literature becomes important in the study of the effectiveness of health-care approaches in the treatment of certain diseases. If studying a health care problem that may be treated with acupuncture, failure to consult sources that provide access to publications such as Clinical Acupuncture & Oriental Medicine or Journal of Traditional Chinese Medicine (Beijing) may be a major error.

Non-U.S. literature is also relevant to understanding many aspects of behavior in organizations: for instance, decision making, motivation, and negotiation. The field of organizational behavior was dominated for many years by researchers proposing theories and testing them in the United States. Geert Hofstede's studies of culture demonstrated that differences in national culture directly influence organizational culture (1987, 1993). When some theories developed in the United States have been tested in other countries and other cultures, conflicting data have sometimes emerged, calling into question the cross-cultural applicability of the theoretical propositions. Failure to include non-English and non-Western sources may lead to bias and to questions about the generalizability of the research synthesis. To explore the non-U.S. literature, you may need to consult a colleague in another country who has access to reference databases covering literature not locally available to you.
There may be limits, however. How much searching must one do? How much time must be spent to identify that last missing relevant document is a judgment call. You must do enough searching to make certain that you have found all of the important relevant sources. On the other hand, a task without end will never be completed and published. How many hours of additional searching are justified to find one more citation?

5.4.8 Adequate Search Documentation

In documenting your search, you must disclose enough information to enable the reader to replicate your work. The reader must also be able to identify the limits of your search, and ways in which she or he might build on your work. A researcher should make this information transparent to subsequent users of the research synthesis.

Unfortunately, researchers too often do not disclose enough information in publications to enable the reader to do either of these. As readers, we often do not know what sources (databases) were consulted or the search strategy used. Of concern is the possibility that a researcher does not keep records of what reference databases were searched on what dates with what search strategies. Additionally, poor record keeping of searches conducted may lead to repetition of searches, or to holes in the search.

In section 5.2.2 of its Cochrane Handbook for Systematic Reviews of Interventions, the Cochrane Collaboration provides guidance on record keeping and disclosure (Higgins and Green 2006). They suggest that for each search conducted, the researcher should record

• title of database searched,
• name of the host or vendor,
• date of the search,
• years covered by the search,
• complete search strategy with all search terms used,
• descriptive summary of the search strategy, and
• language restrictions.

In its handbook, Cochrane even provides examples of search strategies for electronic databases (Higgins and Green 2006). When arriving at the level of the individual citation and document, we find that it is sometimes helpful to record the unique reference database accession number of each citation identified, in addition to bibliographic information (author, title, date, and so on).
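To make this kind of record keeping routine, it can help to treat each search as a structured record from the start. The short Python sketch below shows one minimal way to capture the fields recommended above, plus the accession numbers we suggest, in a cumulative log file; the field names, the CSV format, and every value in the sample entry are our own illustrative choices, not part of the Cochrane guidance.

```python
import csv
from dataclasses import dataclass, asdict

@dataclass
class SearchRecord:
    """One row of a search log, covering the fields suggested by
    the Cochrane Handbook (Higgins and Green 2006)."""
    database: str               # title of database searched
    host_or_vendor: str         # name of the host or vendor
    date_of_search: str         # date the search was run
    years_covered: str          # years covered by the search
    full_strategy: str          # complete strategy, all terms and operators
    strategy_summary: str       # descriptive summary of the strategy
    language_restrictions: str  # any language restrictions applied
    accession_numbers: str      # unique identifiers of citations retrieved

def append_to_log(record: SearchRecord, path: str = "search_log.csv") -> None:
    """Append one search record to a cumulative CSV log."""
    row = asdict(record)
    with open(path, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=row.keys())
        if fh.tell() == 0:   # brand-new file: write the header row first
            writer.writeheader()
        writer.writerow(row)

# A hypothetical entry; every value here is invented for illustration.
append_to_log(SearchRecord(
    database="PsycINFO",
    host_or_vendor="EBSCOhost",
    date_of_search="2008-06-02",
    years_covered="1967-2008",
    full_strategy='("eyewitness memory" OR "eyewitness testimony") AND stress',
    strategy_summary="Eyewitness terms crossed with stress terms",
    language_restrictions="none",
    accession_numbers="1994-12345-001; 2004-98765-003",
))
```

A log of this kind also answers the replication and disclosure concerns raised above: the complete strategies can be reproduced verbatim in an appendix, and the accession numbers make it easy to spot citations retrieved more than once across databases.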
5.4.9 World Wide Web and Search Engines

Many people jump quickly to the information superhighway to find answers to questions. Some sources are available online. Most reference databases, however, are not available directly online. There is typically a subscription fee, paid by your local institution, for access to core reference databases. Conducting your search using an Internet search engine such as AltaVista, Google, Lycos, MSN, or Yahoo will not provide access to the contents of reference databases such as PsycINFO or the Criminal Justice Periodical Index. Your search may provide other information addressed in other chapters.

A new service, Google Scholar, has appeared only recently. Although it shows promise, its scope is limited to materials accessible on the Web at this writing; contents of proprietary databases such as PsycINFO are generally not accessed. It thus provides access to books that can be identified in publicly available library catalogs, journals that are published online (for example, publications of Springer), and journal articles that are accessed through net-available research databases such as ERIC and PubMed. Only time will tell whether additional content is added.

5.4.10 Deciding When to Stop Searching

Knowing when your task is complete is as difficult as defining the topic, structuring the search, and deciding what is truly relevant. We are unaware of any empirical guidelines for completeness in the social sciences. Mike Clarke, director of the UK Cochrane Centre, describes this as one of the difficult balances that has to be achieved in doing a synthesis (personal communication, November 26, 2006). It involves the trade-off between finishing the project in a timely way and being fully comprehensive and complete. One must be pragmatic. This is a judgment task the research synthesist must embrace. We assume that you have followed the procedures suggested in this chapter. We also assume that along the way you have used other methods such as hand searching and contacting key sources for grey literature. You might want to consider the following questions:

• Have you identified and searched all of the reference databases that are likely to contain significant numbers of relevant citations?
• Have you verified that there are no systematic biases of omission (for example, that you have not excluded non-English language sources)?
• Have you modified your search strategy by adjusting search terms as you identified and examined citations highly relevant to your topic?
• As you varied the parameters in searching any single reference database, how many new, unique, relevant citations did you identify in the last two searches?
• As you compared the results from searching complementary reference databases, how many new, unique, relevant citations were identified in the last two searches?
• If you conducted author searches on three or four of the most prolific authors on your topic, how many new, unique, relevant citations were retrieved in the last search or two?

If your answers to these questions indicate that the rate of identification of new citations in your three most recent searches is less than 1 percent, you are probably finished searching, or very close. If the last ten hours of intense database searching have produced no new relevant citations, you are probably finished. If, on the other hand, you are still finding significant numbers of new relevant citations with each new search, then the task is not complete.
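The 1 percent guideline is easy to apply if, for each successive search, you log the unique identifiers (such as database accession numbers) of the relevant citations it retrieved. The Python sketch below shows one way to operationalize the rule; the identifiers and the choice of denominator (all citations retrieved in the recent searches) are our own illustrative assumptions, not a fixed standard.

```python
def new_citation_rate(searches: list[set[str]], last: int = 3) -> float:
    """Proportion of citations retrieved in the most recent searches
    that no earlier search had already found.

    `searches` is a chronological list containing one set of unique
    citation identifiers (e.g., accession numbers) per search.
    """
    earlier = set().union(*searches[:-last])   # everything found before
    recent = searches[-last:]
    retrieved = sum(len(s) for s in recent)    # citations seen recently
    new = len(set().union(*recent) - earlier)  # genuinely new ones
    return new / retrieved if retrieved else 0.0

# Hypothetical log: an initial broad search, then three follow-up
# searches that turn up nothing the first search had not found.
log = [
    {"A1", "A2", "A3", "A4", "A5", "A6"},  # initial database search
    {"A1", "A2"},                          # adjusted search terms
    {"A2", "A3", "A5"},                    # complementary database
    {"A4", "A6"},                          # author searches
]
if new_citation_rate(log) < 0.01:
    print("New-citation rate below 1 percent; searching is probably done.")
```

The accession numbers recorded in the search log described in section 5.4.8 supply exactly the inputs this kind of tally needs.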
5.5 APPENDIX: SELECTED DATABASES IN THE BEHAVIORAL AND SOCIAL SCIENCES

The databases listed represent a selection of bibliographic files indexing empirical research studies. Particulars such as years of coverage, update frequency, and specific features are subject to frequent change, so much of this information should be verified with the database producer or vendor or by consulting members of a library reference department. Because database vendors are subject to corporate mergers and producers complete or terminate contracts with vendors frequently, specific database vendors are not listed. To ensure accurate and up-to-date information on availability and specific features, we recommend consulting the producer websites provided below or library reference staff. The Gale Directory of Databases is a valuable tool for verifying changes in database scope, coverage, and availability.

Ageline

Years of coverage: 1978, with selective earlier coverage
Producer: American Association of Retired Persons (AARP), Policy and Research, Washington, D.C. (http://www.aarp.org)
Updates: Bimonthly
Publication formats included: Journal articles, books and book chapters, government reports, dissertations, policy and research reports, conference proceedings
Controlled vocabulary: Thesaurus of Aging Terminology, 8th ed. (2005). Available for download.

This file focuses broadly on the area of social gerontology, specifically the social, behavioral, and economic experiences of those over age fifty. It cites a limited amount of consumer-oriented material, although the emphasis is on research publications. Freely available for searching from the AARP website (http://www.aarp.org/research/ageline/index.html).

Campbell Library

The goal of the Campbell Collaboration is to enable people to make well-informed decisions about the effects of interventions in education, criminal justice, and social welfare. The Campbell Collaboration (C2), established in 2000, is an international nonprofit organization, modeled after the Cochrane Collaboration. C2 prepares, maintains, and disseminates systematic reviews. The Campbell Collaboration Library consists of several databases, including C2 Reviews of Interventions and Policy Evaluations (C2-RIPE) and the C2 Social, Psychological, Education, and Criminological Trials Registry (C2-SPECTR).

C2 Reviews of Interventions and Policy Evaluations (C2-RIPE)

Years of coverage: Reviews since 2000
Producer: Campbell Collaboration
Updates: Ongoing
Publication formats included: Research syntheses
Availability: Online (http://www.campbellcollaboration.org/index.asp)

The C2-RIPE database contains approved Titles, Protocols, Reviews, and Abstracts for systematic reviews in education, crime and justice, and social welfare.
C2 Social, Psychological, Education, and Criminological Trials Registry (C2-SPECTR)

Years of coverage: Most research reports are post-1950, with the oldest published in 1927 (as of February 2007)
Producer: Campbell Collaboration
Updates: Ongoing
Publication formats included: Citation, availability information, and usually an abstract describing randomized or possibly randomized trials in education, social welfare, and criminal justice
Availability: Online (http://www.campbellcollaboration.org/index.asp)

As of February 2007, the C2-SPECTR database contained nearly 12,000 entries on randomized and possibly randomized trials. C2-SPECTR was modeled after the Cochrane Central Register of Controlled Trials.

Cochrane Library

The goal of the library is to aid health-care decision making by providing high-quality, independent evidence of the effects of health-care treatments. It is supported by the Cochrane Collaboration, an international, nonprofit, independent organization established in 1993 that produces and disseminates systematic reviews of health-care interventions. It also promotes the search for evidence in the form of clinical trials and other studies of interventions. The library consists of several databases, including the Cochrane Database of Systematic Reviews, the Cochrane Central Register of Controlled Trials (CENTRAL), and the Cochrane Database of Abstracts of Reviews of Effects (DARE).

Cochrane Database of Systematic Reviews

Years of coverage: Systematic research reviews since 1993
Producer: Cochrane Collaboration
Updates: Ongoing
Publication formats included: Research syntheses
Availability: Online (http://www.cochrane.org/; http://www3.interscience.wiley.com/cgi-bin/mrwhome/106568753/HOME)

As of February 2007, the Cochrane Database of Systematic Reviews included 4,655 reviews. Systematic reviews adhere to strict design methodology. They are comprehensive (including all known instances of trials on an intervention) to minimize the risk of bias in results. They assess whether a particular health-care intervention works.

Cochrane Central Register of Controlled Trials (CENTRAL)

Years of coverage: Earliest identified study was in 1898 (as of February 2007)
Producer: Cochrane Collaboration
Updates: Ongoing
Publication formats included: Information on controlled trials of health-care interventions
Availability: Online (http://www.cochrane.org; http://www3.interscience.wiley.com/cgi-bin/mrwhome/106568753/HOME)

As of February 2007, the CENTRAL database included information on 489,167 controlled trials. CENTRAL includes the details of published articles (not the articles themselves) taken from health-care bibliographic databases (notably MEDLINE and EMBASE) and other published and unpublished sources. It includes the results of hand searches of the health-care literature and reports not contained in published indexes. It includes non-English language documents and reports produced outside the United States.

Cochrane Database of Abstracts of Reviews of Effects (DARE)

Years of coverage: Reviews since 1993
Producer: Cochrane Collaboration
Updates: Ongoing
Publication formats included: Reviews of research syntheses
Availability: Online (http://www.cochrane.org; http://www3.interscience.wiley.com/cgi-bin/mrwhome/106568753/HOME)

DARE contains abstracts of systematic reviews in the field of health care that have been assessed for quality. Each abstract includes a summary of the review together with a critical commentary about the overall quality.
As of February 2007, DARE included 3,000 abstracts of systematic reviews.

CORK: A Database of Substance Abuse Information

Years of coverage: 1978, with selected earlier coverage
Producer: Project CORK, Dartmouth Medical School (http://www.projectcork.org)
Updates: Quarterly
Publication formats: Journal articles, books and book chapters, conference papers, reports generated from federal and state agencies as well as foundations and professional organizations
Controlled vocabulary: Thesaurus of subject terms, accessible as part of the database

Interdisciplinary coverage of addiction research in clinical medicine, law, social work, and the broader social sciences literature. Includes addictive behaviors and co-dependency research.

Criminal Justice Periodical Index

Years of coverage: 1974
Producer: ProQuest, Ann Arbor, Mich. (http://www.proquest.com)
Updates: Weekly
Publication formats: Journal articles

Supplies citations and abstracts for more than 200 core titles in law enforcement and corrections practice, criminal justice, and forensics. Some indexed material is available in full text.

EconLit

Years of coverage: 1969
Producer: American Economic Association (http://www.aeaweb.org)
Updates: Monthly
Publication formats included: Journal articles, books and book reviews, chapters in edited volumes, working papers, dissertations
Controlled vocabulary: American Economic Association Classification System, previously used in the Journal of Economic Literature. The subject descriptor and corresponding code list are linked from the EconLit site (http://www.econlit.org).

International coverage of the literature on theoretical and applied topics in economics and related areas in mathematics and statistics, economic behavior, public policy and political science, and geography and area studies.

EMBASE

Years of coverage: 1974
Producer: Elsevier Science (http://www.elsevier.com)
Updates: Daily
Publication formats included: Journal articles
Controlled vocabulary: EMTREE, the EMBASE thesaurus (annual)

International coverage of the biomedical literature, with recognized strength in drug-related information and strong coverage of the international journal literature.

ERIC (Educational Resources Information Center)

Years of coverage: 1966
Producer: Computer Sciences Corporation, under contract from the U.S. Department of Education, Institute of Education Sciences, Washington, D.C. (http://eric.ed.gov)
Updates: Weekly
Publication formats included: Journal articles, unpublished research reports, conference papers, curriculum and lesson plans, and books. Many unpublished reports may be downloaded from the website at no cost.
Controlled vocabulary: ERIC Thesaurus, available online from the ERIC website

The primary reference database on all aspects of the educational process, from administration to classroom curriculum, and from preschool to adult and postdoctoral levels. Because its scope encompasses almost every aspect of the human learning process, the contents of this file are of interest to a broad range of social scientists. Freely available for searching from the ERIC site.
International Bibliography of the Social Sciences

Years of coverage: 1951
Producer: London School of Economics and Political Science (http://www.lse.ac.uk)
Updates: Weekly
Formats: Journal articles, books, book chapters

Emphasis on the fields of anthropology, economics, politics, and sociology, with selected interdisciplinary coverage of related areas such as gender studies, human development, and communication studies. Noteworthy for coverage of journal literature published outside the English-speaking world, including abstracts in multiple languages as those abstracts are available.

MEDLINE

Years of coverage: 1966. A version available from some vendors includes a retrospective component with citations back to 1950.
Producer: National Library of Medicine, Bethesda, Md.
Updates: Daily
Formats: Journal articles. An increasing number of cited articles can be downloaded.
Controlled vocabulary: Medical Subject Headings, revised annually and searchable online

Exhaustive in its coverage of biomedicine and related fields (for example, life sciences, clinical practice, nursing). MEDLINE indexes approximately 4,800 journals worldwide. It also includes coverage of health care issues and policy, clinical psychology and psychiatry, and interdisciplinary areas such as gerontology. An increasing number of items indexed are available online as full-text documents. Freely available for searching from the NLM's PubMed service (http://pubmed.gov).

National Criminal Justice Reference Service

Years of coverage: 1970
Producer: Office of Justice Programs, U.S. Department of Justice (http://www.ncjrs.gov/library.html)
Updates: Quarterly
Formats: Journal articles and newsletters, government reports (federal, state, local), statistical compendia, books, published and unpublished research reports
Controlled vocabulary: NCJRS Thesaurus, searchable online as part of the database

The focus is on criminal justice and law enforcement literature, although there is additional coverage in the areas of forensics and problems related to criminal behavior (for example, drug addiction, social deviancy, juvenile and family delinquency, and victim and survivor responses to crime). Freely available for use from the producer's website.

National Technical Information Service

Years of coverage: 1969, with selected earlier coverage
Producer: National Technical Information Service, U.S. Department of Commerce (http://www.ntis.gov)
Updates: Monthly
Formats: Unpublished research reports, datasets

An important source of information on reports resulting from government-sponsored research. Although the emphasis is on engineering, agriculture, energy, and the biological sciences, reports also cover personnel and management, public policy, and social attitudes and behavior. The version searchable from the NTIS website covers 1990 to the present, and many recent technical reports can be downloaded as full-text documents.

Papers First and Proceedings First

Years of coverage: 1993
Producer: OCLC Online Computer Library Center (http://www.oclc.org)
Updates: Semimonthly and semiweekly, respectively
Publication formats: Papers presented at professional conferences and symposia

These are complementary databases, the information for both derived from the extensive conference proceedings holdings of the British Library Document Supply Centre. Papers First contains bibliographic citations to presented papers or their abstracts, whereas Proceedings First indexes a compilation of papers and includes a list of those presented at a given conference. The Directory of Published Proceedings database, produced by InterDok (http://www.interdok.com), is not as extensive.
ProQuest Dissertations and Theses Database

Years of coverage: 1861
Producer: ProQuest, Ann Arbor, Mich. (http://www.proquest.com)
Updates: Continuous
Publication formats: Doctoral dissertations, master's theses

PQDT is a multidisciplinary index of dissertations produced by more than 1,000 American institutions and a growing number of international universities. As such, it is a unique access point to an important body of original research literature. The very earliest dissertations contain only basic bibliographic descriptions. Lengthy descriptive abstracts of dissertations are available from 1980 forward, with shorter abstracts of master's theses beginning in 1988. This product has been enhanced by the addition of page previews of indexed documents, generally about twenty-five pages in length. If available, the complete text of dissertations can be downloaded in PDF format. Otherwise, copies can be ordered in microformat or hard copy from the database distributor.

PsycINFO

Years of coverage: Pre-1900
Producer: American Psychological Association, Washington, D.C. (http://www.apa.org)
Updates: Weekly
Formats: Journal articles, books, book chapters, dissertations, research reports
Controlled vocabulary: Thesaurus of Psychological Index Terms, 11th ed. (2007)

Indexes more than 2,000 journal titles in both English and foreign languages, as well as selected English-language books and other formats. Coverage of publication formats has changed dramatically since the database began in 1967, as have search features; for example, cited reference searching began in 2002 but is not accessible in all years of coverage. Coverage of the discipline of psychology is very broad and highly interdisciplinary, including publications in biology and medicine, sociology and social work, and applied areas such as human-computer interaction and workplace behavior.

APA launched a series of complementary databases in 2004. The purpose of PsycEXTRA is to index grey literature as represented in conference proceedings; reports produced by federal and state governments, nonprofit and research organizations, and professional associations; nonprint media; and popular and trade publications. PsycBOOKS provides full-text access to books published by APA as well as selected out-of-print books and classic texts in psychology originally published in the late nineteenth and early twentieth centuries.

Science Citation Index

Years of coverage: 1900
Producer: Thomson Corporation, Stamford, Conn. (http://www.isiknowledge.com)
Updates: Weekly
Formats: Journal articles

Social Sciences Citation Index

Years of coverage: 1956
Producer: Thomson Reuters, Philadelphia (http://www.isiknowledge.com)
Updates: Weekly
Formats: Journal articles

Each of these products indexes the contents of thousands of journal publications and incorporates English-language abstracts as they appeared in the journals since the early 1990s. Although it is possible to search by keywords in titles and abstracts for indexed articles, the strength of these long-standing citation indexes remains the ability to search by cited reference; that is, using key references to discover other, more recent publications. Both SCI and SSCI are accessible via the Thomson/ISI Web of Science. Scopus, a competing product produced by Elsevier (http://www.scopus.com), began production in 2006 and promises expanded coverage of grey literature formats and increased indexing of journal titles and trade publications.

Social Work Abstracts

Years of coverage: 1977
Producer: National Association of Social Workers, Silver Spring, Md. (http://www.socialworkers.org)
Updates: Quarterly
Formats: Journal articles and dissertations

Despite significant overlap in subject scope with other files, Social Work Abstracts supplements other indexes with its coverage of the substance abuse, gerontological, family, and mental health literatures.

Sociological Abstracts

Years of coverage: 1952
Producer: Cambridge Scientific Abstracts, Bethesda, Md. (http://www.csa.com)
Updates: Monthly
Formats: Journal articles, books, conference publications, dissertations

The primary indexing tool for the journal literature of sociology (almost 2,000 titles), as well as interdisciplinary topics such as criminal justice, social work, gender studies, and population studies.

Wilson Education Index

Years of coverage: 1929
Producer: H. W. Wilson Company (http://www.hwwilson.com)
Updates: Daily
Formats: Journals, yearbooks, and monographic series

Wilson Education Index includes coverage of 475 core international English-language periodicals and other formats relevant to the field of education from 1983 to the present. It is supplemented by Education Index Retrospective, covering the period 1929 to 1983.

WorldCat

Years of coverage: Comprehensive
Producer: OCLC Online Computer Library Center, Dublin, Ohio (http://www.oclc.org)
Updates: Daily
Formats: Books, manuscripts and special collections, dissertations, media

Consisting of records of holdings contributed by over 9,000 member libraries in the United States and abroad, WorldCat is noteworthy for its breadth of publication formats and years of coverage. Although helpful for locating books, report series, and non-print media, published articles and book chapters are seldom included.

5.6 REFERENCES

APA Online. 2006. "PsycINFO Database Information." http://www.apa.org/psycinfo/products/psycinfo.html.
APA Online. 2007. "PsycINFO Journal Coverage List." http://www.apa.org/psycinfo/about/covlist.html.
Aby, Stephen H., James Nalen, and Lori Fielding. 2005. Sociology: A Guide to Reference and Information Resources, 3rd ed. Westport, Conn.: Libraries Unlimited.
Balay, Robert, ed. 1996. Guide to Reference Books. Chicago: American Library Association.
Benjamin, Ludy T., and Gary R. VandenBos. 2006. "The Window on Psychology's Literature: A History of Psychological Abstracts." American Psychologist 61(9): 941–54.
Biondi-Zoccai, Giuseppe G. L., Pierfrancesco Agostoni, Antonio Abbate, Luca Testa, and Francesco Burzotta. 2005. "A Simple Hint to Improve Robinson and Dickersin's Highly Sensitive PubMed Search Strategy for Controlled Clinical Trials." International Journal of Epidemiology 34(1): 224–25.
Boole, George. 1848. "The Calculus of Logic." Cambridge and Dublin Mathematical Journal III: 183–98.
Cohen, Claudia E., and Ebbe B. Ebbesen. 1979. "Observational Goals and Schema Activation: A Theoretical Framework for Behavior Perception." Journal of Experimental Social Psychology 15(4): 305–29.
Cutler, Brian L., Garrett L. Berman, Steven Penrod, and Ronald P. Fisher. 1994. "Conceptual, Practical, and Empirical Issues Associated with Eyewitness Identification Test Media." In Adult Eyewitness Testimony: Current Trends and Developments, edited by David Frank Ross, Don J. Read, and Michael P. Toglia. New York: Cambridge University Press.
Dale, Philip S., Elizabeth F. Loftus, and Linda Rathbun. 1978. "The Influence of the Form of the Question on the Eyewitness Testimony of Preschool Children." Journal of Psycholinguistic Research 7(4): 269–77.
Deffenbacher, Kenneth A., Brian H. Bornstein, Steven D. Penrod, and E. Kiernan McGorty. 2004. "A Meta-Analytic Review of the Effects of High Stress on Eyewitness Memory." Law and Human Behavior 28(6): 687–706.
Denda, Kayo. 2002. "Fugitive Literature in the Cross Hairs: An Examination of Bibliographic Control and Access." Collection Management 27(2): 75–86.
Fisher, David, Sandra Price, and Terry Hanstock, eds. 2002. Information Sources in the Social Sciences. Munich: K. G. Saur.
Getz, Malcolm. 2004. "Open-Access Scholarly Publishing in Economic Perspective." Working Paper 04-W14. Nashville, Tenn.: Vanderbilt University, Department of Economics.
Glanville, Julie M., Carol Lefebvre, Jeremy N. V. Miles, and Janette Camosso-Stefinovic. 2006. "How to Identify Randomized Controlled Trials in MEDLINE: Ten Years On." Journal of the Medical Library Association 94(2): 130–36.
Herron, Nancy L., ed. 2002. The Social Sciences: A Cross-Disciplinary Guide to Selected Sources, 3rd ed. Englewood, Colo.: Libraries Unlimited.
Higgins, Julian, and Sally Green, eds. 2006. Cochrane Handbook for Systematic Reviews of Interventions, version 4.2.6. Chichester: John Wiley & Sons. http://www.cochrane.org/resources/handbook/hbook.htm.
Hofstede, Geert. 1987. "The Applicability of McGregor's Theories in South East Asia." Journal of Management Development 6(3): 9–18.
Hofstede, Geert. 1993. "Leadership: Understanding the Dynamics of Power and Influence in Organizations." Academy of Management Executive 7(1): 81–93.
Johnson, Craig L. 1978. "The Effects of Arousal, Sex of Witness and Scheduling of Interrogation on Eyewitness Testimony." Dissertation Abstracts International 38(9-B) (March): 4427–28.
Koriat, Asher, Morris Goldsmith, and Ainat Pansky. 2000. "Toward a Psychology of Memory." Annual Review of Psychology 51: 481–537.
Loftus, Elizabeth F. 1975. "Leading Questions and the Eyewitness Report." Cognitive Psychology 7(4): 560–72.
Loftus, Elizabeth F. 1979/1996. Eyewitness Testimony. Cambridge, Mass.: Harvard University Press.
Mann, Thomas. 2005. The Oxford Guide to Library Research. Oxford: Oxford University Press.
National Library of Medicine. 2000. "2000 Cumulated Index Medicus: The End of an Era." NLM Newsline 55(4). http://www.nlm.nih.gov/archive/20040423/pubs/nlmnews/octdec00/2000CIM.html.
National Library of Medicine. 2004. "Index Medicus to Cease as Print Publication." NLM Technical Bulletin 338. http://www.nlm.nih.gov/pubs/techbull/mj04/mj04_im.html.
National Library of Medicine. 2006. "Index Medicus Chronology." http://www.nlm.nih.gov/services/indexmedicus.html.
National Technical Information Service. 2006. Government Research Center: NTIS Database. Washington, D.C.: U.S. Department of Commerce, Technology Administration, National Technical Information Service. http://grc.ntis.gov/ntisdb.htm.
Payne, David G., Michael P. Toglia, and Jeffrey S. Anastasi. 1994. "Recognition Performance Level and the Magnitude of the Misinformation Effect in Eyewitness Memory." Psychonomic Bulletin & Review 1(3): 376–82.
Reed, Jeffrey G., and Pam M. Baxter. 2003. Library Use: Handbook for Psychology, 3rd ed. Washington, D.C.: American Psychological Association.
Robinson, Karen A., and Kay Dickersin. 2002. "Development of a Highly Sensitive Search Strategy for the Retrieval of Reports of Controlled Trials Using PubMed." International Journal of Epidemiology 31(1): 150–53.
Sampson, Margaret, Nicholas J. Barrowman, David Moher, Terry P. Klassen, et al. 2003. "Should Meta-analysts Search Embase in Addition to Medline?" Journal of Clinical Epidemiology 56(October): 943–55.
Shannon, Claude Elwood. 1940. "A Symbolic Analysis of Relay and Switching Circuits." Master's thesis, Massachusetts Institute of Technology, Department of Electrical Engineering.
Siegel, Judith M., and Elizabeth F. Loftus. 1978. "Impact of Anxiety and Life Stress upon Eyewitness Testimony." Bulletin of the Psychonomic Society 12(6): 479–80.
Singer, Robert N., Gene G. Korienek, and Susan Risdale. 1980. "The Influence of Learning Strategies in the Acquisition, Retention, and Transfer of a Procedural Task." Bulletin of the Psychonomic Society 16(2): 97–100.
Sporer, Siegfried Ludwig, Steven Penrod, Don Read, and Brian Cutler. 1995. "Choosing, Confidence, and Accuracy: A Meta-Analysis of the Confidence-Accuracy Relation in Eyewitness Identification Studies." Psychological Bulletin 118(3): 315–27.
Steblay, Nancy, Jennifer Dysart, Solomon Fulero, and R. C. L. Lindsay. 2003. "Eyewitness Accuracy Rates in Police Showup and Lineup Presentations: A Meta-Analytic Comparison." Law and Human Behavior 27(5): 523–40.
Venn, John. 1880. "On the Diagrammatic and Mechanical Representation of Propositions and Reasonings." Dublin Philosophical Magazine and Journal of Science 9(59): 1–18.
Weill Medical College of Cornell University. 2008. "EMBASE." http://library2.med.cornell.edu/Tutorials/RLS/ch4_BiomedDBClass/4.2.3_majorDB_embase.html.
Wells, Gary L. 1978. "Applied Eyewitness Testimony Research: System Variables and Estimator Variables." Journal of Personality & Social Psychology 36: 89–103.
Wells, Gary L., and Elizabeth A. Olson. 2003. "Eyewitness Testimony." Annual Review of Psychology 54: 277–95.
6

GREY LITERATURE

HANNAH R. ROTHSTEIN
Baruch College

SALLY HOPEWELL
U.K. Cochrane Centre

CONTENTS

6.1 Introduction 104
6.2 Definition of Terms 104
6.3 Retrieving Grey Literature 105
  6.3.1 The Impact of the Literature Search 105
  6.3.2 Information Search 106
    6.3.2.1 Search Terms and Search Strategies 106
    6.3.2.2 Information Sources 107
      6.3.2.2.1 Electronic Bibliographic Databases 107
      6.3.2.2.2 Hand Searching Key Journals 108
      6.3.2.2.3 Conference Proceedings 108
      6.3.2.2.4 Reference Lists and Citations 109
      6.3.2.2.5 Researchers 109
      6.3.2.2.6 Research Registers 110
      6.3.2.2.7 Internet 110
6.4 How to Search 110
6.5 Grey Literature in Research Synthesis 115
  6.5.1 Health-Care Research 115
  6.5.2 Social Science and Educational Research 118
  6.5.3 Impact of Grey Literature 118
    6.5.3.1 Health Care 118
    6.5.3.2 Social Sciences 119
6.6 Problems with Grey Literature 120
6.7 Conclusions and Recommendations 121
6.8 Acknowledgments 122
6.9 Notes 122
6.10 References 122

6.1 INTRODUCTION

The objective of this chapter, and of the original in the 1994 edition of this handbook on which this one is based, is to provide methods of retrieving hard-to-find literature and information on particular areas of research.1 We begin by briefly outlining the many developments that have taken place in the retrieval of hard-to-find research results over the past decade, including the formalization of this work as a distinct research specialization.

Although we continue to focus on how to retrieve hard-to-find literature, developments in research synthesis methodology allow us to place retrieval of this literature in the context of the entire research synthesis process. The key point is that there is a critical relationship between the reliability and validity of a research synthesis and the thoroughness of, and lack of bias in, the search for relevant studies.

Since 1994, changes in technology and in scholarship have created new challenges and new opportunities for anyone interested in reading or synthesizing scientific literature. As the Internet, the Web, electronic publishing, and computer-based searching have developed, grey literature has both grown and become more accessible to researchers. Grey literature comprises an increasing proportion of the information relevant to research synthesis; it is likely to be more current than traditional literature, and channels for its distribution are improving. As a result, sifting through the grey literature, as well as locating it, has become a major task of the information retrieval phase of a research synthesis.

The idea of grey literature is not new, and the concept is beautifully illustrated in a quotation from George Minde-Pouet in 1920, cited by Dagmar Schmidmaier: "No librarian who takes his job seriously can today deny that careful attention has also to be paid to the little literature and the numerous publications not available in normal bookshops, if one hopes to avoid seriously damaging science by neglecting these" (1986). Acceptance of the term dates back to 1978, when a seminar at the University of York in the UK was recognized as a milestone for grey literature as a primary information source and resulted in the creation of the System for Information on Grey Literature in Europe (SIGLE) database, which is managed by the European Association for Grey Literature Exploitation (EAGLE). Unfortunately, this database has not been updated since 2004, and its future is uncertain.

The first scholarly society devoted to the study of grey literature, the Grey Literature Network Service (GreyNet), was founded in 1993. This group sponsors an annual international conference, produces a journal, and maintains a list of grey literature resources in the service of its mission "to facilitate dialog, research, and communication between persons and organizations in the field of grey literature . . . [and] to identify and distribute information on and about grey literature in networked environments" (http://www.greynet.org). The work of this group has highlighted the fact that traditional published sources of scientific information can no longer be counted on as the primary repositories of scientific knowledge.

6.2 DEFINITION OF TERMS

Grey literature, gray literature, and fugitive literature are synonymous terms, with grey literature the one in most common use as of this writing. The most widely accepted definition is the one agreed on at the Third International Conference on Grey Literature, held in Luxembourg in November 1997: "that which is produced on all levels of government, academics, business and industry in electronic and print formats not controlled by commercial publishers" (Auger 1998). The key aspect is that the literature's production is by a source for which publishing is not the primary activity. This definition and focus may have some relevance to research synthesists, but they are not the primary reason we are interested in grey literature. Tim McKimmie and Joanna Szurmak have offered an alternate definition in proposing that any materials not identifiable through a traditional index or database be considered grey literature (2002). The key point here is that the literature is hard to retrieve. This second definition is closer to that of fugitive literature provided in the first edition and better reflects the purpose of this chapter. Broad views of grey literature from this angle include just about everything that is not published in a peer-reviewed academic journal, whether or not it is produced by those for whom publishing is the primary activity. According to
Irwin Weintraub, scientific grey literature comprises "newsletters, reports, working papers, theses, government documents, bulletins, fact sheets, conference proceedings and other publications distributed free, available by subscription or for sale" (2000). For present purposes, all documents except journal articles that appear in widely known, easily accessible electronic databases will be considered grey literature. Although not all forms of this literature are equally fugitive, all are in danger of being overlooked by research synthesists if they rely exclusively on searches of popular electronic databases to identify relevant studies. For example, for many purposes books and book chapters should be viewed as nearly grey literature: though they are produced and distributed commercially by publishers and may be widely available, studies located in them are often difficult to identify through typical search procedures.

6.3 RETRIEVING GREY LITERATURE

6.3.1 The Impact of the Literature Search

The replicability and soundness of a research synthesis depend heavily on the degree to which the search for relevant studies is thorough, systematic, unbiased, transparent, and clearly documented. Lack of bias is the most essential feature; "if the sample of studies retrieved for review is biased, the validity of the results . . . no matter how systematic and thorough in other respects, is threatened" (Rothstein, Sutton, and Borenstein 2005, 2). Literature searching, or information retrieval, is therefore a critical part of the research synthesis process, albeit one that is often underappreciated and underemphasized. Here we will describe various ways in which bias can enter a synthesis through an inadequate search, and then explain how a more comprehensive search can minimize some of those biases. In particular, we focus on including grey literature as an attempt to lessen publication bias in research syntheses, and we provide empirical examples from both the health and social sciences to illustrate our points.

The aim of a high-quality literature search in a research synthesis is to generate as comprehensive a list as possible of both published and unpublished studies relevant to the questions posed in the synthesis (Higgins and Green 2008). Unless the searcher is both informed and vigilant, however, this aim can be subverted in various ways and bias can be introduced. Publication bias is the generally accepted term for what occurs when the research that appears in the published literature (or is otherwise readily available) is systematically unrepresentative of the population of completed studies. This is not a hypothetical issue: evidence that publication bias has had an impact on research syntheses in many areas has been clearly demonstrated. Initially, the concern was that research was accepted for publication based on the statistical significance of the study's results, rather than on the basis of its quality (Song et al. 2000). Kay Dickersin provided a concise and compelling summary of the research showing that this type of publication bias exists in both the social and biomedical sciences, for experimental as well as observational studies (2005). This body of research includes direct evidence, including editorial policies, results of surveys of investigators, and follow-up of studies registered at inception, all of which indicate that statistically nonsignificant studies are less likely to be published and, if published, appear more slowly than significant ones, a phenomenon also known as time-lag bias (Hopewell and Clarke 2001). It also includes persuasive indirect evidence that publication bias exists, including the overrepresentation of significant results in published studies and negative correlations between sample size and effect size in several large sets of published data.
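The logic behind that last piece of indirect evidence can be made concrete. When the true effect is zero, a small study must observe a large effect to reach statistical significance, while a large study reaches significance with a much smaller one; if significant results are preferentially published, the published record will show larger effects for smaller studies. The short Python simulation below is our own illustration of this mechanism, not a reanalysis of any actual data set; the sample sizes and the normal approximation to the standardized mean difference are arbitrary choices.

```python
import math
import random

random.seed(1)

published = []  # (per-group n, |observed effect|) for "published" studies
for _ in range(20000):
    n = random.choice([10, 20, 40, 80, 160])  # per-group sample size
    se = math.sqrt(2.0 / n)                   # approximate SE of d
    d = random.gauss(0.0, se)                 # true effect is exactly zero
    if abs(d) / se > 1.96:                    # only "significant" results
        published.append((n, abs(d)))         # survive to publication

# Mean published |d| falls as n rises: the negative correlation above.
for n in [10, 20, 40, 80, 160]:
    ds = [d for m, d in published if m == n]
    print(f"n = {n:3d}  mean published |d| = {sum(ds) / len(ds):.2f}")
```

Real data are noisier, but a synthesis that observes this pattern in its retrieved studies has reason to suspect that nonsignificant studies are missing.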
Readers should be aware that numerous information suppression mechanisms can bias how studies relevant to a research synthesis are identified and retrieved. One of these is language bias, resulting from the selective use by synthesists of mainstream U.S. and UK databases, selective searching for only English-language material, or the exclusion of material in other languages (Egger et al. 1997; Jüni et al. 2002). Another is availability bias, the selective inclusion of studies that are easily accessible to the researcher. A third is cost bias, the selective inclusion of studies that are available free or at low cost. A fourth is familiarity bias, the selective inclusion of studies only from one's own discipline. A fifth is duplication bias: studies with statistically significant results are more likely to be published more than once (Tramèr et al. 1997). Last is citation bias, whereby studies with statistically significant results are more likely to be cited by others and are therefore easier to identify (Gøtzsche 1997; Ravnskov 1992).

In addition to the evidence that whole studies are selectively missing are demonstrations by An-Wen Chan and his colleagues (Chan et al. 2004; Chan and Altman 2005) that the reporting of outcomes within studies is often incomplete and biased, due to the selective withholding of statistically nonsignificant results. Some of these biases are introduced by the authors of primary studies; others are

attributable to editorial policies of journals or of data- searches (for an example of the value added by a compe-
bases, while the researchers producing the synthesis intro- tent information specialist, see Wade et al. 2006). Suffi-
duce others. All lead to the same consequence, however, cient budget and time are also essential components of a
namely that the literature located for a synthesis may not good search. The likely difficulty of searching for grey
represent the population of completed studies. The biases literature and the probable need for issue by issue, page
therefore all present the same threat to the validity of a by page search of key journals (hand searching) also need
synthesis. Although it would be more accurate to refer to to be taken into account when planning the time to com-
all by the single, broad term dissemination bias, as Fujian pletion of the entire synthesis.
Song and his colleagues proposed (2000), we will follow Despite the increased costs likely to be needed to lo-
common usage and employ the term publication bias, cate and retrieve grey literature, however, there is little
while noting that readers should keep in mind the larger justification for conducting a synthesis that excludes it.
set of suppression mechanisms that we have listed here.
Those carrying out research syntheses need to ensure 6.3.2 Information Search
that they conduct as comprehensive a search as possible
to keep themselves from introducing bias into their work. 6.3.2.1 Search Terms and Search Strategies Before
It is important that they do not limit their searches to beginning the actual search, synthesis authors must clar-
those studies published in the English language, because ify the key concepts imbedded in their research question,
evidence suggests that those published in other languages and attempt to develop an exhaustive list of the terms that
may have different results (Egger et al. 1997; Jni et al. have been used to describe each concept in the research
2002). Similarly, it is important that synthesis authors go literature. A comprehensive search strategy will invari-
beyond databases they are most familiar with, those that ably involve searching for multiple terms that describe
are most easily accessible, and those that are free or inex- the independent variable of interest, and may also include
pensive. In expanding their search,, researchers should be others that describe the dependent variables or the target
aware that exclusive reliance on the references cited by population in which the synthesists are interested, or both
key relevant studies is likely to introduce bias (Gtzsche (see chapter 5 this volume for an example of how this is
1987). carried out).
The identification of unfamiliar databases can be made It is critical to take into account the fact that the search
relatively easy by using sources such as the Gale Direc- terms as well as strategies are likely to vary across differ-
tory of Databases; getting access to them, however, may ent sources of information. Authors are advised to consult
be much more difficult. Research synthesists should con- both standard reference tools such as subject encyclope-
sider attempting to get access to databases located at in- dias and subject thesauri (for example, The Encyclopedia
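In practice, those concept lists are assembled into a single boolean strategy: synonyms for the same concept are joined with OR, and the concept groups are then joined with AND. The Python sketch below shows that assembly step; the concepts and terms are invented for illustration and are not a validated strategy for any real synthesis.

```python
def build_query(concepts: dict[str, list[str]]) -> str:
    """OR together the synonyms within each concept, then AND the
    concept groups; quote multi-word terms for phrase searching."""
    groups = []
    for terms in concepts.values():
        quoted = [f'"{t}"' if " " in t else t for t in terms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)

print(build_query({
    "intervention": ["mentoring", "mentorship", "mentor program"],
    "population":   ["adolescent", "youth", "juvenile"],
    "outcome":      ["delinquency", "recidivism", "antisocial behavior"],
}))
# Output (wrapped here for readability):
# (mentoring OR mentorship OR "mentor program") AND
# (adolescent OR youth OR juvenile) AND
# (delinquency OR recidivism OR "antisocial behavior")
```

The grouped form is deliberately database-neutral; as the next paragraph notes, each database will still require its own controlled vocabulary terms, truncation symbols, and field codes.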
It is critical to take into account the fact that search terms, as well as strategies, are likely to vary across different sources of information. Authors are advised to consult standard reference tools such as subject encyclopedias and subject thesauri (for example, The Encyclopedia of Psychology or the Dictionary of Education), specialized thesauri created for specific databases (such as the Thesaurus of Psychological Index Terms for PsycINFO, the ERIC Thesaurus, or the U.S. National Library of Medicine's Medical Subject Headings, MeSH, for searching MEDLINE), and key papers that come up in preliminary searches. It is both efficient and effective to work with an information specialist at this point in the search. Our recommendation is a preliminary search for each term in multiple databases, including databases that contain grey literature. This is to ensure that the terms retrieve both unpublished and published relevant studies, and to generate synonyms to use during the main search. This work is particularly important for synthesists in the social sciences and education because of the extremely variable quality of keywording, indexing, and abstracting in these literatures (Grayson and Gomersall 2003; Wade et al. 2006). Prospective authors,

no matter what their field of research, should be aware It is important to note, however, that limiting a search
that this is an iterative process, for which they must allo- solely to electronic databases has the potential to introduce
cate adequate time and staff. Trisha Greenhalgh and bias into a synthesis. A recent systematic examination of
Richard Peacock reported that their electronic search, in- studies compared the results of manual cover-to-cover
cluding developing, refining, and adapting search strate- searching of journals (hand searching) with those result-
gies to the different databases they used, took about two ing from searching electronic databases such as MED-
weeks of specialist librarian time, and that their hand LINE and EMBASE to identify reports of randomized
search of journals took a month (2005). trials. Hand searching identified 92 to 100 percent of all
There are several guides to literature searching for reports of randomized trials found. The Cochrane Highly
research syntheses in health care that may be of use to Sensitive Search Strategy, which is used to identify re-
both synthesists and librarians interested in retrieving ports of randomized trials in MEDLINE, identified 80
grey literature, including The Cochrane Handbook for percent of the total number of reports of trials found
Systematic Reviews of Interventions (Higgins and Green (Dickersin Scherer, and Lefebvre 1994). On the other
2008). In social science and education, prospective syn- hand, complex electronic searches found 65 percent and
thesis authors may wish to look at the Information Re- simple ones found only 42 percent. The results of elec-
trieval Methods Policy Brief (Rothstein, Lavenberg, and tronic searches were better when the search was limited
Turner 2003) produced by the Campbell Collaboration to English language journals (62 percent versus 39 per-
(www.campbellcollaboration.org/MG/IRMGPolicyBrief- cent). A major reason that the electronic searches failed
Revised.pdf). More general sources that might be helpful to identify more of the reports was that those reports pub-
include Arlene Finks Conducting Research Literature lished as abstracts or in journal supplements were not
Reviews (2004) and Chris Armstrong and Andrew Larges routinely indexed by the electronic databases. Restricting
Humanities and Social Sciences (2001). the search to full reports raised the retrieval rate of com-
6.3.2.2 Information Sources It is obvious that, for plex searches to 82 percent. An additional reason for the
individual syntheses, the research question will deter- poor performance of the electronic searches was that
mine which information sources should be searched. For some reports were published before 1991, when the Na-
searches related to simple and straightforward questions, tional Library of Medicines system for indexing trial re-
most of the studies may be identified by a subset of pos- ports was not as well developed as it is today (Hopewell
sible sources. As the topic of the synthesis gets more et al. 2002). Herbert Turner and his colleagues, who con-
complex, or becomes more likely to draw on a variety of ducted research comparing the retrieval ability of hand
research literatures, the importance of multiple methods searching with searches of electronic databases to iden-
of search increases dramatically. That said, we cannot tify randomized controlled trials in education, found
imagine a thorough search that did not consider at least similar results (2004). Their results, based on a search of
the following sources: twelve education and education related journals, indi-
cated that electronic searches identified, on average, only
electronic bibliographic databases one-third of the trials identified by hand searching.
hand searching key journals Tricia Greenhalgh and Richard Peacock examined the
yield of different search methods for their synthesis of the
conference proceedings diffusion of health-care innovations, and found that ap-
reference lists and citation searching proximately one-fourth of their studies were produced by
contacting researchers electronic searches, with a yield of one useful study every
forty minutes (2005). David Ogilvie and his colleagues
research registers found in their synthesis of the literature on promoting a
the Internet shift from driving a car to walking that only four of the
69 relevant studies were in first line health databases
6.3.2.2.1 Electronic Bibliographic Databases. Search- (2005). About half of the remaining studies were found in
ing multiple large electronic bibliographic databases such a specialized transportation database, and the others by
as MEDLINE, EMBASE, ERIC, PsycINFO, and Socio- Internet searches and chance.
logical Abstracts is in most cases the primary way to iden- Although searching large, general databases may pro-
tify studies for possible inclusion in a research synthesis. vide only a fraction of relevant studies, such searches are

generally not difficult. Large databases typically have Collaborations have developed hand-search manuals and
controlled languages (thesauri), the full range of boolean related materials to assist with this activity that are avail-
search operators (for example, AND, OR, NOT), the abil- able for download (www.Cochrane.org and www.Camp
ity to combine searches, and a range of options for ma- bellcollaboration.org).
nipulating and downloading searches. Smaller and more 6.3.2.2.3 Conference Proceedings. Conference proce
specialized databases often contain material not indexed edings are a very important source for identifying studies
in their larger counterparts. They are often created and found in the grey literature for inclusion in research syn-
operated with limited resources, however, and thus more theses. These contain studies that are often not published
focused on getting the information in than on retrieving as full journal articles and their abstracts are not gener-
it. Most have rudimentary thesauri, if any, and a limited ally included in electronic databases; thus relevant stud-
set of search tools. These limitations can make searching ies may be identified only by searching the proceedings
of small, specialized databases time-consuming and inef- of the conference. Some conference proceedings are
ficient, but their potential yield can make them valuable. available electronically from professional society web-
Diane Helmer and colleagues, for example, found that sites, while others may be issued as compact discs, or
searches of specialized databases were the most effective only on paper. Many conferences do not even publish
strategy when producing potentially relevant studies for proceedings, and require a search of the conference pro-
two syntheses of randomized trials of health-care inter- gram, and subsequent contact with the abstract authors in
ventions, but their definition of specialized databases in- order to retrieve potentially relevant studies.
cluded several large trial registries as well as dissertation Considerable evidence shows that many studies re-
abstracts (2001). ported at conferences do not get published in journals,
6.3.2.2.2 Hand Searching Key Journals. Targeted and that this is often for reasons unrelated to the quality
searching of relevant individual journals can overcome of the study. A recent research synthesis of assessments
problems caused by poor indexing and incomplete data- found that fewer than half (44.5 percent) of research stud-
base coverage, but is labor intensive and time consuming. ies presented as abstracts at scientific conferences were
Potentially relevant journals in every discipline are not in- published in full. This figure was somewhat higher (63.1
dexed by database producers. Additionally, not all indexed percent) for abstracts that presented the results of ran-
articles are retrievable from databases. The proportion of domized trials. Most studies that were eventually pub-
studies relevant to a research synthesis that are likely to be lished appeared in full within three years of presentation
found through hand searching only and time involved in at the meeting (Scherer, Langenberg, and von Elm 2005).
finding these is hard to estimate. Greenhalgh and Peacock For a large proportion of trials, then, the only available
(2005) reported that for their synthesis, twenty-four of source of information about a trial and its results might be
495 useful sources were identified through hand search- an abstract published in the proceedings of a conference.
ing, with a yield of one useful article for every nine hours Consistent with these findings, a study of conference ab-
of search time. Diane Helmer and her colleagues exam- stracts for research into the methods of health-care re-
ined the search strategies of two research synthesis, and search, found that approximately half of the abstracts had
reported that hand searching identified a small, but unique, not been or were unlikely ever to be published in full
fraction of potentially useful studies (2001). (Hopewell and Clarke 2001). The most common reason
The Cochrane Collaboration has invested significant given by investigators as to why research submitted to
time and effort hand searching health-care journals to lo- scientific meetings had not been subsequently published
cate reports of randomized trials of health-care interven- in full was lack of time. Other reasons included the re-
tions (Dickersin et al. 2002). The results are included in searchers belief that a journal was unlikely to accept
the Cochrane Central Register of Controlled Trials (CEN- their study, or that the authors had decided the results
TRAL) and details of which journals have been hand were not important (Callaham et al. 1998; Weber et al.
searched are contained in their Master List of Journals 1998; Donaldson and Cresswell 1996).
being searched. (http://apps1.jhsph.edu/cochrane/master- In the social sciences, Harris Cooper, Kristina DeNeve,
list.asp). The Campbell Collaborations register (C2- and Kelly Charlton (1997) and Katherine McPadden and
SPECTR) includes reports of trials of social, educational, Hannah Rothstein (2006) reported a slightly more com-
psychological or crime and justice interventions that have plex state of affairs, but one that is generally consistent
been located through hand searching; however this is a with the picture from health care. Cooper and his col-
relatively small effort. Both Cochrane and Campbell leagues surveyed thirty-three psychology researchers to
follow up on studies that had received institutional review board (IRB) approval. About two-thirds of these completed studies did not result in published summaries, and half fell out of the process for reasons other than methodological quality. Cooper and his colleagues also found that research results that were statistically significant were more likely than nonsignificant findings to be submitted for presentation or publication (1997). In a selective sample of "best papers" (the top 10 percent as judged by the evaluators) published in proceedings from three years of Academy of Management conferences, McPadden and Rothstein found that 73 percent were eventually published in journals, with an average time to publication of twenty-two months from time of submission (2006). Nearly half of these papers (43 percent) included more or different data than was described in the conference proceedings. The most common changes were the type of analyses conducted and the number of outcomes reported. In a less selective sample of all papers presented at three years of annual conferences of the Society for Industrial and Organizational Psychology, they found that only 48 percent were eventually published, and that 60 percent of these papers contained data that differed from that contained in the conference paper. In neither of McPadden and Rothstein's samples was the statistical significance of findings the major reason given for nonsubmission to or nonpublication in a journal. Given the policies of these two conferences, however, it is, in our experience, unlikely that statistically nonsignificant findings would have been submitted or accepted for presentation.

Although conference proceedings and programs are an invaluable source for identifying grey studies, the research synthesist must be aware that the risk of duplication bias (with reports of the same study being found in different conference proceedings) may increase when he or she includes these information sources, and should take appropriate precautions to avoid this (see chapter 9, this volume).
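One such precaution, sketched below under stated assumptions, is a screening pass that flags probable duplicate reports before any results are coded. This minimal Python sketch is illustrative only; the record fields and the matching rule (shared authors, same intervention, similar sample size) are our own assumptions, not a prescribed standard:

```python
# Illustrative sketch: flag conference reports that probably describe the
# same underlying study. The fields compared are assumptions for
# illustration, not a fixed standard.
def probable_duplicates(report_a, report_b):
    """Return True if two report records likely describe one study."""
    shared_authors = set(report_a["authors"]) & set(report_b["authors"])
    same_intervention = report_a["intervention"] == report_b["intervention"]
    similar_n = abs(report_a["sample_size"] - report_b["sample_size"]) <= 5
    return bool(shared_authors) and same_intervention and similar_n

reports = [
    {"authors": ["Smith", "Jones"], "intervention": "tutoring", "sample_size": 120},
    {"authors": ["Jones", "Smith"], "intervention": "tutoring", "sample_size": 118},
]
flagged = [(a, b) for i, a in enumerate(reports)
           for b in reports[i + 1:] if probable_duplicates(a, b)]
# Flagged pairs should be resolved by hand before any effect sizes are coded.
```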
6.3.2.2.4 Reference Lists and Citations. Searching the reference lists of previously retrieved primary studies and research syntheses is nearly universally recommended as an effective method of identifying potentially relevant literature. This method, also called snowballing, calls for the reference sections of all retrieved articles to be examined, and for any previously unidentified, possibly relevant material to be retrieved. This process continues until no new possibly relevant references are found.
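Viewed as a procedure, snowballing is a simple iterative expansion over reference lists that terminates when no new items surface. The following Python sketch makes that loop explicit; get_references and is_possibly_relevant are hypothetical stand-ins for the manual work of reading each reference section and screening each new item:

```python
# Minimal sketch of snowballing over reference lists. `get_references` and
# `is_possibly_relevant` are hypothetical placeholders for manual work.
def snowball(seed_studies, get_references, is_possibly_relevant):
    examined, to_examine = set(), list(seed_studies)
    while to_examine:                     # stop when no new references surface
        study = to_examine.pop()
        if study in examined:
            continue
        examined.add(study)
        for ref in get_references(study):
            if ref not in examined and is_possibly_relevant(ref):
                to_examine.append(ref)
    return examined
```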
Two studies of research syntheses provide some evidence of the effectiveness of this method. In Greenhalgh and Peacock's research synthesis of the diffusion of health-care innovations, searching the reference lists of previously identified references yielded the greatest number of useful studies (12 percent of the total useful material), and was also an efficient means of searching, with a yield of one useful paper for every fifteen minutes of searching (2005). Helmer and her colleagues found that scanning reference lists identified about a quarter of the potentially useful studies that were identified by means other than searching of electronic databases (2001).

In a process similar to manually scanning the reference lists of relevant articles, researchers are increasingly using electronic citation search engines to identify additional studies for inclusion in their research syntheses. Examples include the Science Citation Index (for use in health care), the Social Sciences Citation Index, and Social SciSearch, all of which are available through the Web of Knowledge (http://isiwebofknowledge.com/), and SCOPUS, which is available through Elsevier (www.scopus.com).

It is important, however, to note that scanning reference lists and citation searching should not be done in isolation and must only be done in combination with the other search methods described in this chapter. Overreliance on this method of searching will introduce a real threat of citation bias (Gøtzsche 1987), because the more a research study is cited, the easier it will be to identify and therefore include in the research synthesis.

6.3.2.2.5 Researchers. Research centers and experts are another important source of information that may not be available publicly. Contacting researchers who are active in a field, subject specialists, and both for-profit and nonprofit companies that conduct research (in health care, the pharmaceutical industry or medical device companies; in psychology or education, test publishers, consulting firms, government agencies, and so on) can be an important way of identifying ongoing or completed but unpublished studies. For example, Greenhalgh and Peacock reported success using emails to colleagues to identify studies for their synthesis (2005). Helmer and colleagues found that personal communication with other researchers yielded about 25 percent of the potentially useful studies that were identified by means other than searching of electronic databases (2001). Retrieving studies, once identified, is a separate task. Success in the latter endeavor may depend on the goodwill of the researchers, the time they have available to respond to these requests, and the time that has elapsed since the study was conducted, as well as the policies of organizations with proprietary interests. Evidence suggests that requests for additional information about a study are not
always successful (www.mrc-bsu.cam.ac.uk/firstcontact/pubs.html) (McDaniel, Rothstein, and Whetzel 2006).

6.3.2.2.6 Research Registers. Research registers are an important way of identifying ongoing research, some of which may never be formally published. There are many national registers, such as the National Research Register in the United Kingdom, which contains ongoing health research of interest to, and funded by, the U.K. National Health Service (www.doh.gov.uk/research/nrr.htm). There are also numerous local and specialist registers, each of which is specific to its own area of interest. Identifying ongoing studies is important so that whenever a synthesis is updated, these studies can be assessed for possible inclusion. The identification of ongoing studies also helps the synthesist to assess the possibility of time-lag bias, whereby studies with statistically significant results are more likely to be published sooner than those with nonsignificant results (Hopewell et al. 2001).

Some pharmaceutical companies are making public records of their ongoing and completed trials, either through the metaRegister of Controlled Trials (www.controlled-trials.com) or by providing registers of research on their own websites. For example, an initiative by Glaxo Wellcome has led to the development of a register of phase II, III and IV studies of newly registered medicines (http://ctr.glaxowellcome.co.uk). In 2004, the Canadian Institutes of Health Research hosted a meeting to develop a plan for international registration of all studies of health-related interventions at their inception. That meeting led to issuance of the Ottawa Statement, a set of guiding principles for the development of research registers (Krleza-Jeric et al. 2005). Among the objectives of this movement are to identify and deter unnecessary duplication of research and publications, and to identify and eliminate reporting biases. The statement calls for results for all outcomes and analyses specified in the protocol, as well as any adverse effects data, to be publicly registered, whether or not the information is published elsewhere. The World Health Organization (WHO) is also working on standards for international trials registration. In August 2005, the WHO established the International Clinical Trials Registry Platform with the primary objective of ensuring that all clinical trials worldwide are uniquely identifiable through a system of public registers, and that a minimum set of results is made publicly available (www.who.int/ictrp). As these initiatives begin to influence governmental and pharmaceutical industry policies, the amount of grey health-care data available to synthesists from research registers should increase substantially. It seems unlikely at present that similar initiatives will take place in the social sciences or in other disciplines, though we would be happy to be wrong on this account.

6.3.2.2.7 Internet. The Internet is an increasingly important source of information about ongoing and completed research, particularly that which has not been formally published. Sources of grey literature that are accessed primarily by Internet searching include the websites of universities, major libraries, government agencies, nongovernmental organizations (including nonprofit research organizations, think-tanks, and foundations), professional societies, corporations, and advocacy groups. Specific examples of some of these are listed in section 6.4.2.2.6. The Internet is often particularly helpful for identifying documents produced by local government authorities because these are increasingly published online.

Searching the Internet can, however, be a major undertaking. According to the University of British Columbia Library (http://toby.library.ubc.ca), almost half of the Internet is unreachable or invisible using standard search engines, and specific search tools are needed to access this content. The Internet searching skills of many synthesists are fairly rudimentary. Even when searchers are skilled, many of the general search engines do not allow sophisticated searching and may identify thousands of Internet sites to check, many of which are irrelevant or untrustworthy. Even potentially relevant grey studies that are familiar to experts in a particular field often are subject to search conventions that are idiosyncratic to the site, or even to the specific collection in which they appear.

Despite the popularity of Internet searches among research synthesists, very few empirical studies have been conducted to establish whether searching the Internet is generally useful in identifying additional unpublished or ongoing studies. Gunther Eysenbach and colleagues conducted one of the few studies that examined this question (2001). These authors adapted search strategies from seven research syntheses in an attempt to find additional relevant trials. They searched 429 websites in twenty-one hours and located information about fourteen unpublished, ongoing, or recently published trials. A majority of these were considered relevant for inclusion in one of the research syntheses.

6.4 HOW TO SEARCH

A number of electronic resources, available and relatively easily accessible, may prove useful in searching for grey literature.
Table 6.1 Databases, Websites, and Gateways

General

British Library's Inside Web: includes references to papers presented at over 100,000 conference proceedings. (Subscription)

Dissertation and Theses Database (http://www.proquest.com/products_pq/descriptions/pqdt.shtml): PQDT Full Text includes 2.4 million dissertation citations from around the world, from 1861 to the present day, together with full-text dissertations that are available for download in PDF format. (Subscription)

Evidence Network (www.evidencenetwork.org): a service provided by the ESRC U.K. Centre for Evidence Based Policy and Practice. The Resources section of the website covers six major types of information resource: bibliographic and research databases; internet gateways; systematic review centers; EBPP centers; research centers; and library services. (Free access)

Index to Theses (http://www.theses.com/): claims to have all U.K. theses back to 1716, as well as a subset of Irish theses. (Subscription)

NTIS (National Technical Information Service) (http://www.ntis.gov/): has a heavy technical emphasis, but is the largest central resource for U.S. government-funded scientific, technical, engineering, and business-related information available. Contains over half a million reports in over 350 subject areas from over 200 federal agencies. There is limited free access, and affordable 24-hour access. (Subscription)

System for Information on Grey Literature in Europe (SIGLE): a project of EAGLE (European Association for Grey Literature Exploitation), SIGLE is a bibliographic database covering European grey literature in the natural sciences and technology, economics, social sciences, and humanities. It includes research and technical reports, preprints, working papers, conference papers, dissertations, and government reports. All grey literature held by the British Library in the National Reports Collection can be retrieved from here. This database has not been updated since 2004; its current status is uncertain, and it may cease to be available. (Free access)

Healthcare, Biomedicine and Science

BiomedCentral (www.biomedcentral.com/): contains conference abstracts, papers, and protocols of ongoing trials. (Free access)

U.S. National Library of Medicine (NLM) Gateway (http://gateway.nlm.nih.gov/gw/Cmd): gateway for most U.S. National Library of Medicine holdings; includes meeting abstracts relevant to health care and health technology assessment. (Free access)

Social Science, Economics and Education

ESRC Society Today (http://www.esrc.ac.uk/ESRCInfoCentre/index.aspx): contains descriptions of projects funded by the U.K. Economic and Social Research Council and links to researchers' websites and publications. (Free access)

source: Authors' compilation.

They can be sorted into four main types. Grey literature databases include studies found only or primarily in the grey literature. General bibliographic databases contain studies published in the grey and conventional literature. Specific bibliographic databases contain studies published in the grey and conventional literature in a specific subject area. Ongoing trial registers offer information about planned and ongoing studies. Tables 6.1 through 6.5 present annotated lists of important sources of grey literature in each of the four categories, as well as sources of grey literature that did not fit in any of the other categories.
Table 6.2 General Bibliographic Databases

General

Science Citation Index (http://scientific.thomson.com/products/sci/): provides access to current and past bibliographic information, author abstracts, and cited references for a broad array of science and technical journals in more than 100 disciplines, and includes meeting abstracts that are not covered in MEDLINE or EMBASE. (Subscription)

SCOPUS (http://info.scopus.com/): claims to be the largest database of citations and abstracts, with coverage of Open Access journals, conference proceedings, trade publications, and book series. Also provides access to many web sources such as author homepages, university websites, and pre-print services. Currently has 28 million abstracts, and their references. (Subscription)

Healthcare, Biomedicine and Science

Database of Abstracts of Reviews of Effects (DARE) (http://www.crd.york.ac.uk/crdweb/): contains summaries of systematic reviews about the effects of interventions on a broad range of health and social care topics. (Free access)

NHS Economic Evaluation Database (NHS EED) (http://www.crd.york.ac.uk/crdweb/): contains summaries of studies of the economic evaluation of healthcare interventions culled from a variety of electronic databases and paper-based resources. (Free access)

Health Technology Assessment (HTA) Database (http://www.crd.york.ac.uk/crdweb/): contains records of the reviews produced by the members of the International Network of Health Technology Assessment, spanning 45 agencies in 22 countries. (Free access)

Ongoing Reviews Database (http://www.crd.york.ac.uk/crdweb/): a register of ongoing systematic reviews of health care effectiveness, primarily from the United Kingdom. (Free access)

Social Science, Economics and Education

Social Sciences Citation Index (http://scientific.thomson.com/products/ssci/): provides access to current and past bibliographic information, author abstracts, and cited references for a large number of social science journals in over 50 disciplines. Includes conference abstracts relevant to the social sciences from 1956 to date. (Subscription)

SOLIS (Social Sciences Literature Information System) (http://www.cas.org/ONLINE/DBSS/solisss.html): German-language social science database including grey literature and books as well as journals, searchable in English. Access may be available via the Infoconnex service at http://www.infoconnex.de/ (Subscription)

SSRN (Social Science Research Network) (http://www.ssrn.com/index.html): contains abstracts of over 100,000 working papers and forthcoming papers in accounting, economics, entrepreneurship, management, marketing, negotiation, and similar disciplines. Also has downloadable full text for about 100,000 documents in these fields. (Free access)

source: Authors' compilation.

Those that are not of general interest primarily cover either health care or one of the social sciences or education. Databases for some other fields in which research syntheses are commonly done are mentioned when we are aware of them. These sources are correct at the time of press but may change.

Whichever sources are searched, it is important that the authors of the research synthesis are explicit and document in their synthesis exactly which sources have been searched (and when) and which strategies they have used, so that the reader can assess the validity of the conclusions of the synthesis based on the search that has been conducted.
Table 6.3 Subject- or Country-Specific Databases

Healthcare, Biomedicine and Science

Alcohol Concern (http://www.alcoholconcern.org.uk/servlets/wrapper/publications.jsp): holds the most comprehensive collection of books and journals on alcohol-related issues in the United Kingdom. (Free access)

Biological Abstracts (http://scientific.thomson.com/products/ba/): indexes articles from over 3,700 serials each year, and indexes papers presented at meeting symposia and workshops in the biological sciences. (Subscription)

CINAHL (www.cinahl.com): includes nursing theses, government reports, guidelines, newsletters, and other research relevant to nursing and allied health. (Free access to some information)

Computer Retrieval of Information on Scientific Projects (CRISP) (http://crisp.cit.nih.gov/): a searchable database of federally funded biomedical research projects, maintained by the U.S. National Institutes of Health. It includes projects funded by the National Institutes of Health, Substance Abuse and Mental Health Services, Health Resources and Services Administration, Food and Drug Administration, Centers for Disease Control and Prevention, Agency for Health Care Research and Quality, and Office of the Assistant Secretary of Health. (Free access)

DIMDI (http://www.dimdi.de/dynamic/en/index.html): a gateway to 70 databases on medicine, drugs, and toxicology sponsored by the German Institute of Medical Documentation and Information. Searchable in English and German. (Free access)

DrugScope (http://www.drugscope.org.uk/library/librarysection/libraryhome.asp): provides access to information and resources on drugs, drug misuse, and related issues. (Charge per individual article)

Health Management Information Consortium Database (HMIC) (http://www.ovid.com/site/catalog/DataBase/99.jsp?top=2&mid=3&bottom=7&subsection=10): updated bimonthly; contains essential information from two key institutions, the Library & Information Services of the Department of Health, England, and the King's Fund Information & Library Service. It contains data from 1983 to the present. (Subscription)

Social Science, Economics and Education

British Library for Development Studies (http://blds.ids.ac.uk/blds/): calls itself Europe's largest collection on economic and social change in developing countries. (Free access)

Cogprints (http://cogprints.ecs.soton.ac.uk/): provides material related to cognition from psychology, biology, linguistics, and philosophy, and from computer, physical, social, and mathematical sciences. The site includes full-text articles, book chapters, technical reports, and conference papers. (Subscription, but provides free 30-day trial)

Criminal Justice Abstracts (http://www.csa.com/factsheets/cja-set-c.php?SID=c4fg6rpjg3v5nmn09k2kml): contains comprehensive coverage of international journals, books, reports, dissertations, and unpublished papers on criminology and related disciplines. (Subscription, but provides free 30-day trial)

Education Line (http://www.leeds.ac.uk/educol/): database of the full text of conference papers, working papers, and electronic literature supporting educational research, policy, and practice in the United Kingdom. (Free access)

EconLIT (http://www.econlit.org/): provides a comprehensive index of thirty years' worth of journal articles, books, book reviews, collective volume articles, working papers, and dissertations in economics, from around the world. (Subscription)

Education Resources Information Center (ERIC) (http://www.eric.ed.gov/): contains nearly 1.2 million citations going back to 1966 and access to more than 110,000 full-text materials at no charge. Includes report literature, dissertations, conference proceedings, research analyses, and translations of research reports relevant to education and sociology. (Free access)

International Transport Research Documentation (ITRD) (http://www.itrd.org/): a database of information on traffic, transport, road safety and accident studies, vehicle design and safety, environmental effects of transport, construction of roads and bridges, and so on. Includes books, reports, research in progress, conference proceedings, and policy-related material. (Subscription)

Labordoc (http://www.ilo.org/public/english/support/lib/labordoc/index.htm): produced by the International Labour Organisation; contains references and full-text access to literature on work. Can be searched in English, French, or Spanish. (Free access)

National Criminal Justice Reference Service (NCJRS) (http://www.ncjrs.gov/): a United States government-funded resource supporting research, policy, and program development worldwide. Includes reports, books, and book chapters as well as journal articles. (Free access)

Public Affairs Information Service (PAIS) (http://www.csa.com/factsheets/pais-set-c.php): contains references to more than 553,300 journal articles, books, government documents, statistical directories, research reports, conference reports, and internet material from over 120 countries. (Subscription)

Policy File (http://www.policyfile.com): indexes research and publication abstracts addressing the complete range of U.S. public policy research. (Subscription)

PsycINFO (www.apa.org/psycinfo/): includes dissertations, book chapters, academic and government reports, and journal articles in psychology, sociology, and criminology, as well as other social and behavioral sciences. (Subscription)

PSYNDEX (http://www.zpid.de/retrieval/login.php): abstract database of psychological literature, audiovisual media, and tests from German-speaking countries. Journal articles, books, chapters, reports, and dissertations are documented. Searchable in German and English. (Subscription)

Research Papers in Economics (RePEc) (http://repec.org/): international database of working papers, journal articles, and software components. (Free access)

Social Care Online (http://www.scie-socialcareonline.org.uk/): includes U.K. government reports and research papers in the social work and social care literature. (Free access)

Social Services Abstracts & InfoNet (http://www.csa.com/factsheets/ssa-set-c.php): includes dissertations in social work, social welfare, social policy, and community development research from 1979 to date. This is a subscription-based compilation of content from the U.K. databases Social Care Online, AgeINFO, Planex, and Urbadoc, all of which include grey literature. (Subscription)

Sociological Abstracts (http://www.csa.com/factsheets/socioabs-set-c.php): covers sociology as well as other social and behavioral sciences. It includes dissertations, conference proceedings, and book chapters from 1952 to date. (Subscription)

Other, Including Public Policy

AgeInfo (http://ageinfo.cpa.org.uk/scripts/elsc-ai/hfclient.exe?A=elsc-ai&sk): produced by the U.K. Centre for Policy on Ageing. (Free access, but there is also a subscription-based version with better search facilities at http://www.cpa.org.uk/ageinfo/ageinfo2.html.)

Australian Policy Online (http://www.apo.org.au/): a database/gateway of Australian research and policy literature. (Free access)

ChildData (http://www.childdata.org.uk/): produced by the U.K. National Children's Bureau. (Subscription)

ELDIS (http://www.eldis.org/): a gateway to all kinds of full-text grey literature on international development issues. It can be searched using simple Boolean logic, or sub-collections can be browsed. (Free access)

IDOX Information Service database (formerly Planex) (http://www.idoxplc.com/iii/infoservices/idoxinfoservice/index.htm): covers service areas of relevance to U.K. local government, especially in Scotland, including planning, transport, social care, education, leisure, environment, local authority management, and so on. (Subscription)

Directory Database of Research and Development Activities (READ) (http://read.jst.go.jp/index_e.html): a database service designed to promote cooperation among industry, academia, and government; provides information on research projects in Japan. Searchable in English and Japanese. (Free access)

TRIS Online (http://trisonline.bts.gov/search.cfm): covers reports as well as journal papers in transportation and includes English-language material from ITRD. It is mainly technical but also contains material relevant to, for example, transport-related health issues. (Free access)

Urbadoc (http://www.urbadoc.com/index.php?id=12&lang=en): similar to IDOX but focuses more on London and the south of England. (Subscription)

source: Authors' compilation.

6.5 GREY LITERATURE IN RESEARCH SYNTHESIS

6.5.1 Health-Care Research

Although one of the hallmarks of a research synthesis is the thorough search for all relevant studies, estimates vary regarding the actual use of grey literature and the types of grey literature actually included. Mike Clarke and Teresa Clarke conducted a study of the references included in Cochrane protocols and reviews published in The Cochrane Library in 1999 (2000). They found that 91.7 percent of references to studies included in syntheses were to journal articles, 3.7 percent were conference proceedings that had not been published as journal articles, 1.8 percent were unpublished material (for example, personal communication, in-press documents, and data on file), and 1.4 percent were books and book chapters.

Susan Mallett and her colleagues looked at the sources of grey literature included in the first 1,000 Cochrane systematic reviews published in issue 1, 2001 of The Cochrane Library (2000). Of these, 988 syntheses were examined.
Table 6.4 Trial Registers

Healthcare, Biomedicine and Science

BiomedCentral (www.biomedcentral.com/clinicaltrials/): publishes trials together with their International Standard Randomized Controlled Trial Number (ISRCTN) and publishes protocols of trials. (Free access)

ClinicalTrials.gov (http://clinicaltrials.gov): covers U.S. trials of treatments for life-threatening diseases. (Free access)

The Cochrane Central Register of Controlled Trials (CENTRAL) (www.cochrane.org): includes references to trial reports in conference abstracts, handsearched journals, and other resources. (Free access)

Current Controlled Trials metaRegister of Controlled Trials (www.controlled-trials.com): contains records from multiple trial registers, including the U.K. National Research Register, the Medical Editors' Trials Amnesty, the U.K. MRC Trials Register, and links to other ongoing trials registers. (Free access)

Database of Promoting Health Effectiveness Reviews (DoPHER) (http://eppi.ioe.ac.uk/cms/): produced by the U.K. EPPI-Centre; includes completed as well as unpublished randomized and nonrandomized trials in health promotion. (Free access)

The Lancet (Protocol Reviews) (www.thelancet.com): contains protocols of randomized trials. (Free access)

TrialsCentral (www.trialscentral.org): contains over 200 U.S.-based registers of trials. (Free access)

Social Science, Economics and Education

The Campbell Collaboration Social, Psychological, Educational and Criminological Trials Register (C2-SPECTR and C-PROT) (www.campbellcollaboration.org/): includes references to trials reported in abstracts, handsearched journals, and ongoing trials. (Free access)

source: Authors' compilation.

All sources of information other than journal publications were counted as grey literature sources and were identified from the reference lists and details provided within the synthesis. The 988 syntheses together contained 9,723 included studies. Fifty-two percent (n = 513) included trials for which at least some of the information was obtained from the grey literature. The sources of grey information in these syntheses are listed in table 6.6. The majority (55 percent) was unpublished information that supplemented published journal articles of trials. The high proportion of unpublished information in Cochrane reviews differs from other research syntheses in health care, where the major source of grey information was conference abstracts (Egger et al. 2003; McAuley et al. 2000). This may be because Cochrane reviewers are encouraged to search for and use unpublished information.

Within syntheses containing grey literature, grey studies contributed a median of 28 percent of the included studies (interquartile range 13 to 50 percent). In addition to studies that were entirely grey, the grey literature also provided important supplemental information for a nontrivial proportion of published studies.

Nine percent of the trials (588/6,266) included in the 513 Cochrane reviews using grey literature were retrieved from grey literature sources only. Of these grey trials, about two-thirds were retrieved from conference abstracts, about 15 percent were from government or company sources, and about 5 percent were from theses or dissertations. The remaining grey trials were located through unpublished reference sources such as personal communication with trialists, in-press documents, and data on file.
Table 6.5 Other Electronic Sources

General

Association of College and Research Libraries (http://www.ala.org/ala/acrl/acrlpubs/crlnews/backissues2004/march04/graylit.htm): reprints an article that lists websites about grey literature and how to search for it, as well as free websites that contain grey literature (some full-text). Mostly scientific and technical literature, but also some resources for other fields. (Free access)

CERUK (Current Educational Research in the UK) (http://www.ceruk.ac.uk/ceruk/): database of current or ongoing research in education and related disciplines. It covers a wide range of studies, including commissioned research and PhD theses, across all phases of education from early years to adults. (Free access)

DEFF (http://www.deff.dk/default.aspx?lang=english): Denmark's Electronic Research Library, a partnership between Denmark's research libraries co-financed by the Ministry of Science, Technology and Innovation, the Ministry of Culture, and the Ministry of Education. (Subscription)

GreyNet (Grey Literature Network Service) (http://www.greynet.org/): website founded in 1993 and designed to facilitate dialog, research, and communication in the field of grey literature. Provides access to a moderated Listserv, a Distribution List, and the Grey Journal (TGJ). (Free access to some information)

Health Technology Assessment (www.york.ac.uk/inst/crd/htadbase.htm): contains details of nearly 5,000 completed HTA publications and around 800 ongoing INAHTA (International Network of Agencies for Health Technology Assessment) projects. (Free access)

INTUTE: Health and Life Sciences (http://www.intute.ac.uk/healthandlifesciences/): Internet resources in health and life sciences (formerly known as BIOME). (Free access)

INTUTE: Medicine (http://www.intute.ac.uk/healthandlifesciences/medicine/): Internet resources in medicine (formerly known as OMNI). (Free access)

INTUTE: Social Sciences (http://www.intute.ac.uk/socialsciences/): Internet resources in the social sciences. (Free access)

NLH (http://www.library.nhs.uk/Default.aspx): U.K. National Library for Health. (Free access to some information)

NOD (http://www.onderzoekinformatie.nl/en/oi/nod/): Dutch Research Database; contains information on scientific research, researchers, and research institutes covering all scientific disciplines. (Free access)

NY Academy of Medicine: Grey Literature Page (http://www.nyam.org/library/pages/grey_literature_report): focuses on grey literature resources in medicine. It features a quarterly Grey Literature Report that lists many items available online. (Free access)

School of Health and Related Research (ScHARR) (www.shef.ac.uk/~scharr/ir/netting/): evidence-based resources in health care. (Free access)

Trawling the net (www.shef.ac.uk/~scharr/ir/trawling.html): free databases of interest to healthcare researchers working in the United Kingdom. (Free access)

UK Centre for Reviews and Dissemination (www.york.ac.uk/inst/crd/revs.htm): provides a basic checklist for finding studies for systematic reviews. (Free access)

US Government Information: Technical Reports (http://www-libraries.colorado.edu/ps/gov/us/techrep.htm): provides an annotated listing of U.S. government agencies and other U.S. government resources for technical reports. (Free access)

Virtual Technical Reports Central (http://www.lib.umd.edu/ENGIN/TechReports/Virtual-TechReports.html): contains a large list of grey-literature-producing institutions and their technical reports, research reports, preprints, reprints, and e-prints. (Free access)

source: Authors' compilation.


Table 6.6 Citations for Grey Literature in Cochrane Reviews

Source: Number of trial references*
Unpublished information: 1,259 (55 percent)
Conference abstracts: 805 (35 percent)
Government reports: 78 (4 percent)
Company reports: 66 (3 percent)
Theses and dissertations: 63 (3 percent)
Total (1,446 trials): 2,271 (100 percent)

source: Authors' compilation.
* Trials can be referenced by more than one grey literature source. 4,820/6,266 trials were referenced only by published journal articles. Seventeen trials had missing citations.

An analysis by Laura McAuley and colleagues (2000) of 102 trials from grey literature sources in thirty-eight meta-analyses found that 62 percent were referenced by conference abstracts and 17 percent were from unpublished sources, with similar proportions to the Mallett and colleagues (2000) study coming from government and company reports and theses. A study of 153 trials in sixty meta-analyses by Matthias Egger and his colleagues found that 45 percent of trials came from conference abstracts, 1 percent from book chapters, 3 percent from theses, and 37 percent from other grey literature sources, including unpublished, company, and government reports (2003). Overall, the results are similar across the studies: the majority of grey studies were retrieved from conference abstracts, though in all cases more than one-third of the studies were obtained from other sources.

6.5.2 Social Science and Educational Research

The use of grey literature in the social sciences and educational research may be even higher than in health-care research, although empirical evidence in this area is more limited. Based on survey data Diana Hicks presented in 1999, Lesley Grayson and Alan Gomersall suggested that health care and the social sciences differ greatly in the publication of research (2003). Social science research is published in a much broader set of peer-reviewed journals; additionally, a larger proportion of social science research is exclusively published in books, governmental and organizational reports, and other sources that are not as widely available as peer-reviewed journals. In her international survey, Hicks identified books as the primary nonjournal source of information in the social sciences (1999).

In a synthesis of the health and psychological effects of unemployment, 72 percent of included studies were published and 28 percent were found in the grey literature (Loo 1985). In a research synthesis by Anthony Petrosino and colleagues of randomized experiments in crime reduction, approximately one-third of the 150 trials identified were reported in government documents, dissertations, and technical reports (2000). Earlier, a synthesis of early delinquency experiments reported that about 25 percent of the included studies were unpublished (Boruch, McSweeney, and Soderstrom 1978).

In a study conducted by Hannah Rothstein, the ninety-five meta-analyses published in Psychological Bulletin between 1995 and 2005 were examined to see whether they included unpublished research, and if so, what types of research were included (2006). Of the ninety-five syntheses, sixty-nine clearly included unpublished data, twenty-three clearly did not, one apparently did, one apparently did not, and in one case the description was so unclear that it was impossible to tell. There did not appear to be any relationship between the publication date of the synthesis and whether or not it included unpublished studies. Complicating the matter was the fact that what was considered unpublished varied from study to study. All ninety-five meta-analyses included published journal articles, sixty-one included dissertations or theses, forty-two included books or book chapters, and twenty-six included conference papers. On the other hand, seven included government or technical reports, six incorporated raw data, four stated that they used data from monographs, three used information from in-press articles, and two each explicitly mentioned test manuals or national survey data. Unpublished data without a clear description of its source were included in thirty-three meta-analyses, some of which mentioned specific types of grey literature as well. The limited evidence suggests that in the social sciences, books, dissertations and theses, and government/technical reports may be the most frequent sources of grey studies.

6.5.3 Impact of Grey Literature

We now examine the evidence showing that there may be systematic differences in estimates of the same effect depending upon whether the estimates are reported in published studies or in the grey literature.
6.5.3.1 Health Care. An examination of health-care syntheses also shows systematic differences between the results of published studies and those found in the grey literature. An early example is a meta-analysis of preoperative parenteral nutrition for reducing complications from major surgery and fatalities (Detsky et al. 1987). In this study, randomized trials published only as abstracts found the treatment to be more effective than did trials published as full papers.

On the other hand, the exclusion of grey literature from meta-analyses in the health-care literature has been shown to produce both larger and smaller estimates of treatment effects. A recent synthesis by Sally Hopewell and her colleagues compared the effect of the inclusion and exclusion of grey literature on the results of a cohort of meta-analyses of randomized trials in health care (2006). Five studies met the criteria for the synthesis (Burdett, Stewart, and Tierney 2003; Egger et al. 2003; Fergusson et al. 2000; Hopewell 2004; McAuley et al. 2000). In all cases, there were more published trials included in the meta-analyses than grey trials: a median of 224 (interquartile range, or IQR, 108 to 365) versus a median of 45 (IQR 40 to 102). Additionally, the median number of participants included in published trials was also larger than in the grey trials. One study also assessed the statistical significance of the trial results included in the meta-analyses (Egger et al. 2003). Here, the researchers found that published trials (30 percent) were more likely to have statistically significant results (p < 0.05) than grey trials (19 percent).

For all five included studies, the most common type of grey literature was abstracts (55 percent). Unpublished data were the second most common (30 percent). The definition of what constituted unpublished data varied slightly because Hopewell and her colleagues used the same definition the original authors had used, which varied from study to study. However, it generally included data from trial registers, file-drawer data, and data from individual trialists. The third most common type was book chapters (9 percent). The remainder (6 percent) was a combination of unpublished reports, pharmaceutical company data, in-press publications, letters, and theses.

All five included studies found that published trials showed an overall greater treatment effect than grey trials (Burdett, Stewart, and Tierney 2003; Egger et al. 2003; Fergusson et al. 2000; Hopewell 2004; McAuley et al. 2000). This difference was statistically significant in one of the five studies (McAuley et al. 2000). Data could be combined formally for three of the five (Egger et al. 2003; Hopewell 2004; McAuley et al. 2000), which showed that, on average, published trials showed a 9 percent greater effect than grey trials. The study by Sarah Burdett and her colleagues (2003) included individual patient data meta-analyses and showed that when published trials were included they produced a 4 percent greater treatment effect than trials as a whole did. Only one study assessed the type of grey literature and its impact on the overall results (McAuley et al. 2000). This study found that published trials showed an even greater treatment effect relative to grey trials if abstracts were excluded.

6.5.3.2 Social Sciences. Much of the early work on the impact of grey literature comes from psychology. Gene Glass and his colleagues were the first to demonstrate that there was a difference in effect sizes between published and unpublished studies (1981). They analyzed eleven psychology meta-analyses published between 1976 and 1980, from areas including psychotherapy, the effects of television on antisocial behavior, and sex bias in counseling. In each of the nine instances where a comparison could be made, the average effect from studies published in journals was larger than the corresponding average effect from theses and dissertations. On average, the findings reported in journals were 33 percent more favorable to the investigators' hypothesis than those reported in theses or dissertations. In one meta-analysis of sex bias in counseling and psychotherapy, the effect was of approximately equal magnitude, but in opposite directions, in the published and the unpublished studies (Smith 1980). Mark Lipsey and David Wilson's oft-cited study of meta-analyses of psychological, educational, and behavioral treatment research found that estimates of treatment effects (standardized differences in means between the treatment group and the control group) from published studies were larger than those from unpublished studies (1993). Across the ninety-two meta-analyses for which comparison by publication status was possible, the average treatment effect for published studies was .53 (expressed as a standardized mean difference effect size); for unpublished studies the average treatment effect was .39.
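As a reminder of the metric behind these figures (a standard textbook definition, not anything specific to Lipsey and Wilson's analysis), the standardized mean difference for a treatment-control comparison is

$$
d = \frac{\bar{X}_T - \bar{X}_C}{s_{\mathrm{pooled}}},
\qquad
s_{\mathrm{pooled}} = \sqrt{\frac{(n_T - 1)s_T^2 + (n_C - 1)s_C^2}{n_T + n_C - 2}},
$$

so the published versus unpublished contrast of .53 versus .39 amounts to a gap of roughly .14 standard deviation units between the two bodies of evidence.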
Examinations of individual research syntheses show that, in most cases in which effect sizes are broken down by publication status, unpublished studies show smaller effects. G. A. de Smidt and Kevin Gorey contrasted the results of an evaluation of social work interventions based on published studies with those based on doctoral dissertations and master's theses only (1997). The meta-analysis based on published studies showed slightly more effective interventions than did the dissertations and theses, specifically 78 percent versus 73 percent. In a meta-analysis of sixty-two randomized trials of interventions designed to change behavioral intentions, Thomas Webb and Paschal Sheeran found that published studies reported larger effects for the average standardized difference in means between treatment and control groups than unpublished papers did: one-third of a standard deviation versus one-fifth (2006). They also noted that studies reporting interventions not finding statistically significant differences in behavioral intentions were significantly less likely to be published (33 percent) than studies showing them (89 percent published). Bryce McLeod and John Weisz found similar results (2004). Their examination of 121 dissertations and 134 published studies in the general youth psychotherapy literature showed that the dissertations reported effects less than half the size of those in journal articles on the same topics. Their findings are particularly compelling because, overall, the dissertations were stronger methodologically, and there appeared to be no differences in the steps taken to ensure treatment integrity. Their conclusion was that prior meta-analyses might have overestimated treatment effects because the majority of them included only published studies.

An exception to the general finding is a study by John Weisz, Carolyn McCarty, and Sylvia Valeri, which found that effects were consistent across published and non-peer-reviewed studies (2006). They noted, however, that most of the youth depression psychotherapy literature is relatively recent, and they attribute their result to an increased focus by journals in the past few years on the quality of research procedures, and a decreased emphasis on statistically significant results.

In the crime and justice literature, Friedrich Lösel and Martin Schmucker found that the mean effect size for published studies was greater than for unpublished studies, an odds ratio of 1.62 versus one of 1.16 (2005). In a meta-analysis on child skills training to prevent antisocial behavior, Lösel and Andreas Beelman reported no significant differences between the effects of published and unpublished studies, but neither the differences in effect sizes nor the power to detect a difference of a specific size were reported (2003).

In the education literature, the available evidence on this issue is very limited. A search of the ERIC database using the terms education and meta-analysis for 1999 through mid-July 2006, followed by examination of the abstracts of all resulting studies and of those full-text articles available electronically, yielded only three that included grey or unpublished literature and reported effect sizes separately by information source. In a meta-analysis of the effects of classroom placement on the self-concepts of children with learning disabilities, Batya Elbaum found minimal effects for both published (n = 20) and unpublished (n = 45) studies. The published studies, however, had a larger effect (d = 0.05) than the unpublished studies, which found no effect (d = 0.00) (2002). In a meta-analysis of the results of interventions to improve the reading skills of learning disabled students, Lee Swanson found no differences in average effects by publication status (1999). The average standardized difference in means between treatment and control groups for ninety-four independent samples from journal articles was d = 0.51, and the analogous effect for fourteen samples from dissertations and technical reports was d = 0.52. In a second meta-analysis, this time on interventions using single-subject designs, Swanson and Carole Sachse-Lee found an average treatment effect per study of d = 1.42 for journal articles, while for dissertations and technical reports the analogous effect was slightly smaller, d = 1.27 (2000).

6.6 PROBLEMS WITH GREY LITERATURE

Some researchers have suggested that studies from the grey literature should not be included in research syntheses because their methodological quality is suspect. Many research synthesists who include only published journal articles seem to be concerned that other sources of data are of lesser quality. John Weisz and his colleagues, for example, said this: "We included only published psychotherapy outcome studies, relying on the journal review process as one step of quality control" (1995, 452). Pamela Keel and Kelly Klump explained their study this way: "We restricted our review to published works. Benefits of this approach include the increased scientific rigor of published (particularly peer-reviewed) work compared with unpublished work and permanent availability of published work within the public domain for independent evaluation by others" (2003, 748). Because of this, they therefore confined their search to studies "published in peer-reviewed journals . . . to ensure that the studies met scientific standards and to facilitate access to the studies by people who want to examine the sources of our data" (366).

Other researchers believe that this view is unwarranted, arguing that publication in the academic peer-reviewed journal literature has as much, or more, to do with the incentive system for particular individuals than it does with the quality of the research. As Grayson and Gomersall pointed out, a report (not a journal article) is the primary publication medium for individuals working for
such entities as independent research organizations, professional societies, and think-tanks (2003). They went on to say that researchers working in these environments may produce work of high quality but are free from "the intense pressure . . . to publish in peer-reviewed journals" (7). We believe that incentives to publish among degree recipients influence the decision to submit a dissertation or thesis for publication at least as much as the quality of the work does. Furthermore, it has also been noted that though the traditional view of the peer review process is that it helps ensure the quality of published studies, peer review itself may be at times unreliable or biased (Jefferson et al. 2006). Grayson described some of the deficiencies of peer review in the biomedical literature; this work has possible implications for the social sciences as well (2002). Our view is that peer review cannot be taken as a guarantee that research is of good quality, nor can the lack of peer review be taken as an indicator that the quality is poor. The research synthesist must assess, de novo, the methodological quality of each study considered for possible inclusion in his or her synthesis.

A survey by Deborah Cook and Gordon Guyatt in 1993 showed that a substantial majority (78 percent) of meta-analysts and methodologists felt that unpublished material should definitely or probably be included in meta-analyses, but that journal editors were less enthusiastic about the inclusion of unpublished studies (only 47 percent felt that unpublished data should probably or definitely be included). Cook and Guyatt found that both editors' and methodologists' attitudes varied depending on the type of material, with attitudes less favorable toward including journal supplements, published abstracts, published symposia, book chapters, and dissertations, and more favorable toward unpublished presentations or materials that had never been published or presented. This survey was recently updated by Jennifer Tetzlaff and her colleagues (2006). Their findings showed that journal editors still appear less favorable toward the grey (unpublished) literature than research synthesists and methodologists do, though these differences have diminished relative to what Cook reported. Further analyses by Tetzlaff indicated that variations between journal editors and research synthesists and methodologists are related in part to differences in research synthesis experience or previous familiarity with the grey literature.

Empirical evidence supports the contention that it is harder to assess the methodological quality of randomized trials of health-care interventions found in the grey literature than of those reported in peer-reviewed journals (Egger et al. 2003; MacLean et al. 2003). In an analysis of sixty meta-analyses of health-care interventions, Egger and colleagues found that trials published in the grey literature were somewhat less likely to report adequate concealment of allocation than published trials, 33.8 percent versus 40.7 percent (2003). Trials published in the grey literature were also less likely to report blinding of outcome assessment compared to published trials, 45.1 percent versus 65.8 percent. In contrast, a study by Hopewell found no difference in the quality of reporting of allocation concealment or generation of the allocation sequence between published and grey trials; this information was often unclear in both groups (2004). Although quality and adequacy of reporting are not the same as quality and adequacy of the research, deficiencies in reporting make it difficult for a reviewer to assess a study's methodological quality.

Several studies have shown that many abstracts submitted to scientific conferences have inadequacies in the reporting of aims, methods, data, results, and conclusions (Bhandari et al. 2002; McIntosh 1996; Panush et al. 1989; Pitkin and Branagan 1999). A study of trials reported as abstracts and presented at the American Society of Clinical Oncology conferences in 1992 and 2002 found that for most of the studies it was not possible to determine key aspects of trial quality because of the limited information available in the abstract. Information was also limited on the number of participants initially randomized and the number actually included in the analysis (Hopewell and Clarke 2005).

6.7 CONCLUSIONS AND RECOMMENDATIONS

A number of key sources, then, are important to consider when carrying out a comprehensive search for a research synthesis. These include searching electronic bibliographic databases, hand searching key journals, searching conference proceedings, contacting researchers, searching research registers, and searching the Internet. These sources are likely to identify studies in both the published and grey literature. The optimal extent of the search depends very much on the subject matter and the resources available for the search. For example, those carrying out a research synthesis need to consider when it is appropriate to supplement electronic searches with a search by hand of key journals or journal supplements not indexed by electronic databases, and when it is appropriate to search by hand the conference proceedings for a specific subject area. It is
also critical to assess the quality of the studies identified from the search, regardless of whether they were found in the published or grey literature. Whatever is searched, it is important that the researcher is explicit and documents in the synthesis exactly which sources have been searched and which strategies have been used, so that the reader can assess the validity of the conclusions of the synthesis based on the search conducted.
Ultimately, however, unless one knows the number of provement of research syntheses.
studies carried out for a particular intervention or other
research question (be that in health care, social sciences, 6.9 NOTES
education or other fields), regardless of the final publica-
tion status, one will not know how many studies have 1. In the near future, we may even have to amend the
been missed, despite the rigorous literature searches con- phrase from grey literature to grey material as written
ducted. Thus, the biggest issue is how to collect, organize, media may no longer be the only archival source of
and manage centralized repositories of grey literature. In scientific information. Consider, for example, the po-
the case of health care, one suggestion put forward many tential for podcasts or video clips of scientific reports
times is the prospective registration of randomized trials as sources of data relevant to a synthesis.
at their inception (Tonks 2002; Chalmers 2001; Horton
and Smith 1999). This would enable research synthesists 6.10 REFERENCES
and others using the evidence from randomized trials to
identify all trials conducted that are relevant to a particu- Armstrong, Chris J., and J. Andrew Large, eds. 2001. Ma-
lar health-care question. In addition, research-funding nual of On-Line Search Strategies: Humanities and Social
agencies (public, commercial, and charitable), will be Sciences, 3rd ed. Abingdon, U.K.: Gower.
able to make funding decisions in the light of information Auger, Charles P. 1998. Information Sources in Grey Litera-
about trials that are already being conducted. This will, ture. Guides to Information Sources, 4th ed. London: Bow-
therefore, avoid unnecessary duplication of effort and ker Saur.
help promote collaboration between individual organiza- Bhandari, Mohit, P. J. Devereaux, Gordon H. Guyatt, Debo-
tions. Finally, if information about ongoing trials is pub- rah J. Cook, Mark F. Swiontkowski, Sheila Sprague, and
licly available, patients and clinicians will be better in- Emil H. Schemitsch. 2002. An Observational Study of Or-
formed about trials in which they could participate, which thopaedic Abstracts and Subsequent Full-Text Publications.
will in turn also help increase trial recruitment rates. As Journal of Bone and Joint Surgery, American 84-A(4): 615
far as we are aware, there are currently no plans related to 21.
the prospective registration of field trials conducted in the Boruch, Robert F., A. J. McSweeney, and E. J. Soderstrom.
social sciences and in education. In theory, however, there 1978. Randomized Field Experiments for Program Plan-
is no reason why the same principles underpinning pro- ning, Development, and Evaluation: An Illustrative Biblio-
spective registration in health care cannot be applied in graphy. Evaluation Quarterly 2(4): 65595.
the social sciences and in education, and similar benefits Burdett, Sarah, Lesley A.Stewart, and Jane F. Tierney. 2003.
achieved. Research registries more broadly construed Publication Bias and Meta-Analyses: A Practical Exam-
would also be of benefit for questions that do not involve ple. International Journal of Technology Assessment in
interventions. In addition to listing ongoing studies, we Health Care 19(1): 12934.
can envision registries attached to clearinghouses that Callaham, Michael L., Robert L. Wears, Ellen J. Weber, Chris-
maintain archived copies of the reports from completed topher Barton, and Gary Young. 1998. Positive-Outcome
studies in different fields. This, however, would need col- Bias and Other Limitations in the Outcome of Research
laboration among professional societies, universities, Abstracts Submitted to a Scientific Meeting. Journal of
other research organizations, and most likely, govern- the American Medical Association 280(3): 25457.
ments. It is our hope that by the time this handbook is in Chalmers, Iain. 2001. Using Systematic Reviews and Re-
its third edition, we will be able to report on the creation gisters of Ongoing Trials for Scientific and Ethical Trial
and growth of these structures. Design, Monitoring, and Reporting. In Systematic Reviews
Chan, An-Wen, and Douglas G. Altman. 2005. "Identifying Outcome Reporting Bias in Randomised Trials on PubMed: Review of Publications and Survey of Authors." British Medical Journal 330(7494): 753.
Chan, An-Wen, Asbjørn Hróbjartsson, Mette T. Haahr, Peter C. Gøtzsche, and Douglas G. Altman. 2004. "Empirical Evidence for Selective Reporting of Outcomes in Randomized Trials: Comparison of Protocols to Published Articles." Journal of the American Medical Association 291(20): 2457–65.
Clarke, Mike, and Teresa Clarke. 2000. "A Study of the References Used in Cochrane Protocols and Reviews. Three Bibles, Three Dictionaries, and Nearly 25,000 Other Things." International Journal of Technology Assessment in Health Care 16(3): 907–9.
Cook, Deborah J., and Gordon H. Guyatt. 1993. "Should Unpublished Data Be Included in Meta-Analyses?" Journal of the American Medical Association 269(21): 2749–53.
Cooper, Harris, Kristina DeNeve, and Kelly Charlton. 1997. "Finding the Missing Science: The Fate of Studies Submitted for Review by a Human Subjects Committee." Psychological Methods 2(4): 447–52.
de Smidt, G. A., and Kevin M. Gorey. 1997. "Unpublished Social Work Research: Systematic Replication of a Recent Meta-Analysis of Published Intervention Effectiveness Research." Social Work Research 21(1): 58–62.
Detsky, Allan S., Jeffrey P. Baker, Keith O'Rourke, and Vivek Goel. 1987. "Perioperative Parenteral Nutrition: A Meta-Analysis." Annals of Internal Medicine 107(2): 195–203.
Dickersin, Kay. 2005. "Publication Bias: Recognizing the Problem, Understanding Its Origins and Scope, and Preventing Harm." In Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments, edited by Hannah R. Rothstein, Alexander J. Sutton, and Michael Borenstein. Chichester: John Wiley & Sons.
Dickersin, Kay, Roberta Scherer, and Carol Lefebvre. 1994. "Identifying Relevant Studies for Systematic Reviews." British Medical Journal 309(6964): 1286–91.
Dickersin, Kay, Eric Manheimer, Susan Wieland, Karen A. Robinson, Carol Lefebvre, and Steven McDonald. 2002. "Development of the Cochrane Collaboration's CENTRAL Register of Controlled Clinical Trials." Evaluation and the Health Professions 25(1): 38–64.
Donaldson, I. J., and Patricia A. Cresswell. 1996. "Dissemination of the Work of Public Health Medicine Trainees in Peer-Reviewed Publications: An Unfulfilled Potential." Public Health 110(1): 61–63.
Egger, Matthias, Peter Jüni, Christopher Bartlett, F. Holenstein, and Jonathan A. C. Sterne. 2003. "How Important Are Comprehensive Literature Searches and the Assessment of Trial Quality in Systematic Reviews? Empirical Study." Health Technology Assessment 7(1): 1–76.
Egger, Matthias, Tanja Zellweger-Zähner, Martin Schneider, Christoph Junker, Christian Lengeler, and Gerd Antes. 1997. "Language Bias in Randomised Controlled Trials Published in English and German." Lancet 350: 326–29.
Elbaum, Batya. 2002. "The Self-Concept of Students with Learning Disabilities: A Meta-Analysis of Comparisons Across Different Placements." Learning Disabilities Research and Practice 17(4): 216–26.
Eysenbach, Gunther, Jens Tuische, and Thomas L. Diepgen. 2001. "Evaluation of the Usefulness of Internet Searches to Identify Unpublished Clinical Trials for Systematic Reviews." Medical Informatics and the Internet in Medicine 26(3): 203–18.
Fergusson, Dean, Andreas Laupacis, L. Rachid Salmi, Finlay A. McAlister, and Charlotte Huet. 2000. "What Should Be Included in Meta-Analyses? An Exploration of Methodological Issues Using the ISPOT Meta-Analyses." International Journal of Technology Assessment in Health Care 16(4): 1109–19.
Fink, Arlene. 2004. Conducting Research Literature Reviews: From the Internet to Paper, 2nd ed. Thousand Oaks, Calif.: Sage Publications.
Glass, Gene V, Barry McGaw, and Mary Lee Smith. 1981. Meta-Analysis in Social Research. Thousand Oaks, Calif.: Sage Publications.
Gøtzsche, Peter C. 1987. "Reference Bias in Reports of Drug Trials." British Medical Journal 295(6599): 654–56.
Grayson, Lesley. 2002. "Evidence-Based Policy and the Quality of Evidence: Rethinking Peer Review." ESRC Working Paper 7. London: ESRC UK Centre for Evidence Based Policy and Practice. http://www.evidencenetwork.org.
Grayson, Lesley, and Alan Gomersall. 2003. "A Difficult Business: Finding the Evidence for Social Science Reviews." ESRC Working Paper 19. London: ESRC UK Centre for Evidence Based Policy and Practice.
Greenhalgh, Tricia, and Richard Peacock. 2005. "Effectiveness and Efficiency of Search Methods in Systematic Reviews of Complex Evidence: Audit of Primary Sources." British Medical Journal 331(7524): 1064–65.
Helmer, Diane, Isabelle Savoie, Carolyn Green, and Arminee Kazanjian. 2001. "Evidence-Based Practice: Extending the Search to Find Material for the Systematic Review." Bulletin of the Medical Library Association 89(4): 346–52.
Hicks, Diana. 1999. "The Difficulty of Achieving Full Coverage of International Social Science Literature and the Bibliometric Consequences." Scientometrics 44(2): 193–215.
Higgins, Julian P. T., and Sally Green. 2008. Cochrane Handbook for Systematic Reviews of Interventions, version 5.0.0. Chichester: John Wiley & Sons.
Hopewell, Sally. 2004. "Impact of Grey Literature on Systematic Reviews of Randomized Trials." PhD diss. Wolfson College, University of Oxford.
Hopewell, Sally, and Mike Clarke. 2001. "Methodologists and Their Methods. Do Methodologists Write Up Their Conference Presentations or Is It Just 15 Minutes of Fame?" International Journal of Technology Assessment in Health Care 17(4): 601–3.
———. 2005. "Abstracts Presented at the American Society of Clinical Oncology Conference: How Completely Are Trials Reported?" Clinical Trials 2(3): 265–68.
Hopewell, Sally, Mike Clarke, Carol Lefebvre, and Roberta Scherer. 2002. "Handsearching Versus Electronic Searching to Identify Reports of Randomized Trials." Cochrane Database of Methodology Reviews 2002(4): MR000001.
Hopewell, Sally, Mike Clarke, Lesley Stewart, and Jane Tierney. 2001. "Time to Publication for Results of Clinical Trials." Cochrane Database of Methodology Reviews 2001(3): MR000011.
Hopewell, Sally, Steve McDonald, Mike Clarke, and Matthias Egger. 2006. "Grey Literature in Meta-Analyses of Randomized Trials of Health Care Interventions." Cochrane Database of Methodology Reviews 2006(2): MR000010.
Horton, Richard, and Richard Smith. 1999. "Time to Register Randomised Trials." Lancet 354(9185): 1138–39.
Jefferson, Tom, Melanie Rudin, Suzanne Brodney Folse, and Frank Davidoff. 2006. "Editorial Peer Review for Improving the Quality of Reports of Biomedical Studies." Cochrane Database of Methodology Reviews 2006(1): MR000016.
Jüni, Peter, F. Holenstein, Jonathan A. C. Sterne, Christopher Bartlett, and Matthias Egger. 2002. "Direction and Impact of Language Bias in Meta-Analyses of Controlled Trials: Empirical Study." International Journal of Epidemiology 31(1): 115–23.
Keel, Pamela K., and Kelly L. Klump. 2003. "Are Eating Disorders Culture-Bound Syndromes? Implications for Conceptualizing Their Etiology." Psychological Bulletin 129(5): 747–69.
Krleza-Jeric, Karmela, An-Wen Chan, Kay Dickersin, Ida Sim, Jeremy Grimshaw, and Christian Gluud. 2005. "Principles for International Registration of Protocol Information and Results from Human Trials of Health-Related Interventions: Ottawa Statement." British Medical Journal 330(7497): 956–58.
Lipsey, Mark W., and David B. Wilson. 1993. "The Efficacy of Psychological, Educational, and Behavioral Treatment: Confirmation from Meta-Analysis." American Psychologist 48(12): 1181–1209.
Loo, John. 1985. "Medical and Psychological Effects of Unemployment: A Grey Literature Search." Health Libraries Review 2(2): 55–62.
Lösel, Friedrich, and Andreas Beelmann. 2003. "Effects of Child Skills Training in Preventing Antisocial Behavior: A Systematic Review of Randomized Evaluations." Annals of the American Academy of Political and Social Science 587(1): 84–109.
Lösel, Friedrich, and Martin Schmucker. 2005. "The Effectiveness of Treatment for Sexual Offenders: A Comprehensive Meta-Analysis." Journal of Experimental Criminology 1(1): 117–46.
MacLean, Catherine H., Sally C. Morton, Joshua J. Ofman, Elizabeth A. Roth, and Paul G. Shekelle. 2003. "How Useful Are Unpublished Data from the Food and Drug Administration in Meta-Analysis?" Journal of Clinical Epidemiology 56(1): 44–51.
Mallett, Susan, Sally Hopewell, and Mike Clarke. 2002. "The Use of Grey Literature in the First 1000 Cochrane Reviews." Presented at the Fourth Symposium on Systematic Reviews, Pushing the Boundaries. Oxford (July 2–4, 2002).
McAuley, Laura, Ba Pham, Peter Tugwell, and David Moher. 2000. "Does the Inclusion of Grey Literature Influence Estimates of Intervention Effectiveness Reported in Meta-Analyses?" Lancet 356(9237): 1228–31.
McDaniel, Michael A., Hannah R. Rothstein, and Deborah Whetzel. 2006. "Publication Bias: A Case Study of Four Test Vendor Manuals." Personnel Psychology 59(4): 927–53.
McIntosh, Neil. 1996. "Abstract Information and Structure at Scientific Meetings." Lancet 347(9000): 544–45.
McKimmie, Tim, and Joanna Szurmak. 2002. "Beyond Grey Literature: How Grey Questions Can Drive Research." Journal of Agricultural and Food Information 4(2): 71–79.
McLeod, Bryce D., and John R. Weisz. 2004. "Using Dissertations to Examine Potential Bias in Child and Adolescent Clinical Trials." Journal of Consulting and Clinical Psychology 72(2): 235–51.
McPadden, Katherine, and Hannah R. Rothstein. 2006. "Finding the Missing Papers: The Fate of Best Paper Proceedings." Presented at AOM Conference, Academy of Management Annual Meeting. Atlanta, Ga. (August 11–16, 2006).
Ogilvie, David, Richard Mitchell, Nanette Mutrie, Mark Petticrew, and Stephen Platt. 2006. "Evaluating Health Effects of Transport Interventions: Methodologic Case Study." American Journal of Preventive Medicine 31(2): 118–26.
Panush, Richard S., Jeffrey C. Delafuente, Carolyn S. Connelly, N. Lawrence Edwards, Jonathan M. Greer, Sheldon Longley, and Fred Bennett. 1989. "Profile of a Meeting: How Abstracts Are Written and Reviewed." Journal of Rheumatology 16: 145–47.
Petrosino, Anthony, Robert F. Boruch, C. Rounding, Steven McDonald, and Iain Chalmers. 2000. "The Campbell Collaboration Social, Psychological, Educational and Criminological Trials Register C2-SPECTR." Evaluation and Research in Education 14(3–4): 206–18.
Pitkin, Roy M., and Mary Ann Branagan. 1998. "Can the Accuracy of Abstracts Be Improved by Providing Specific Instructions? A Randomized Controlled Trial." Journal of the American Medical Association 280(3): 267–69.
Ravnskov, Uffe. 1992. "Cholesterol Lowering Trials in Coronary Heart Disease: Frequency of Citation and Outcome." British Medical Journal 305(6844): 15–19.
Rosenthal, Marylu C. 1994. "The Fugitive Literature." In The Handbook of Research Synthesis, edited by Harris Cooper and Larry V. Hedges. New York: Russell Sage Foundation.
Rothstein, Hannah R. 2006. "Use of Unpublished Data in Systematic Reviews in the Psychological Bulletin 1995–2005." Working paper.
Rothstein, Hannah R., Alexander J. Sutton, and Michael Borenstein. 2005. Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments. Chichester: John Wiley & Sons.
Rothstein, Hannah R., Herbert M. Turner, and Julia Lavenberg. 2004. Information Retrieval Policy Brief. Philadelphia, Pa.: International Campbell Collaboration.
Scherer, Roberta W., Patricia Langenberg, and Erik von Elm. 2005. "Full Publication of Results Initially Presented in Abstracts." Cochrane Database of Methodology Reviews 2005(2): MR000005.
Schmidmaier, Dagmar. 1986. "Ask No Questions and You'll Be Told No Lies: Or How We Can Remove People's Fear of Grey Literature." Librarian 36(2): 98–112.
Smith, Mary Lee. 1980. "Sex Bias in Counseling and Psychotherapy." Psychological Bulletin 87(2): 392–407.
Song, Fujian, Alison J. Eastwood, Simon Gilbody, Leila Duley, and Alexander J. Sutton. 2000. "Publication and Related Biases." Health Technology Assessment 4(10): 1–115.
Swanson, H. Lee, and Carole Sachse-Lee. 2000. "A Meta-Analysis of Single-Subject-Design Intervention Research for Students with LD." Journal of Learning Disabilities 33(2): 114–36.
Swanson, H. Lee. 1999. "Reading Research for Students with LD: A Meta-Analysis of Intervention Outcomes." Journal of Learning Disabilities 32(6): 504–32.
Tetzlaff, Jennifer, David Moher, Ba Pham, and Douglas Altman. 2006. "Survey of Views on Including Grey Literature in Systematic Reviews." Presented at the XIV Cochrane Colloquium. Dublin (October 23–26, 2006).
Tonks, Alison. 2002. "A Clinical Trials Register for Europe." British Medical Journal 325(7376): 1314–15.
Tramèr, Martin R., D. John M. Reynolds, R. Andrew Moore, and Henry J. McQuay. 1997. "Impact of Covert Duplicate Publication on Meta-Analysis: A Case Study." British Medical Journal 315(7109): 635–40.
Turner, Herbert M., Robert Boruch, Julian Lavenberg, J. Schoeneberger, and Dorothy de Moya. 2004. "Electronic Registers of Trials." Presented at the Fourth Annual Campbell Collaboration Colloquium, A First Look at the Evidence. Washington, D.C. (February 18–20, 2004).
Wade, Anne, Herbert M. Turner, Hannah R. Rothstein, and Julia G. Lavenberg. 2006. "Information Retrieval and the Role of the Information Specialist in Producing High-Quality Systematic Reviews in the Social, Behavioral and Education Sciences." Evidence and Policy: A Journal of Research, Debate and Practice 2(1): 89–108.
Webb, Thomas L., and Paschal Sheeran. 2006. "Does Changing Behavioral Intentions Engender Behavior Change? A Meta-Analysis of the Experimental Evidence." Psychological Bulletin 132(2): 249–68.
Weber, Ellen J., Michael L. Callaham, Robert L. Wears, Christopher Barton, and Gary Young. 1998. "Unpublished Research from a Medical Specialty Meeting: Why Investigators Fail to Publish." Journal of the American Medical Association 280(3): 257–59.
Weintraub, Irwin. 2000. "The Role of Grey Literature in the Sciences." http://library.brooklyn.cuny.edu/access/greyliter.htm.
Weisz, John R., Carolyn A. McCarty, and Sylvia M. Valeri. 2006. "Effects of Psychotherapy for Depression in Children and Adolescents: A Meta-Analysis." Psychological Bulletin 132(1): 132–49.
Weisz, John R., Bahr Weiss, Susan S. Han, Douglas A. Granger, and Todd Morton. 1995. "Effects of Psychotherapy with Children and Adolescents Revisited: A Meta-Analysis of Treatment Outcome Studies." Psychological Bulletin 117(3): 450–68.
7

JUDGING THE QUALITY OF PRIMARY RESEARCH

JEFFREY C. VALENTINE
University of Louisville

CONTENTS
7.1 Introduction 130
7.1.1 Defining Study Quality 130
7.1.2 Campbell's Validity Framework 130
7.1.3 Relevance Versus Quality 131
7.1.4 Role of Study Design 131
7.1.5 Study Quality Versus Reporting Quality 132
7.2 Options for Addressing Study Quality in a Research Synthesis 132
7.2.1 Community Preventive Services 132
7.2.2 What Works Clearinghouse 133
7.2.3 Social Care Institute for Excellence 133
7.3 Specific Elements of Study Quality 134
7.3.1 Internal Validity 134
7.3.2 Construct Validity 134
7.3.3 External Validity 134
7.3.4 Statistical Conclusion Validity 135
7.4 A Priori Strategies 135
7.4.1 Excluding Studies 135
7.4.1.1 Attrition 136
7.4.1.2 Reliability 136
7.4.2 Quality Scales 137
7.4.3 A Priori Strategies and Missing Data 138
7.4.4 Summary of A Priori Strategies 139
7.5 Post Hoc Assessment 139
7.5.1 Quality As an Empirical Question 139
7.5.2 Empirical Uses 139
7.5.3 Post Hoc Strategies and Missing Data 140
7.5.4 Summary of Post Hoc Assessment 140
7.6 Recommendations 140
7.6.1 Coding Study Quality 141
7.6.2 The Relevance Screen 144
7.6.3 Example Coding 144
7.7 Conclusion 144
7.8 References 144
7.1 INTRODUCTION

Empirical studies vary in terms of the rigor with which they are conducted, and the quality of the studies comprising a research synthesis can have an impact on the validity of conclusions arising from that synthesis. However, the wide agreement on these points belies a fundamental tension between two very different approaches in research synthesis to addressing study quality and its impact on the validity of study conclusions. These approaches are the focus of this chapter. In the sections that follow I provide a definition for the term study quality, discuss two approaches for addressing study quality in a research synthesis, and conclude with specific advice for scholars undertaking a synthesis.

7.1.1 Defining Study Quality

One issue in applying the term quality to a study is that the word can have several distinct meanings. As William Shadish pointed out, asking a university administrator to define the quality of a piece of research might lead to the use of citation counts as an indicator (1989). Asking a school teacher the same question might lead one to examine whether the study resulted in knowledge that had clear implications for practice. As such, the answer to the question "What are the characteristics of a high quality study?" depends in part on why the judgment is being made. In this chapter, I use quality to refer to the fit between a study's goals and the study's design and implementation characteristics. As an example, if a scholar would like to know if the beliefs teachers hold about the academic abilities of their students (that is, teacher expectancies) can cause changes in the IQ of those students, then the best way to approach this question is through a well-implemented experiment in which some teachers are randomly assigned to experience change in their expectations about a student and other teachers are randomly assigned to not experience such change. All else being equal, the closer a study on this question approaches the ideal of a well-implemented randomized experiment, the higher its quality.

7.1.2 Campbell's Validity Framework

In the social sciences, perhaps the most widely used system guiding thinking about the quality of primary studies is that articulated by Donald Campbell and his colleagues (Campbell and Stanley 1966; Cook and Campbell 1979; Shadish, Cook, and Campbell 2002). In the Campbell tradition, validity refers to the approximate truth of an inference (or claim) about a relationship (Cook and Campbell 1979; Shadish, Cook, and Campbell 2002). Although it is commonly stated that an inference is valid if the evidence on which it is based is sound and invalid if that evidence is flawed, the judgment about validity is neither dichotomous nor one-dimensional, features that have important implications for the manner in which validity is used to assess study quality. Instead of being dichotomous and unidimensional, different characteristics of study design and implementation lead to inferences that are more or less valid along one or more dimensions. Factors that might lead to an incorrect inference are termed threats to validity or plausible rival hypotheses. Threats to validity can be placed into four broad categories or dimensions: internal validity, external validity, construct validity, and statistical conclusion validity.
Internal validity refers to the validity of inferences about whether some intervention has caused an observed outcome. Threats to internal validity include any mechanism that might plausibly have caused the observed outcome even if the intervention had never occurred. External validity refers to how widely a causal claim can be generalized from the particular realizations in a study to other realizations of interest. That is, even if we can be relatively certain that a study is internally valid (meaning we can be relatively certain that observed changes in the outcome variable were caused by the intervention), we still don't know if the results of the study will apply, for example, to different people in different contexts using different treatment variations with outcomes measured in different ways. Construct validity refers to the extent to which the operational characteristics of interventions and outcome measures used in a study adequately represent the intended abstract categories. Researchers most often think of construct validity in terms of the outcome measures. It also refers, however, to the adequacy of other labels used in the study, such as the intervention and the labels applied to participants (for example, at-risk). A measure that purports to measure intelligence but actually measures academic achievement, for example, is mislabeled and hence presents a construct validity problem. Finally, statistical conclusion validity refers to the validity of statistical inferences regarding the strength of the relationship between the presumed cause and the presumed effect. Assume, for example, that a researcher manipulates teacher expectancies and, after conducting an analysis of variance, infers that if a teacher alters her academic expectancy about a student to be more positive it causes that student to achieve at higher levels. If the data that led to this inference did not meet the assumptions underlying the analysis of variance, then the inference may be compromised.

7.1.3 Relevance Versus Quality

From the outset, it is important to draw a distinction between relevance and quality. Relevance refers to study features that pertain to the decision about whether to include a given study in a research synthesis. These features may or may not relate to the fit between a study's goals and its methods used to achieve those goals. As an example, assume a team of scholars is interested in the impact of teacher expectancies on the academic achievement of low socioeconomic status (SES) students. If the researchers were to encounter a study that focuses on wealthy students, they would not include that study in their synthesis because its focus on wealthy students made it irrelevant to their goal. This would be true regardless of fit between the study's goals and the methods used to achieve those goals, and as such researchers should be careful not to frame it as a quality issue.

There is, however, a sense in which relevance and quality are intertwined. These issues will usually be encountered when considering the operational definitions that were employed in the studies that might be relevant to the research synthesis, and hence largely address issues of construct validity. As an example, David DuBois and his colleagues operationally defined mentoring, in part, as an activity that occurs with an older mentor and a single younger mentee, that was meant to focus on positive youth development in general, and to occur over a relatively long period (2002). This operational definition of mentoring was grounded in theoretical work in the area of youth development. During the literature search, the researchers found a study in which an adult was called a mentor; this adult met with children a few times in groups of two or three, and focused on helping them with their homework. Because of these characteristics, the activity the authors described (even though they used the term mentoring) did not fit the definition the synthesists used. This study is an example of how relevance and quality decisions sometimes cannot be separated. Ultimately, it was not included in the synthesis because it was deemed irrelevant. However, this decision could be labeled a quality judgment because the original study's authors used the term in a way that was not consistent with the common usage among youth development scholars, and hence presented a definitional (that is, construct validity) problem.

7.1.4 Role of Study Design

Before collecting data, research synthesists should decide what kinds of evidence will be included in the synthesis. The answer to this question is often complex, necessitating a thoughtful consideration of the trade-offs involved in each type of design the team is likely to encounter. For example, researchers interested in synthesizing the results of multiple political polls must decide whether to exclude all studies that did not use a random sampling technique, because in the absence of random sampling it is not clear whether the sample can be expected to be similar to the population (an external validity issue). Similarly, when random assignment is used to allocate participants to study groups in studies of the causal effects of an intervention, the groups are expected to be equivalent on all dimensions (an internal validity issue). As such, researchers synthesizing the effects of an intervention must decide whether to exclude all studies that did not use random assignment.
Empirical evidence also suggests that not using randomization when it would be appropriate to do so can lead to biased results. In medicine, for example, a body of evidence suggests that studies relying on nonrandom allocation methods are positively biased relative to random allocation. Thomas Chalmers and his colleagues, for example, found that studies using nonrandom allocation were about three times more likely to have at least one baseline variable that was not equally distributed across study groups; baseline variables also were about two times more likely to favor the treatment group (1983). These findings are highly suggestive of an assignment process that stacked the deck in favor of finding treatment effects. Taking a somewhat different approach, Graham Colditz, James Miller, and Frederick Mosteller found, through a synthesis of studies reported in medical journals comparing a new treatment to a standard treatment, that studies using a nonrandom sequential allocation scheme yielded treatment effects that were modestly larger (that is, favored the new treatment) than those using random assignment (1989). In a separate study, this time looking specifically at surgical interventions, they again found a small bias favoring nonrandomized relative to randomized trials (Miller, Colditz, and Mosteller 1989).

For synthesists working on research questions that would be appropriate for randomization, there is no easy blanket answer to the question: Should we exclude all studies that do not employ randomization? Relatively little empirical work has been done to investigate the specific selection mechanisms operating in different contexts. Further, the answer to this question probably depends heavily on the context of the research question and the types of research designs that have been used to address it. It is possible, for example, that in some contexts nonrandom selection mechanisms leading to an overestimate of the population parameter in one study might be balanced by different nonrandom selection mechanisms leading to an underestimate of the population parameter in another study (Glazerman, Levy, and Myers 2003; Lipsey and Wilson 1993). It is also possible that selection mechanisms will operate to relatively consistently over- or underestimate (bias) the population parameter (for example, Shadish and Ragsdale 1996). Specific recommendations for addressing this problem are presented in section 7.6 of this chapter.

7.1.5 Study Quality Versus Reporting Quality

A final consideration is the need to draw a distinction between study quality and reporting quality. In many studies, information vital to a research synthesist is not present. Study authors may fail to include important estimates of measurement reliability and validity, for example. It could also be that the authors of the study simply thought the information was irrelevant. Perhaps they chose not to present the information thinking that it revealed a weakness of the study. It is also possible that the information was intended for inclusion but eliminated by an editor due to space considerations. It is not always clear how a synthesist should proceed in these situations. Regardless, there is no escaping the reality that when synthesists attend to study quality issues they will find some of the information they want is missing, and they will not know if this is because of poor study quality or poor reporting, or perhaps both. Requesting additional information from study authors should be the first approach to dealing with this problem.

7.2 OPTIONS FOR ADDRESSING STUDY QUALITY IN A RESEARCH SYNTHESIS

Generally, researchers undertaking a synthesis have used a blend of two strategies for addressing study quality: a priori strategies that involve setting rules for determining either the kinds of evidence that will be included or how different evidence will be weighted, and post hoc strategies that treat the impact of study quality as an empirical question.

The importance of how to address quality in research synthesis is indicated by the number of organizations that have become involved in setting standards for studies investigating causal relations. A thorough description of these is beyond the scope of this chapter, but three useful illustrations are the work of the Task Force on Community Preventive Services (Briss et al. 2000; Zaza et al. 2000) as evident in the Guide to Community Preventive Services, the standards of evidence established by the What Works Clearinghouse (2006), and the quality assessment criteria established by the Social Care Institute for Excellence (Coren and Fisher 2006). I discuss each of these in turn.

7.2.1 Community Preventive Services

The Task Force on Community Preventive Services developed methods and standards to be used in evaluating and synthesizing research in community prevention (Briss et al. 2000; Zaza et al. 2000). For studies that pass a design suitability screen, the Task Force set up a three-tiered grading system that characterizes the quality of the study execution; the categories are good, fair, and limited.
Table 7.1 Quality Assessment for Guide to Community Preventive Services

Quality Category: Example Items

Descriptions (study population and intervention): Was the study population well described?
Sampling: Did the authors specify the sampling frame or universe of selection for the study population?
Measurement (exposure): Were the exposure variables valid measures of the intervention under study?
Measurement (outcome): Were the outcome and other independent (or predictor) variables reliable (consistent and reproducible) measures of the outcome of interest?
Data analysis: Did the authors conduct appropriate analysis by conducting statistical testing (where appropriate)?
Interpretation of results (participation): Did at least 80% of enrolled participants complete the study?
Interpretation of results (comparability and bias): Did the authors correct for controllable variables or institute study procedures to limit bias appropriately?
Interpretation of results (confounders): Describe all potential biases or unmeasured/contextual confounders described by the authors.
Other: Other important limitations of the study not identified elsewhere?
Study design: Concurrent comparison groups and prospective measurement of exposure and outcome

source: Author's compilation; Zaza et al. 2000.
note: Adequacy of study design is assessed separately.
Studies revealing zero to one limitation are characterized as good, two to four as fair, and five to nine as limited. Neither those that fail the design screen nor those judged as limited are used to develop recommendations, hence are not used in the synthesis that guides the development of recommendations. The nine dimensions relate to a variety of aspects of study design and implementation, and are presented in table 7.1.
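Restated for concreteness, the Task Force's three-tier execution grading reduces to a simple thresholding rule. The sketch below expresses it in Python; the function name and example counts are mine, not the Task Force's.

    # The three-tier execution grading described above: zero to one
    # limitation is "good," two to four "fair," five to nine "limited."
    # Function name and test values are illustrative only.
    def execution_grade(n_limitations: int) -> str:
        if n_limitations <= 1:
            return "good"
        if n_limitations <= 4:
            return "fair"
        return "limited"

    for n in (0, 3, 7):
        print(n, "limitations ->", execution_grade(n))  # good, fair, limited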
7.2.2 What Works Clearinghouse

The What Works Clearinghouse (WWC) developed standards to be used in evaluating and synthesizing research in education (2006). To assess the quality of interventions, the WWC set up a three-tiered grading system that characterizes the quality of study execution. The categories are meets evidence standards, meets evidence standards with reservations, and does not meet evidence standards. To be included in a WWC review, studies must first use a research design appropriate for studying causal relations, that is, a randomized experiment, a quasi-experiment with equating, a regression discontinuity design, or an appropriate single case design. They must not experience excessive loss of participants (that is, attrition), either overall or differential. They also must not confound teacher effects with intervention effects, must use equating procedures judged to have a reasonable chance of equating the groups, and must not experience an event that contaminates the comparison between study groups.

7.2.3 Social Care Institute for Excellence

The Social Care Institute for Excellence (SCIE) commissions systematic reviews with the aim of improving the quality of experiences by individuals who use social care. To address study quality, SCIE established a framework of quality dimensions, although synthesists may choose to use a different system. The dimensions addressed in SCIE's system are relevance of study particulars, appropriateness of study design, quality of reporting and analysis, and generalizability.
Unlike the Task Force on Community Preventive Services and the WWC, SCIE does not screen out evidence based on these scores. Rather, SCIE requires synthesists to weight study results by their quality scores.
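As a rough illustration of what weighting by quality scores involves, consider a generic quality-weighted average. This is a minimal sketch, not SCIE's specific procedure, and the effect sizes and weights are invented:

    # Generic quality-weighted mean of study effect sizes; not SCIE's
    # actual algorithm. All numbers are hypothetical.
    effect_sizes   = [0.40, 0.10, 0.25]   # study-level effect sizes
    quality_scores = [0.9, 0.3, 0.6]      # quality weights on a 0-1 scale

    weighted_mean = (
        sum(q * d for q, d in zip(quality_scores, effect_sizes))
        / sum(quality_scores)
    )
    print(f"Quality-weighted mean effect size: {weighted_mean:.2f}")  # 0.30

Note that under such a scheme a low-quality study still contributes to the estimate; it simply counts for less.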
7.3 SPECIFIC ELEMENTS OF STUDY QUALITY

The specific elements of study quality that will require the attention of the research synthesist will of course vary from topic to topic. However, regardless of the research question, Campbell's validity framework provides a convenient way of thinking about the dimensions of quality that ought to be considered. Many of the considerations outlined are represented in at least one of the three systems for evaluating study quality just identified.

7.3.1 Internal Validity

By definition, internal validity is only a concern in studies in which the goal is to investigate causal relations. Several different mechanisms are potentially plausible threats to validity in these studies. Section 7.1.4 already addressed the important role of random assignment to conditions. In addition, information about attrition (data loss and exclusions; see section 7.4.1) and crossover (participants change study conditions after random assignment) should be gathered from each study. In some cases, studies might report conducting an intention-to-treat analysis to address problems of attrition and crossover. Ideally, in an intent-to-treat analysis all participants are included in the analysis in their originally assigned conditions, regardless of attrition or crossover. Unfortunately, there appears to be little consistency in the use of the term intention-to-treat analysis (Hollis and Campbell 1999), so synthesists need to carefully examine reports that use the term to determine what exactly was done.
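The difference between analyzing participants by original assignment and by the condition they actually received can be made concrete with a toy data set. In this hedged sketch, the group labels and outcome values are invented purely for illustration:

    # Toy contrast between intention-to-treat (analyze by original
    # assignment) and as-treated (analyze by condition actually
    # received). The second participant crossed over to control.
    participants = [
        # (assigned, received, outcome)
        ("treatment", "treatment", 12.0),
        ("treatment", "control",    5.0),  # crossover
        ("control",   "control",    6.0),
        ("control",   "control",    4.0),
    ]

    def mean(values):
        return sum(values) / len(values)

    itt = mean([y for assigned, _, y in participants if assigned == "treatment"])
    as_treated = mean([y for _, received, y in participants if received == "treatment"])

    print(f"Intention-to-treat treatment mean: {itt:.1f}")         # 8.5
    print(f"As-treated treatment mean:         {as_treated:.1f}")  # 12.0

The two analyses can yield quite different treatment means from the same data, which is why a report's bare use of the label intention-to-treat is not informative on its own.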
In studies that do not use random assignment to conditions, synthesists should carefully attend to the baseline comparability of groups and to any procedures used by the researchers to attempt to equate groups. They should also be aware of increased risk for history, maturation, statistical regression, and testing threats to internal validity inherent in these designs (see Shadish, Cook, and Campbell 2002). Unfortunately, studies in the social sciences often report little information on these threats. Nonetheless, the synthesist should seriously consider how these threats might reveal themselves in the context of the research question, and should read and code studies with the intention of identifying these problems when they exist.

In the medical sciences, recent attention has been paid to the concealment of the allocation schedule from the individuals enrolling participants into trials as an integral part of protecting a random allocation scheme. Kenneth Schulz and his colleagues, for example, found that treatment effects were exaggerated by about one-third in studies with unclear or inadequate concealment relative to experiments in which the allocation schedule was protected (1995). This aspect of study quality has received virtually no attention in the social sciences, however.

7.3.2 Construct Validity

Construct validity pertains to the adequacy of the labels used in a study and, as such, is relevant to virtually all studies. For outcome measures, synthesists should attend to indices of score reliability (see section 7.4.1.2) and validity (here, the extent to which a measure actually measures what it is supposed to), as well as the method used to assess the construct (for example, self-report vs. parent-report of grades). Synthesists should also attend to the labels applied to study participants. Some of these will be relatively straightforward (for example, age and sex), but many will not. For example, if a team of synthesists is interested in investigating the effects of teacher expectancies on students labeled at-risk, they will likely find a wide range of methods that study authors have used to assign risk status; some of these methods may be more valid than others.

Finally, synthesists examining the effects of an intervention should attend to several aspects of the intervention. This might include, for example, an analysis of the extent to which the intervention as it was implemented matches the intervention's label. Synthesists might also determine whether intervention fidelity was monitored and if so, how faithfully the intervention was judged to have been implemented.

7.3.3 External Validity

A sampling framework in which units are randomly chosen from a carefully specified population provides the strongest basis for making inferences about the generalizability of study findings. Unfortunately, except in survey research, randomly selecting a sample from a population of potential participants is relatively rare across the social and medical sciences, and even when random sampling is used it rarely extends to study characteristics beyond the individuals or other units of interest.
These characteristics mean that unless the synthesist is working in an area in which random sampling occurs frequently, study quality concerns related to external validity receive relatively less attention. However, synthesists should carefully consider and code elements of the study that might help readers make rational and empirically based decisions about the likely external validity of the findings of both individual studies and the research synthesis (see chapter 28, this volume).

7.3.4 Statistical Conclusion Validity

In a research synthesis, the results of individual studies are represented as effect sizes (see chapters 12 and 13, this volume). Effect sizes are not based on the probability values arising from the results of statistical tests in those individual studies, and as such research synthesists are usually only interested in some of the threats to statistical conclusion validity usually identified in the Campbell tradition. For example, both low statistical power and too many statistical tests can threaten the validity of statistical conclusions at the individual study level. However, these are usually not of primary interest to the research synthesist because they respectively affect type II and type I error rates but do not bias effect sizes themselves.

Other threats to statistical conclusion validity can affect the estimation of the magnitude of the study's effects or the stated precision with which the effects are estimated, or both. For example, violating the assumptions underlying the estimation of an effect size can lead to either an overestimate or an underestimate of the population effect. As such, synthesists should code studies for evidence pertaining to the assumptions underlying the effect size metric they are employing. Generally, however, violations of the equinormality assumption (that is, populations that are normally distributed and have equal variances) do not have a great impact on effect size estimation, although the potential for problems becomes larger as sample size decreases (Fowler 1988). A violation of the assumption of independence can result in a sometimes dramatic overstatement of the precision of the effect size estimate, even when the degree of violation is relatively small. As such, possible multilevel data structures (for example, students in classrooms and classrooms in school buildings) need to be investigated in every study (Hedges 2007; Murray 1998).
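One way to see how even modest dependence overstates precision is through the standard design effect for clustered data. This is a minimal sketch, assuming a two-level structure with equal cluster sizes and using the usual variance-inflation formula; the sample values are hypothetical:

    # Design effect for a two-level structure (for example, students
    # nested in classrooms): deff = 1 + (m - 1) * icc, where m is the
    # cluster size and icc the intraclass correlation. Values invented.
    def design_effect(cluster_size: int, icc: float) -> float:
        return 1 + (cluster_size - 1) * icc

    n_total = 500   # total students
    m = 25          # students per classroom
    icc = 0.15      # modest intraclass correlation

    deff = design_effect(m, icc)
    print(f"Design effect: {deff:.2f}")                    # 4.60
    print(f"Effective sample size: {n_total / deff:.0f}")  # 109

Even with an intraclass correlation as small as .15, the 500 nominally independent observations carry roughly the information of about 109 independent ones, so an analysis that ignores the clustering will report standard errors that are far too small.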
Further, selective outcome reporting can lead to biased estimation of effect size (see also chapter 6, this volume). As an example, assume researchers measure IQ two ways but report only results for the one IQ measure that yielded statistically significant results. This will in all likelihood overestimate the magnitude of the effect of the intervention on the construct of IQ. When effects cannot be estimated for measured outcomes, synthesists should make every effort to retrieve the information from study authors. Unfortunately, authors often do not exhaustively report the outcomes they measured, so to some extent this problem is likely to be invisible to synthesists (Williamson et al. 2005). Increasing use of prospective research registers should help synthesists diagnose the problem more effectively in the future (see chapter 6, this volume).

7.4 A PRIORI STRATEGIES

7.4.1 Excluding Studies

As evident in the quality assessment frameworks established by the Task Force on Community Preventive Services and the What Works Clearinghouse, one common approach to addressing study quality in a research synthesis is simply to exclude studies that have certain characteristics. Returning, as an example, to the random assignment of participants to conditions, synthesists who have relevant evidence or are operating under the assumption that a lack of random assignment will bias a body of evidence may choose to exclude all nonrandomized comparisons. Less extreme, other synthesists may exclude studies that rely on a within-groups comparison (for example, one group pretest-posttest designs) to estimate the treatment effect but include studies that use intact groups and attempt to minimize selection bias through matching or statistical controls. Other characteristics sometimes used to exclude studies are small sample size, high attrition rates, poor measurement characteristics, publication status (for example, include only published studies), and poor intervention implementations.

Ideally, researchers will attend to three concerns about their adopted rules. First, they need to adopt their rules before examining the data. Obviously, adopting a rule for excluding evidence after the data have been collected increases the likelihood (and perception) that the decision has been made to exclude studies that did not support the synthesists' existing beliefs and preferences. To the extent possible, therefore, synthesists should avoid making these types of decisions after the outcomes of studies are known. If the need for such a rule becomes apparent after the data have been collected, the synthesists should be clear about the timing of and rationale for the decision.
Further, they should consider conducting a sensitivity analysis to test the implications of the decision rule (see chapter 22, this volume).

Second, researchers should operationally define their exclusion rules. For example, it is not enough to state that studies with too much attrition were excluded, because readers have no way of knowing how much attrition is too much, and as such cannot judge the adequacy of this rule for themselves. Further, absent an operational definition of too much attrition, it becomes less likely that the individuals coding studies for the synthesis will consistently apply the same rule to all studies, leading to study ratings (and hence, inclusion decisions) that have less agreement than is desirable. So, for example, researchers wishing to exclude studies with too much attrition might operationally define too much as more than 30 percent loss from the original sample.

Finally, synthesists should provide empirical or theoretical evidence that their rules function to remove bias from the body of studies or otherwise improve the interpretability of the results of the synthesis. Regrettably, empirical evidence rarely exists in many research contexts and for most characteristics of study design and implementation. This creates a problem for synthesists wishing to employ a priori rules. In the absence of empirical evidence, the belief that the rules adopted serve to increase the validity of the findings has little basis.

If more attention were paid to the last two requirements (operationally defining the exclusion criteria and providing empirical evidence for them), fewer a priori exclusion criteria would be adopted. The process of operationally defining features of study design and implementation that will lead to exclusion highlights the arbitrary nature of such rules, and the lack of empirical evidence justifying them. To illustrate the difficulties in more detail, I discuss two dimensions on which evidence might be excluded: attrition and reliability.

7.4.1.1 Attrition. Attrition occurs when some portion of participants who begin a study (for example, are randomly assigned to treatment conditions) are not available for follow-up measurements. A distinction can be made between overall attrition, which refers to the total attrition across study groups, and differential attrition, which occurs when the characteristics of the dropouts differ across study groups. (Note that researchers often use differences in dropout rates as a proxy for differences in participant characteristics.) Therefore, synthesists wishing to use attrition as a basis for exclusion must first decide whether they mean overall attrition, differential attrition, or both. Further, the timing of attrition may matter. For example, assume that some loss of participants occurs after participants have been identified for the study but before they have been randomly assigned to conditions, because some identified participants decline to be part of the experiment. This loss has implications for external validity (results can be generalized only to participants like those who have agreed to participate), but not for internal validity (because only participants who consented were randomized). As such, if a synthesist is screening studies for attrition because of a concern about the effects of attrition on internal validity, that synthesist would need to be careful not to exclude studies that experienced attrition before assignment to conditions occurred. More generally, the timing of the participant loss is likely to be important in many research contexts.

Next, the synthesists have to decide how much attrition is too much, that is, the point at which nonzero attrition leads to exclusion. Setting a specific point is arbitrary and likely to lead to instances in which the decision rule produces unsatisfactory results. For example, suppose a synthesist sets 30 percent or greater overall attrition as the criterion for excluding studies. If one study lost 29 percent of its participants it would be included. If a second study lost 30 percent of its participants it would be excluded. However, most scholars would agree that these two studies experienced qualitatively indistinct degrees of overall attrition.
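Coded as a rule, the knife-edge behavior of such a cutoff is plain. The sketch below simply mechanizes the hypothetical 30 percent criterion and study attrition rates from the example above:

    # The 30 percent cutoff from the example above, applied mechanically.
    CUTOFF = 0.30

    def include(overall_attrition: float) -> bool:
        return overall_attrition < CUTOFF

    for study, rate in [("Study 1", 0.29), ("Study 2", 0.30)]:
        status = "included" if include(rate) else "excluded"
        print(f"{study}: {rate:.0%} attrition -> {status}")

A one percentage point difference in attrition flips the inclusion decision, even though no one would argue the two studies differ meaningfully in quality.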
The essential problem facing synthesists who wish to use attrition as an exclusion criterion is that an informed judgment on where to place the cutoff requires a reasonable guess about the treatment outcome that would have been observed for the individuals who dropped out of the study had they remained in the study. This information is not knowable in most contexts. Therefore, in most contexts, little empirical evidence exists to help guide the decision about where to place the cutoff. For example, Jeffrey Valentine and Cathleen McHugh examined the impact of attrition in studies in education, mostly studies that took place in classrooms (2007). Although the findings were complex, these researchers found no evidence to justify a specific cutoff to guide exclusion decisions.

7.4.1.2 Reliability. Like attrition, scholars will sometimes only include a study if the outcome measure is reliable, or generally consistent in measurement. This criterion reflects a fundamental misunderstanding about the nature of reliability. Reliability is a property of scores, not of measures, and therefore depends on sample and context (Thompson 2002).
ability estimates with one sample might yield poor ones determine what evidence gets included. Unfortunately,
with a sample from a different population, or even with quality scales generally have little to recommend them,
the same sample under different testing conditions. This largely because they fail the two tests just outlined. That
misunderstanding is evident when researchers write that is, they generally lack adequate operational specificity
the measure has been shown to have acceptable reliability, and are based on criteria that have no empirical support.
provide a citation, and then fail to examine the reliability In a demonstration of these problems, Peter Jni and his
of the scores obtained. This error is further compounded colleagues applied twenty-five quality scales to seventeen
when synthesists set a threshold for reliability, then rely studies reporting on trials comparing the effects of low-
on the citation to help them determine whether they molecular weight heparin (LMWH) to standard heparin
should include the measure in their synthesis. Further, as the risk of developing blood clots after surgery (1999).
with attrition, little empirical evidence supports a rule in The authors then performed twenty-five meta-analyses,
which a precise threshold for acceptable reliability is es- examining in each case the relationship between study
tablished. quality and the effect of LMWH (relative to standard
When synthesists use score reliability as an exclusion heparin). They then examined the conclusions of the
criterion, their concern might be about validity. In the meta-analyses separately for high and low quality trials.
context of measurement, validity is the extent to which a For six of the scales, the high quality studies suggested
measure actually measures what it is supposed to. As an no difference between LMWH and standard heparin, but
example, validity addresses the question of whether the the low quality studies suggested a significant positive
test is a good measure of intelligence. Unfortunately, reli- effect for LMWH. For seven other quality scales, this pat-
ability and validity are not necessarily related. Although tern was reversed. That is, the high quality studies sug-
measurement reliability does set a limit on the size of the gested a positive effect for LMWH, and the low quality
correlation that can be observed between two measures, it studies suggested no difference. The remaining twelve
is possible for scores to be reliable on a measure that has scales showed no differences. Thus, Jni and his col-
no validity, or to have no reliability using a particular reli- leagues suggested that the clinical conclusion about the
ability metric for a measure that is perfectly valid. An efficacy of the two types of heparin depended on the qual-
example of the latter would be demonstrated by a 100- ity scale used.
item questionnaire with no internal consistency but per- The dependence between the outcomes of the synthe-
fect validity if each item had a zero correlation with every sis and the quality scale used reveals a critical problem
other item but a correlation of r  .10 with the construct with the scales. A team of scholars could have chosen
(Guilford 1954). As a somewhat less extreme example, quality scale A, used it to exclude low quality studies,
suppose that students in a pharmacology class are given a and then concluded that LMWH is effective. Starting
test of content knowledge at the beginning of the semes- with the same research question, another team could have
ter and at the end of the semester. Assume that on the chosen scale B, used it to exclude low quality studies,
pretest the students know nothing about pharmacology then concluded that LMWH is no more effective than
(their responses are, in effect, random) but on the posttest standard heparin. As such, the scales appear to have been
they all receive scores in the passing range, that is, 70 to at best useless and at worst misleading (Berlin and Ren-
100 percent correct. Test-retest reliability would suggest nie 1999).
very little consistency. However, the test itself would Examination of the quality scales used makes it easy to
seem to be measuring content taught during the course understand why this result occurred. The twenty-five
(indicating good validity). Even though these examples scales often focused on different dimensions or aspects of
appear relatively implausible, they illustrate the point that research design and implementation (for example, some
reliability and validity are not necessarily related; there- focused on internal validity alone and others focused on
fore, a specific reliability coefficient should not be viewed all four classes of validity), and analyzed a number of
as a proxy for validity. dimensions (from three to thirty-four), including dimen-
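The arithmetic behind Guilford's example can be made explicit. Under the idealized assumptions of that example (unit-variance items, exactly zero inter-item correlations, and an item-construct correlation of r = .10; these assumptions are stipulations made here to formalize the sketch), coefficient alpha and the validity of the k = 100 item composite work out as follows:

    \alpha = \frac{k\,\bar{r}}{1 + (k - 1)\,\bar{r}} = 0 \quad \text{when } \bar{r} = 0,

    \operatorname{corr}\Big(\sum_{i=1}^{k} X_i,\; C\Big)
      = \frac{k\, r_{iC}}{\sqrt{k}} = \sqrt{k}\, r_{iC}
      = \sqrt{100} \times .10 = 1.0.

Internal consistency is zero while the composite correlates perfectly with the construct. (This is a boundary case, possible only when k times the squared item-construct correlation is at most 1, which is why the example is extreme but not incoherent.)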
7.4.2 Quality Scales

As an alternative to establishing various a priori exclusion rules, some scholars have used quality scales to determine what evidence gets included. Unfortunately, quality scales generally have little to recommend them, largely because they fail the two tests just outlined. That is, they generally lack adequate operational specificity and are based on criteria that have no empirical support. In a demonstration of these problems, Peter Jüni and his colleagues applied twenty-five quality scales to seventeen studies reporting on trials comparing the effects of low-molecular-weight heparin (LMWH) to standard heparin on the risk of developing blood clots after surgery (1999). The authors then performed twenty-five meta-analyses, examining in each case the relationship between study quality and the effect of LMWH (relative to standard heparin). They then examined the conclusions of the meta-analyses separately for high and low quality trials. For six of the scales, the high quality studies suggested no difference between LMWH and standard heparin, but the low quality studies suggested a significant positive effect for LMWH. For seven other quality scales, this pattern was reversed. That is, the high quality studies suggested a positive effect for LMWH, and the low quality studies suggested no difference. The remaining twelve scales showed no differences. Thus, Jüni and his colleagues suggested that the clinical conclusion about the efficacy of the two types of heparin depended on the quality scale used.

The dependence between the outcomes of the synthesis and the quality scale used reveals a critical problem with the scales. A team of scholars could have chosen quality scale A, used it to exclude low quality studies, and then concluded that LMWH is effective. Starting with the same research question, another team could have chosen scale B, used it to exclude low quality studies, then concluded that LMWH is no more effective than standard heparin. As such, the scales appear to have been at best useless and at worst misleading (Berlin and Rennie 1999).

Examination of the quality scales used makes it easy to understand why this result occurred. The twenty-five scales often focused on different dimensions or aspects of research design and implementation (for example, some focused on internal validity alone and others focused on all four classes of validity), and analyzed a number of dimensions (from three to thirty-four), including dimensions unrelated to validity in the traditional sense (for example, whether an institutional review board approved the study). Even when the same dimension was analyzed, the weights assigned to the dimension varied.
For example, one scale allocated 4 percent of its total points to the presence of random assignment, and 12 percent of the points to whether the outcome assessor was unaware of the condition the participants were in (called masking). Another scale very nearly reversed the relative importance of these two dimensions, allocating 14 percent of the total points to randomization and 5 percent to masking. In sum, there is no empirical justification for these weighting schemes, and the result in the absence of that justification is a hopelessly confused set of arbitrary weights.
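A minimal Python sketch shows how easily arbitrary weights reverse a quality ordering. The two scales echo the 4-versus-12 and 14-versus-5 percent contrast just described, but the studies and point values are hypothetical:

    # Two hypothetical weighting schemes (points out of 100) that value the
    # same two quality dimensions very differently.
    scale_a = {"randomization": 4, "outcome_masking": 12}
    scale_b = {"randomization": 14, "outcome_masking": 5}

    # Hypothetical studies coded 1 if the dimension was satisfied, 0 if not.
    studies = {
        "study_1": {"randomization": 1, "outcome_masking": 0},
        "study_2": {"randomization": 0, "outcome_masking": 1},
    }

    def quality_score(study, scale):
        # Total points awarded for the dimensions the study satisfies.
        return sum(pts for dim, pts in scale.items() if study[dim])

    for name, study in studies.items():
        print(name, quality_score(study, scale_a), quality_score(study, scale_b))
    # study_1 earns 4 points under scale A but 14 under scale B; study_2 earns
    # 12 under A but 5 under B. Which study counts as "higher quality" depends
    # entirely on the weighting scheme.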
Further, the scales Jüni and his colleagues reviewed all rely on single scores to represent a study's quality (1999). Especially when scales focus on more than one aspect of validity, the single score approach results in a score that is summed across very different aspects of study design and implementation, many of which are not necessarily related to one another (for example, the validity of outcome measures and the mechanism used to allocate units to groups). When scales combine disparate elements of study design into a single score, it is likely that important considerations of design are being obscured. A study with strong internal validity but weak external validity might have a score identical to a study with weak internal validity and strong external validity. If the quality of these studies is expressed as a single number, how would users of the synthesis know the difference between studies with such different characteristics?

To address these problems, Jeffrey Valentine and Harris Cooper created the Study Design and Implementation Assessment Device (Study DIAD) in conjunction with a panel that included experts in the social and medical sciences (2008). The Study DIAD is a multilevel and hierarchical instrument that results in a profile of scores rather than a single score to represent study quality. In addition, in recognition of the fact that important quality features may vary across research questions, the instrument was designed to allow for contextual variation in quality indicators. Further, it was written with explicit attention to operationally defining the quality indicators, something notably absent from most quality scales.

The Study DIAD is not a complete solution to the problems faced by individuals wishing to use a quality scale as part of the study screening process. It requires a great deal of both methodological and substantive expertise to implement, and relative to other scales is human resource intensive. In addition, all quality scales that I have seen, including the Study DIAD, pay no attention to the direction of bias associated with the quality indicator. To borrow an example from Sander Greenland and Keith O'Rourke (2001), it may be that, due to the way selection mechanisms operate in a given area, the lack of random assignment of units results in an overestimate of the treatment effect, and both overall attrition and differential attrition result in an underestimate of the treatment effect. That is, different quality indicators might be associated with both different directions and degrees of bias. Thus, a study that does not randomly assign units but does experience overall attrition could experience two sources of bias that act in opposite directions. Unless a quality scale captures these directions, the resulting quality score will not properly rank studies in terms of their degree of bias (Greenland and O'Rourke 2001). Again, one problem is that in most areas of inquiry, we simply do not know enough to be able to predict the direction of bias associated with any given quality indicator. Another problem is that even if we had this information, quality scales as currently constructed do not capture the possibility that bias might cancel out across a set of potentially biasing design and implementation features.

7.4.3 A Priori Strategies and Missing Data

I mentioned earlier that synthesists will sometimes be unable to determine a study's status on a given quality indicator. The best case scenario for missing data occurs when the synthesists can make relatively safe assumptions about the quality dimension when that information is missing. For example, if a study does not mention that participants were randomly assigned to groups, it is often reasonable to assume that some process other than random assignment was used.

Unfortunately for synthesists using quality criteria as screens for determining whether to include a particular piece of evidence, most decisions are not so straightforward. As such, synthesists will have to make difficult choices, often with little or no empirical evidence to guide their decisions. For example, assume a synthesist chooses to exclude nonrandomized experiments, as well as studies in which overall attrition was too high (arbitrarily defined as greater than 20 percent) or differential attrition was too high (arbitrarily defined as greater than a 10 percent difference). The synthesist might feel comfortable excluding studies that are silent about the process used to assign participants to groups, under the assumption that if the process was random the authors are highly likely to have mentioned it. Assumptions about attrition, however, are probably not as safe in most research contexts. Probably the best the synthesist can do is to be explicit about what rule was used and the rationale for using it, so that others can critically evaluate that decision.
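To make such a rule concrete, the following Python sketch computes both quantities from time 1 and time 2 group sizes (the information captured by items 14 through 17 of table 7.3, presented later) and applies the example thresholds. The sample sizes are hypothetical, and defining differential attrition as the absolute gap between the two group loss rates is one operationalization among several:

    def attrition_rates(n1_treat, n1_comp, n2_treat, n2_comp):
        # Overall attrition: proportion of the combined sample lost by time 2.
        overall = 1 - (n2_treat + n2_comp) / (n1_treat + n1_comp)
        # Differential attrition: absolute gap between the group loss rates.
        differential = abs((1 - n2_treat / n1_treat) - (1 - n2_comp / n1_comp))
        return overall, differential

    # Hypothetical study: 100 per group at assignment; 70 and 85 at posttest.
    overall, differential = attrition_rates(100, 100, 70, 85)
    excluded = overall > 0.20 or differential > 0.10
    print(f"overall = {overall:.1%}, differential = {differential:.1%}, "
          f"excluded = {excluded}")
    # overall = 22.5%, differential = 15.0% -> excluded under the example rule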
7.4.4 Summary of A Priori Strategies

Earlier I referred to validity as the approximate truth of an inference, and noted that although we often refer to studies as high or low validity, in reality, validity is not a dichotomous characteristic of studies but rather a continuous characteristic of inferences. The continuous nature often makes using the concept in a dichotomous fashion difficult, such as deciding that some particular cutoff point on some criterion should lead to a study's exclusion. As a further complication, there is often little evidence or strong theory supporting either a specific cutoff point or even a general rule (for example, attrition biases a body of literature) for determining when a particular design or implementation problem warrants excluding a study from a synthesis, nor do most fields have a substantial evidence base on which to make predictions about the direction of bias for many quality indicators. These characteristics make the use of a priori strategies questionable in many areas of inquiry and for most criteria. Ironically, scholars who rely on a priori strategies may be excluding the very evidence that might help them begin to establish the rules.

7.5 POST HOC ASSESSMENT

7.5.1 Quality As an Empirical Question

Because of the many problems associated with a priori approaches to addressing study quality, some scholars adopt a post hoc approach to examining the impact of design and implementation quality on study results. Typified by Mary Smith and Gene Glass's meta-analysis of the effects of psychotherapy, one method involves treating the impact of design and implementation features on study outcomes as an empirical question (1977; see also chapter 10, this volume). Using analogs to the analysis of variance and multiple regression, synthesists using this strategy attempt to determine statistically whether certain study characteristics influence study results. A team of synthesists might compare the results of studies using random assignment to those using nonrandom methods. If studies using random and nonrandom methods arrive at similar results, this might be taken as evidence that nonrandom assignment did not result in consistent bias. On the other hand, if these assignment methods yield different results, bias might be associated with nonrandom assignment.

The virtue of this approach is that it relies much less on arbitrary decision rules for determining what evidence ought to be counted. It has, however, two potential drawbacks. First, like any other inferential test, tests of moderators related to quality in a research synthesis should be conducted under conditions of adequate statistical power (see Hedges and Pigott 2001, 2004). A failure to reject the null hypothesis is much more persuasive if the researchers can claim to have had good power to detect a meaningful difference. It is less persuasive if researchers were relatively unlikely to reject a false null hypothesis due to low power. As such, the viability of the empirical approach to the assessment of the impact of study quality depends greatly on having enough information to conduct fair (that is, well-powered) tests of the critical aspects of quality.

A second potential problem is that, in all likelihood, the quality characteristics were not randomly assigned to studies. This problem relates to what Cooper refers to as the distinction between study-generated evidence and synthesis-generated evidence (1998; see also chapter 2, this volume). Evidence generated from study-based comparisons between groups composed through random assignment of units permits causal inferences. Evidence generated from synthesis-based comparisons (that is, comparisons between studies with different design characteristics) was not constructed using random assignment and therefore yields evidence about association only, not about causality. Study features will likely be interrelated. For example, studies using random assignment may tend to be of shorter duration than studies using nonrandom methods to create groups. Further, it is generally not possible to rule out a spurious relationship between a quality indicator, the study's estimated effect size, and some unknown third variable. As a result, it is often difficult to tease apart the effects of the interrelated quality indicators and unknown third variables. Even when scholars have enough data for well-powered comparisons using techniques such as meta-regression, ambiguity about how to interpret the results will often be considerable.
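One common analog-to-ANOVA procedure is a between-groups Q test on subgroup means, which under the null hypothesis of no moderation follows a chi-square distribution with one fewer degrees of freedom than there are groups. The Python sketch below, with invented effect sizes, compares fixed-effect pooled means for random- and nonrandom-assignment studies:

    import numpy as np
    from scipy import stats

    def pooled(d, v):
        # Fixed-effect weighted mean of effect sizes and the variance of that mean.
        w = 1 / v
        return np.sum(w * d) / np.sum(w), 1 / np.sum(w)

    # Hypothetical standardized mean differences (d) and their variances (v).
    d_rand = np.array([0.30, 0.25, 0.40]); v_rand = np.array([0.02, 0.03, 0.02])
    d_nonr = np.array([0.55, 0.60, 0.48]); v_nonr = np.array([0.04, 0.03, 0.05])

    groups = [pooled(d_rand, v_rand), pooled(d_nonr, v_nonr)]
    grand, _ = pooled(np.concatenate([d_rand, d_nonr]),
                      np.concatenate([v_rand, v_nonr]))

    # Q-between: weighted squared deviations of subgroup means from the grand mean.
    q_between = sum((m - grand) ** 2 / var for m, var in groups)
    p_value = stats.chi2.sf(q_between, df=len(groups) - 1)
    print(f"Q_between = {q_between:.2f}, p = {p_value:.4f}")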
7.5.2 Empirical Uses

If researchers can find or create a quality scale that avoids the problems noted, it might make sense to use quality scores in the empirical assessment of the impact of quality on study outcomes. For example, a quality scale focused on limited dimensions of internal validity might result in a score that could be used as a moderator in statistical analyses. If studies scoring well on that scale yielded different study outcomes than those not scoring well, it might be an indication that an internal validity problem was biasing study outcomes. Note, however, that as pointed out earlier, if the different quality indicators affect the direction of bias differently, a summary of those indicators is unlikely to be a valid indicator of quality unless the instrument can capture the complexity.

7.5.3 Post Hoc Strategies and Missing Data

Researchers who adopt a largely empirical approach to investigating the effect of study quality on study outcomes find themselves in a somewhat better position than a priori synthesists to address missing data. For example, they may be able to compare effect sizes for studies in which random assignment clearly occurred to those in which it clearly did not and to those in which it was unclear whether it did occur. Similar to their a priori counterparts, they could also adopt a rule governing what should be assumed if a quality indicator is missing. Then, as a sensitivity analysis, they could eliminate those studies from the analysis and examine the impact of the decision rule on the statistical conclusion. This approach provides a flexible method of dealing with the inevitable problem of missing data and is another point favoring the post hoc approach.
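A sketch of that sensitivity analysis in Python, with invented effect sizes and an invented coding of the assignment mechanism, might look like this:

    import numpy as np

    def weighted_mean(d, v):
        # Fixed-effect weighted mean effect size.
        w = 1 / v
        return np.sum(w * d) / np.sum(w)

    # Hypothetical studies: effect size, variance, and coded assignment mechanism.
    d = np.array([0.42, 0.35, 0.50, 0.20, 0.31])
    v = np.array([0.02, 0.04, 0.03, 0.02, 0.05])
    mech = np.array(["random", "random", "cant_tell", "random", "cant_tell"])

    # Decision rule: first keep the "can't tell" studies, then drop them and
    # compare, to see whether the rule drives the statistical conclusion.
    keep = mech != "cant_tell"
    print("with cant-tell studies:   ", round(float(weighted_mean(d, v)), 3))
    print("without cant-tell studies:", round(float(weighted_mean(d[keep], v[keep])), 3))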
7.5.4 Summary of Post Hoc Assessment

The post hoc approach to addressing study quality in a research synthesis treats the impact of study quality as an empirical question. This strategy has a number of benefits. Most notably, it relies less on ad hoc, arbitrary, and unsupported decision rules. It also provides the synthesists with the ability to empirically test relationships that are plausibly related to study quality. However, when a literature does not have enough evidence to conduct a fair statistical test (that is, one with sufficient statistical power to detect an effect of some size deemed important), scholars will not be in a good position to evaluate the impact of quality on their conclusions. Further, even if the impact of quality can be investigated using credible statistical tests, the analyses will almost certainly be correlational, a fact limiting the strength of the inferences that can be drawn about the relation between study quality and study outcomes.

7.6 RECOMMENDATIONS

I have outlined two broad strategies for dealing with the near certainty that studies relevant to a particular synthesis will vary in terms of their research designs and implementation. Although most synthesists adopt a blend of the two strategies, some rely more on a priori strategies and others are more comfortable with post hoc approaches. Neither approach is without problems. So, how should synthesists approach the issue of study quality? One approach is to use a few highly defensible criteria for excluding some studies. For example, a synthesist might be working in a field in which it seems clear that differential attrition positively biases estimates of effect. Ideally, the evidence for this assertion would be closely tied to the research context in which the synthesist is working. It would not be enough, for example, for a synthesist to assert that attrition results in a consistent bias, because that claim is neither specific nor tied to the synthesist's research question. A better assertion might be that research has shown that, in classroom studies of the type we are examining, differential attrition results in a consistent underestimate of the intervention effect, because high-need students are more likely to drop out of the comparison condition. Then, other characteristics of study quality that lack a solid empirical justification could be coded and addressed empirically. This approach allows synthesists to screen out studies that the empirical evidence suggests will bias the results of a synthesis, yet allows them to empirically test the impact of other study features lacking a reasonable empirical basis.

For synthesists investigating causal relations, one important consideration that deserves special attention is whether or not to include studies in which units were not randomly assigned to conditions. Again, the relevant issue pertains to the effect that bias at the level of the individual study has on the estimation of effects in a collection of related studies. Because the nature of the relation between assignment mechanism and bias is likely to depend heavily on the research context, researchers should strongly consider collecting both randomized and nonrandomized experiments and empirically investigating both the selection mechanisms that appear to be operating in the nonrandomized experiments and the relationship between assignment mechanism and study outcomes. However, averaging effects across the assignment mechanisms is something that should be done very cautiously, if at all (Shadish and Myers 2001).

Of course, there are exceptions. If a team of synthesists has the luxury of a large number of randomized experiments that appear to adequately represent the constructs of interest (for example, variations in participants and outcome measures), then a strong case can be made for excluding all but randomized comparisons. Further, as suggested earlier, the strong theoretical underpinnings of the preference for randomization in studies investigating causal relations could be viewed as a justification for excluding all studies that do not employ randomization. Even in the absence of empirical evidence, there is ample reason to be cautious, perhaps to the extent of excluding all but randomized comparisons.
Table 7.2 Relevance Screen

Coder initials: ___ ___ ___
Date: _____________

Report Characteristics

1. First author (Last, initials)
2. Journal
3. Volume
4. Pages

Inclusion Criteria

5. Is this study an empirical investigation of the effects of teacher expectancies?  0. No  1. Yes
6. Is the outcome a measure of IQ?  0. No  1. Yes
7. Are the study participants in grades 1–5 at the start of the study?  0. No  1. Yes
8. Does the study involve a between-teacher comparison (i.e., some teachers hold an expectancy while others do not)?  0. No  1. Yes

source: Author's compilation.

7.6.1 Coding Study Dimensions

The specific aspects of study design and implementation related to study quality will vary as a function of the specific research question. Research syntheses that aim to address issues relating to uncovering causal relations, however, address many issues faced in all types of quantitative research. As such, what follows is an example set of codes designed to investigate the problem of whether teacher expectancies affect student IQ. Two aspects of study quality are examined: internal validity and the construct validity of the outcome measures. Table 7.2 provides an example of how studies can be screened for relevance. For included studies, table 7.3 shows how one might code issues related to research design, and table 7.4 provides examples for coding the outcome measures.
Table 7.3 Coding for Internal Validity

9. Sampling strategy
   1. Randomly sampled from a defined population
   2. Stratified sampling from a defined population
   3. Cluster sampling
   4. Convenience sample
   5. Can't tell

10. Group assignment mechanism
   1. Random assignment
   2. Haphazard assignment
   3. Other nonrandom assignment
   4. Can't tell

11. Assignment mechanism
   1. Self-selected into groups
   2. Selected into groups by others on a basis related to outcome (e.g., good readers placed in the expectancy group)
   3. Selected into groups by others not known to be related to outcome (e.g., randomized experiment)
   4. Can't tell

12. Equating variables
   0. None
   1. Prior IQ
   2. Prior achievement
   3. Other _________________
   4. Can't tell

13. If equating was done, was it done using a statistical process (e.g., ANCOVA, weighting) or manually (e.g., hand matching)?
   0. n/a, equating was not done
   1. Statistically
   2. Manual matching

14. Time 1 sample size, expectancy group (at random assignment)  ___ ___ ___
15. Time 1 sample size, comparison group (at random assignment)  ___ ___ ___
16. Time 2 sample size, expectancy group (for effect size estimation)  ___ ___ ___
17. Time 2 sample size, comparison group (for effect size estimation)  ___ ___ ___

source: Author's compilation.
Table 7.4 Coding Construct Validity

18. IQ measure used in study
   1. Stanford-Binet 5
   2. Wechsler (WISC) III
   3. Woodcock-Johnson III
   4. Other ________________
   5. Can't tell ________________

19. Score reliability for IQ measure (if a range is given, code the median value)  ___ ___

20. Metric for score reliability
   1. Internal consistency
   2. Split-half
   3. Test-retest
   4. Can't tell
   5. None given

21. Source of score reliability estimate
   1. Current sample
   2. Citation from another study
   3. Can't tell
   4. None given

22. Is the validity of the IQ measure mentioned?  0. No  1. Yes

23. If yes, was a specific validity estimate given?  0. No  1. Yes

24. If yes, what was the estimate?  ___ ___ ___

25. If a validity estimate was given in the report, what was the source of the estimate?
   1. Current sample
   2. Citation from another study
   3. Can't tell
   4. Validity not mentioned in report

26. What was the nature of the validity estimate?
   1. Concurrent validity
   2. Convergent validity
   3. Predictive validity
   4. Other
   5. Can't tell (simply asserted that the measure is valid)
   6. n/a, validity was not mentioned in the report

source: Author's compilation.
The example set out, of course, represents only a fraction of the study characteristics one would actually code in a synthesis of this nature. For instance, synthesists also might want to know as much as possible about how the expectancies were generated and the specific nature of the expectancy. These important substantive issues will vary from synthesis to synthesis. For example, some researchers may manipulate teacher expectancies through modifications of the students' school files, and others may do so by manipulating verbal feedback from the students' previous teachers. Further guidance can be found in many textbooks (see, for example, Cooper 1998; Lipsey and Wilson 2001) and chapters (for example, Shadish and Myers 2001, especially chapter 13), as well as in the policy briefs of the Campbell Collaboration (Shadish and Myers 2001) and the Open Learning materials of the Cochrane Collaboration (Alderson and Green 2002).

7.6.2 The Relevance Screen

In the example that follows, assume the synthesists have established that, to be included, studies must meet four criteria. A study must examine the impact of teacher expectancies on the IQ of students in grades one through five, and must do so by manipulating teacher expectancy conditions and comparing students for whom teachers were given varying expectancies. Any study not meeting these criteria would be screened out. So, for example, studies that investigate the impact of teacher expectancies on student grades, or that do so by examining naturally occurring variation in expectancies and how they covary with IQ, would be excluded. Table 7.2 presents a coding form that will help guide the relevance screening process. Note that if any one of questions five through eight is answered in the negative, the study will be excluded from further consideration.
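Expressed as code, the screen is a simple conjunction of the four criteria. The Python sketch below uses invented field names standing in for items 5 through 8 of table 7.2:

    def passes_relevance_screen(study):
        # Items 5-8 of table 7.2: a single "no" (False) excludes the study.
        return all((
            study["empirical_teacher_expectancy"],   # item 5
            study["outcome_is_iq"],                  # item 6
            study["participants_in_grades_1_to_5"],  # item 7
            study["between_teacher_comparison"],     # item 8
        ))

    example = {
        "empirical_teacher_expectancy": True,
        "outcome_is_iq": True,
        "participants_in_grades_1_to_5": True,
        "between_teacher_comparison": False,  # item 8 answered "no"
    }
    print(passes_relevance_screen(example))  # False -> screened out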
7.6.3 Example Coding

Coding aspects of studies related to internal validity requires attention both to the study design and to how that study was carried out. Table 7.3 includes several design issues, such as how groups were formed and what variables were used to equate students, if applicable. Because this synthesis could also include nonrandomized studies, table 7.3 includes an item (#12) designed to capture procedures the researchers might have used to make the groups more comparable (though it would be good to know this information even in a synthesis limited to randomized comparisons). The table does not include a code designed to capture how the comparison group was treated, because the assumption is that all comparison students would be receiving treatment as usual. If, however, this assumption were not accurate, the synthesists would have to add an item to the coding guide and recode all previous studies on that dimension.

Many categories of designs that might be present in other contexts are not in the list of possible designs. For example, within-group comparisons are not included, so categories for interrupted time series and one-group pretest-posttest designs do not appear on the coding form. Regression discontinuity designs also are not represented. An "Other" code, however, is included so that unexpected designs can be tracked.

Also of note in table 7.3 is that the coding sheet includes a mechanism for capturing sample sizes at the beginning and at the end of the study for both the expectancy and the comparison groups (questions 14 through 17). This information will allow the synthesists to compute both overall and differential attrition rates.

Finally, table 7.4 addresses aspects of coding that relate to issues of the construct validity of the outcome measures. Here, the name of the IQ measure and the score reliability estimate given in the study are coded, along with some aspects related to the validity of the IQ measure.

7.7 CONCLUSION

Although most synthesists use a combination of a priori and post hoc approaches and neither one works perfectly in every situation, it should be clear that I believe a priori rules are used too often in research synthesis. Absent empirical evidence that justifies the rules employed and their operational definitions, it seems unlikely that these rules serve to increase the validity of the inferences arising from a research synthesis. A better approach is to rely on these rules only when strong empirical justification exists to support them, and to attempt to investigate empirically the influence of varying dimensions of study quality on the inferences arising from those studies.

7.8 REFERENCES

Alderson, Phil, and Sally Green. 2002. Cochrane Collaboration Open Learning Material for Reviewers. http://www.cochrane-net.org/openlearning.

Berlin, Jesse, and Drummond Rennie. 1999. "Measuring the Quality of Trials: The Quality of Quality Scales." Journal of the American Medical Association 282(11): 1053–55.
Briss, Peter A., Stephanie Zaza, Marguerite Pappaioanou, Jonathan Fielding, Linda Wright-De Agüero, et al. 2000. "Developing an Evidence-Based Guide to Community Preventive Services–Methods." American Journal of Preventive Medicine 18(1S): 35–43.

Campbell, Donald T., and Julian C. Stanley. 1966. Experimental and Quasi-experimental Designs for Research. Chicago: Rand McNally.

Chalmers, Thomas C., Paul Celano, Henry S. Sacks, and Harry Smith Jr. 1983. "Bias in Treatment Assignment in Controlled Clinical Trials." New England Journal of Medicine 309: 1358–61.

Colditz, Graham A., James N. Miller, and Frederick Mosteller. 1989. "How Study Design Affects Outcomes in Comparisons of Therapy, I: Medical." Statistics in Medicine 8(4): 441–54.

Cook, Thomas D., and Donald T. Campbell. 1979. Quasi-Experimentation: Design and Analysis Issues for Field Settings. Boston, Mass.: Houghton Mifflin.

Cooper, Harris. 1998. Synthesizing Research: A Guide for Literature Reviews, 3rd ed. Thousand Oaks, Calif.: Sage Publications.

Coren, Ester, and Mike Fisher. 2006. The Conduct of Systematic Research Reviews for SCIE Knowledge Reviews. London: Social Care Institute for Excellence. http://www.scie.org.uk/publications/researchresources/rr01.pdf.

DuBois, David L., Bruce E. Holloway, Jeffrey C. Valentine, and Harris Cooper. 2002. "Effectiveness of Mentoring Programs for Youth: A Meta-Analytic Review." American Journal of Community Psychology 30(2): 157–97.

Fowler, Robert L. 1988. "Estimating the Standardized Mean Difference in Intervention Studies." Journal of Educational Statistics 13(4): 337–50.

Glazerman, Steven, Dan M. Levy, and David Myers. 2003. "Nonexperimental Versus Experimental Estimates of Earnings Impacts." Annals of the American Academy of Political and Social Science 589(1): 63–93.

Greenland, Sander, and Keith O'Rourke. 2001. "On the Bias Produced by Quality Scores in Meta-Analysis, and a Hierarchical View of Proposed Solutions." Biostatistics 2(4): 463–71.

Guilford, Joy P. 1954. Psychometric Methods. New York: McGraw-Hill.

Hedges, Larry V., and Therese D. Pigott. 2001. "The Power of Statistical Tests in Meta-Analysis." Psychological Methods 6(3): 203–17.

———. 2004. "The Power of Statistical Tests for Moderators in Meta-Analysis." Psychological Methods 9(4): 426–45.

Hedges, Larry V. 2007. "Correcting a Significance Test for Clustering." Journal of Educational and Behavioral Statistics 32(2): 151–79.

Hollis, Sally, and Fiona Campbell. 1999. "What Is Meant by Intention to Treat Analysis? Survey of Published Randomised Controlled Trials." British Medical Journal 319(7211): 670–74.

Jüni, Peter, Anne Witschi, Ralph Bloch, and Matthias Egger. 1999. "The Hazards of Scoring the Quality of Clinical Trials for Meta-Analysis." Journal of the American Medical Association 282(11): 1054–60.

Lipsey, Mark W., and David B. Wilson. 1993. "The Efficacy of Psychological, Educational, and Behavioral Treatment: Confirmation from Meta-Analysis." American Psychologist 48(12): 1181–1209.

Miller, James N., Graham A. Colditz, and Frederick Mosteller. 1989. "How Study Design Affects Outcomes in Comparisons of Therapy, II: Surgical." Statistics in Medicine 8(4): 455–66.

Murray, David M. 1998. Design and Analysis of Group-Randomized Trials. New York: Oxford University Press.

Schulz, Kenneth F., Iain Chalmers, Richard J. Hayes, and Douglas G. Altman. 1995. "Empirical Evidence of Bias: Dimensions of Methodological Quality Associated with Estimates of Treatment Effects in Controlled Trials." Journal of the American Medical Association 273(5): 408–12.

Shadish, William R. 1989. "The Perception and Evaluation of Quality in Science." In Psychology of Science: Contributions to Metascience, edited by Barry Gholson, William R. Shadish Jr., Robert A. Niemeyer, and Arthur C. Houts. Cambridge: Cambridge University Press.

Shadish, William, and David Myers. 2001. "Campbell Collaboration Research Design Policy Brief." http://www.campbellcollaboration.org/MG/ResDesPolicyBrief.pdf.

Shadish, William R., and Kevin Ragsdale. 1996. "Random Versus Nonrandom Assignment in Psychotherapy Experiments: Do You Get the Same Answer?" Journal of Consulting and Clinical Psychology 64(6): 1290–1305.

Shadish, William R., Thomas D. Cook, and Donald T. Campbell. 2002. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston, Mass.: Houghton Mifflin.

Smith, Mary Lee, and Gene V. Glass. 1977. "Meta-Analysis of Psychotherapy Outcome Studies." American Psychologist 32(9): 752–60.

Thompson, Bruce, ed. 2002. Score Reliability: Contemporary Thinking on Reliability Issues. Newbury Park, Calif.: Sage Publications.

Valentine, Jeffrey C., and Harris Cooper. 2008. "A Systematic and Transparent Approach for Assessing the Methodological Quality of Intervention Effectiveness Research: The Study Design and Implementation Assessment Device (Study DIAD)." Psychological Methods 13(2): 130–49.
Valentine, Jeffrey C., and Cathleen M. McHugh. 2007. "The Effects of Attrition on Baseline Group Comparability in Randomized Experiments in Education: A Meta-Analytic Review." Psychological Methods 12(3): 268–82.

What Works Clearinghouse. 2006. What Works Clearinghouse Evidence Standards for Reviewing Studies, revised September 2006. http://www.whatworksclearinghouse.org/reviewprocess/study_standards_final.pdf.

Williamson, P. R., Carrol Gamble, Douglas G. Altman, and Jane L. Hutton. 2005. "Outcome Selection Bias in Meta-Analysis." Statistical Methods in Medical Research 14(5): 515–24.

Zaza, Stephanie, Linda Wright-De Agüero, Peter A. Briss, Benedict I. Truman, David P. Hopkins, Michael H. Hennessy, et al. 2000. "Data Collection Instrument and Procedure for Systematic Reviews in the Guide to Community Preventive Services." American Journal of Preventive Medicine 18(1S): 44–70.
8
IDENTIFYING INTERESTING VARIABLES AND ANALYSIS OPPORTUNITIES

MARK W. LIPSEY
Vanderbilt University

CONTENTS
8.1 Introduction 148
8.1.1 Potentially Interesting Variables 148
8.1.2 Study Results 148
8.1.2.1 Results on Multiple Measures 148
8.1.2.2 Results for Subsamples 149
8.1.2.3 Results Measured at Different Times 149
8.1.2.4 The Array of Study Results 150
8.1.3 Study Descriptors 150
8.2 Analysis Opportunities 152
8.2.1 Descriptive Analysis 152
8.2.1.1 Descriptive Information About Study Results 152
8.2.1.2 Study Descriptors 152
8.2.2 Relationships Among Study Descriptors 153
8.2.3 Relationships Between Effect Sizes and Study Descriptors 154
8.2.3.1 Extrinsic Variables 154
8.2.3.2 Method Variables 154
8.2.3.3 Substantive Variables 155
8.2.4 Relationships Among Study Results 155
8.3 Conclusion 157
8.4 References 157

8.1 INTRODUCTION

Research synthesis relies on information reported in a selection of studies on a topic of interest. The purpose of this chapter is to examine the types of variables that can be coded from those studies and to outline the kinds of relationships that can be examined in the analysis of the resulting data. It identifies a range of analysis opportunities and endeavors to stimulate the synthesist's thinking about what data might be used in a research synthesis and what sorts of questions might be addressed.

8.1.1 Potentially Interesting Variables

Research synthesis revolves around the effect sizes that summarize the main findings of each study of interest. As detailed elsewhere in this volume, effect sizes come in a variety of forms (correlation coefficients, standardized differences between means, odds ratios, and so forth), depending on the nature of the quantitative study results they represent. Study results embodied in effect sizes, therefore, constitute one major type of variable that is sure to be of interest.

Research studies, however, have other characteristics that may be of interest in addition to their results. For instance, they might be described in terms of the nature of the research designs and procedures used, the attributes of the subject samples, and various other features of the settings, personnel, activities, and circumstances involved. These characteristics taken together constitute the second major category of variables of potential concern to the research synthesist: study descriptors.

8.1.2 Study Results

Whatever the nature of the issues bearing on study results, one of the first challenges the synthesist faces is the likelihood that many of the studies of interest will report multiple results relevant to those issues. If we define a study to mean a set of observations taken on a subject sample on one or more occasions, there are three possible forms of multiple results that may occur. There may be different results for different measures, for different subsamples, and for different times of measurement. All of these variations have implications for conceptualizing, coding, and analyzing effect sizes.

8.1.2.1 Results on Multiple Measures. Each study in a synthesis may report results on more than one measure. These results may represent different constructs, that is, different things being measured, such as academic achievement, attendance, and attitudes toward school. They may also represent multiple measures of the same construct, for instance achievement measured both by a standardized achievement test and grade point average. A study of predictors of juvenile delinquency thus might report the correlations of one measure each of age, gender, school achievement, and family structure with a central delinquency measure. Each such correlation can be coded as a separate effect size. Similarly, a study of gender differences in aggression might compare males and females on physical aggression measured two different ways, verbal aggression measured three ways, and the aggressive content of fantasies measured one way, yielding six possible effect sizes. Moreover, the various studies eligible for a research synthesis may differ among themselves in the type, number, and mix of measures.

The synthesist must decide what categories of constructs and measures to define and what effect sizes to code in each. The basic options are threefold. One approach is to code effect sizes on all measures for all constructs in relatively undifferentiated form. This strategy would yield, in essence, a single global category of study results. For example, the outcome measures used in research on the effectiveness of psychotherapy show little commonality from study to study. A synthesist might, for some purposes, treat all these outcomes as the same, that is, as instances of results relevant to some general construct of personal functioning. An average over all the resulting effect sizes addresses the global question of whether psychotherapy has generally positive effects on the mix of measures typical to psychotherapy research. This is the approach used in Mary Smith and Gene Glass's classic meta-analysis (1977).

At the other extreme, effect sizes may be coded only when they relate to a particular measure of interest to the synthesist, that is, a specific operationalization of the construct of interest or, perhaps, a few such specific operationalizations. Mean effect sizes would then be computed and reported separately for each such measure. In a meta-analysis of the effects of transcranial magnetic stimulation on depression, for instance, Jose Luis Martin and his colleagues examined effect sizes only for outcomes measured on the Beck Depression Inventory and, separately, those measured on the Hamilton Rating Scale for Depression (2003).

An intermediate strategy is to define one or more sets of constructs or measurement categories that include different operationalizations but distinguish different content domains of interest.
Study results that fell within those categories would then be coded but other results would be ignored. In a synthesis of the effects of exercise for older adults, for example, Yael Netz and her colleagues distinguished four physical fitness outcome constructs (cardiovascular, strength, flexibility, and functional capacity) and ten psychological outcome constructs (anxiety, depression, energy, life satisfaction, self-efficacy, and so forth), while allowing different ways of measuring the respective construct within each category (2005). These constructs were considered to represent distinctly different kinds of outcomes, and the corresponding effect sizes were kept separate in the coding and analysis.

Whether global or highly specific construct categories are defined for coding effect sizes, the synthesist is likely to find some studies that report results on multiple measures within a given construct category. In such cases, one of several approaches can be taken. The synthesist can simply code and analyze all the effect sizes contributed to a category, including multiple ones from the same study. This permits some studies to provide more effect sizes than others and thus to be overrepresented in the synthesis results. It also introduces statistical dependencies among the effect sizes because some of them are based on the same subject samples. These are issues that will have to be dealt with in the analysis. Alternatively, criteria can be developed for selecting the single most appropriate measure within each construct category and ignoring the remainder (despite the related loss of data). Or multiple effect sizes within a construct category can be coded and then averaged (perhaps using a weighted average) to yield a single mean or median effect size within that category for each study (for further discussion of these issues, see chapter 19, this volume).
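The third option, averaging within study and construct category, can be sketched in a few lines of Python. The studies, categories, and effect sizes here are hypothetical, and a simple unweighted mean stands in for the weighting choices discussed in chapter 19:

    from collections import defaultdict
    import numpy as np

    # Hypothetical coded rows: (study id, construct category, effect size d).
    rows = [
        ("study_1", "achievement", 0.30),
        ("study_1", "achievement", 0.42),  # two measures of one construct
        ("study_1", "attitudes",   0.10),
        ("study_2", "achievement", 0.25),
    ]

    # Average within study and construct so each study contributes a single
    # effect size per category.
    groups = defaultdict(list)
    for study, construct, d in rows:
        groups[(study, construct)].append(d)
    for key, values in sorted(groups.items()):
        print(key, round(float(np.mean(values)), 3))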
The most important consideration is the purpose of the synthesis. For some purposes, the study results of interest may involve only one construct, such as when a synthesis of research evaluating educational innovations focuses on achievement test results. For others, the synthesist may be interested in a broader range of study results and welcome diverse measures and constructs. Having a clear view of the purposes of a research synthesis, and adopting criteria for selecting and categorizing study results consistent with those purposes, are fundamental requirements of good synthesis. These steps define the effect sizes the synthesist will be able to analyze and thus shape the findings of the synthesis.

8.1.2.2 Results for Subsamples. In addition to overall results, many studies within a synthesis may report breakdowns for one or more subject subsamples. For example, a validation study of a personnel selection test might report test-criterion correlations for different industries, or a study of the effects of drug counseling for teenagers might report results separately for males and females. Coding effect sizes for each subsample permits a more discriminating analysis of the relationship between subject characteristics and study results than is possible from the overall study results alone.

Two considerations bear on the potential utility of effect sizes from subsamples. First, the variable distinguishing subsamples must represent a dimension of potential interest to the synthesist. Effect sizes computed separately for male and female subsamples, for instance, will be of concern only if there are theoretical, practical, or exploratory reasons to examine whether gender moderates the magnitude of effects. Second, a given subject breakdown must be reported widely enough to yield a sufficient number of effect sizes to permit meaningful analysis. If most studies do not report results separately for males and females, the utility in coding effect sizes for those subsamples is limited.

As with effect sizes from multiple measures within a single study, effect sizes for multiple subsamples within a study pose issues of statistical dependency if the subsamples are not mutually exclusive, that is, subsamples that share subjects. Although the female subsample in a breakdown by gender would not share subjects with the male subsample, it would almost certainly share subjects with any subsample resulting from a breakdown by age. In addition, however, any feature shared in common by subsamples within a study (for example, being studied by the same investigator) can introduce statistical dependencies into the effect sizes computed for those subsamples. Analysis of effect sizes based on subsamples, therefore, must be treated like any analysis that uses more than one effect size from each study and thus possibly violates assumptions of statistical independence (see chapter 19, this volume).

8.1.2.3 Results Measured at Different Times. Some studies may report findings on the same variables measured at different times. A study of the effects of a smoking cessation program, for example, might report the outcome immediately after treatment and at six-month and one-year follow-ups. Such results might also be reported both for the overall subject sample and for various subsamples. Such time-series information potentially permits an interesting analysis of temporal patterns in outcomes: decay curves for the persistence of treatment effects, for example.
As with effect sizes for subsamples, the synthesist must determine whether follow-up results are reported widely enough to yield enough data to be worth analyzing. Even in favorable circumstances, only a portion of the studies in a synthesis is likely to provide measures at different time points. A related problem is that the intervals between times of measurement may vary widely from study to study, making these results difficult to summarize across studies. If this occurs, one approach is for the synthesist to establish broad categories for follow-up periods and code each result in the category it most closely fits. Another approach is to code the time of measurement for each result as an interval from some common reference point, for instance the termination of treatment. The synthesist can then analyze the functional relationship between the effect size and the amount of time that has passed since the completion of treatment.
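One way to examine that functional relationship is a meta-regression of effect size on time since treatment, weighting each effect size by its precision. A minimal weighted least squares sketch in Python, with invented data, follows:

    import numpy as np

    # Hypothetical effect sizes, variances, and months since end of treatment.
    months = np.array([0.0, 6.0, 6.0, 12.0, 24.0])
    d = np.array([0.50, 0.40, 0.35, 0.28, 0.15])
    v = np.array([0.02, 0.03, 0.02, 0.04, 0.05])

    # Weighted least squares: regress d on time, weighting by precision (1/v).
    w = 1 / v
    X = np.column_stack([np.ones_like(months), months])
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * d))
    print(f"intercept = {beta[0]:.3f}, slope per month = {beta[1]:.4f}")
    # A negative slope describes the decay of the effect over follow-up time.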
Because the subjects at time 1 will be much the same (except for attrition) as those at follow-up time 2, the effect sizes for these different occasions will not be statistically independent. As with multiple measures or overlapping subsamples, these effect sizes cannot be included together in an analysis that assumes independent data points unless special adjustments are made.

8.1.2.4 The Array of Study Results. As the previous discussion indicates, the quantitative results from a study selected for synthesis may be reported for multiple constructs, for multiple measures of each construct, for the total subject sample, for various subsamples, and for multiple times of measurement. To effectively represent key study results and support interesting analyses, the synthesist must establish conceptual categories for each dimension. These categories will group those effect sizes judged to be substantially similar for the purposes of the synthesis, differentiate those believed to be importantly different, and ignore those judged irrelevant or uninteresting. Once the categories are developed, it should be possible to classify each effect size that can be coded from any study according to the type of construct, subsample, and time of measurement it represents. When enough studies yield comparable effect sizes, that set of results can be separately analyzed. Some studies will contribute only a single effect size (one measure, one sample, one time) to the synthesis. Others will report more differentiated results and may yield a number of useful effect sizes.

It follows that different studies may provide effect sizes for different analyses on different categories of effect sizes. Any analysis based on effect sizes from only a subset of the studies under investigation raises a question of the generalizability of the findings to the entire set, because of the distinctive features that may characterize that subset. For instance, in a synthesis of the correlates of effective job performance, the subset of studies that break out results by age groups might be mostly those conducted in large corporations. What is learned from them about age differences might not apply to employees of small businesses. Studies that report differentiated results, nonetheless, offer the synthesist the opportunity to conduct more probing analyses of the issues of interest. Including such differentiation in a research synthesis can add rich and useful detail to its findings.

8.1.3 Study Descriptors

Whatever the set of effect sizes under investigation, it is usually of interest to also examine matters related to the characteristics of the studies that yield those results. To accomplish this, the synthesist must identify and code information from each study about the particulars of its subjects, methods, treatments, and the like. Research synthesists have shown considerable diversity and ingenuity in the study characteristics they have coded and the coding schemes they have used (for a specific discussion of the development, application, and validation of coding schemes, see chapter 9, this volume). Described here are the general types of study descriptors that the synthesist might consider in anticipation of conducting certain analyses on the resulting data.

To provide a rudimentary conceptual framework for study descriptors, we first assume that study results are determined conjointly by the nature of the substantive phenomenon under investigation and the nature of the methods used to study it. The variant of the phenomenon selected for study and the particular methods applied, in turn, may be influenced by the characteristics of the researcher and the research context. These latter will be labeled extrinsic factors because they are extrinsic to the substantive and methodological factors assumed to directly shape study results. In addition, the actual results and characteristics of a study must be fully and accurately reported to be validly coded in a synthesis. Various aspects of study reporting thus constitute a second type of extrinsic factor because they, too, may be associated with the study results represented in a synthesis though they are not assumed to directly shape those results.

This scheme, though somewhat crude, serves to identify four general categories of study descriptors that may interest the synthesist: substantive features of the matter the studies investigate, particulars of the methods and procedures used to conduct those investigations, characteristics of the researcher and the research context, and attributes of the way the study is reported and the coding done by the synthesist.
The most important of these, of course, are the features substantively pertinent to characterizing the phenomenon under investigation. In this category are such matters as the nature of the treatment provided; the characteristics of the subjects used; the cultural, geographical, or temporal setting; and those other influences that might moderate the events or relationships under study. It is these variables that permit the synthesist to identify differences among studies in the substantive characteristics of the situations they investigate that may account for differences in their results. From such information, the synthesist may be able to determine that one treatment variation is more effective than another, that a relationship holds for certain types of subjects or circumstances but not for others, or that there is a developmental sequence that yields different results at different time periods. Findings such as these, of course, result in better understanding of the nature of the phenomenon under study and are the objective of most syntheses.

Those study characteristics not related to the substantive aspects of the phenomenon involve various possible sources of distortion, bias, or artifact in study results as they are operationalized in the original research or in the coding for the synthesis. The most important are methodological or procedural aspects of the way in which the studies were conducted. These include variations in the designs, research procedures, quality of measures, and forms of data analysis that might yield different study results even if every study were investigating the same phenomenon. A synthesist may be interested in examining the influence of method variables for two reasons. First, analysis of the relationships between method choices and study results provides useful information about which aspects of research procedures make the most difference and, hence, should be most carefully selected by researchers. Second, method differences confound substantive comparisons among study results and may thus need to be included in the analysis as statistical control variables to help disentangle the influence of methods on those results.

Factors extrinsic to both the substantive phenomenon and the research methods include characteristics of the researcher (such as gender and disciplinary affiliation), the research circumstances (such as the nature of study sponsorship), and reporting (such as the form of publication). Such factors would not typically be expected to directly shape study results but nonetheless may be related to something that does and thereby correlate with coded effect sizes. Whether a researcher is a psychologist or a sociologist, for instance, should not itself directly determine study results. Disciplinary affiliation may, however, influence methodological practices or the selection of variants of the phenomenon to study, which, in turn, could affect study results.

Another type of extrinsic factor involves those aspects of the reporting of study methods and results that might yield different values in the synthesist's coding even if all studies were investigating the same phenomenon with the same methods. For example, studies may not report important details of the procedures, measures, treatments, or results. This requires coders to engage in guesswork or record missing values on items that, with better reporting, could be coded accurately. Inadequacies in the information reported in a study, therefore, may influence the effect sizes coded, or the representation of substantive and methodological features closely related to effect sizes, in ways that distort the relationships of interest to the synthesis.

A synthesist who desires a full description of study characteristics will want to make some effort to identify and code all those factors thought to be related, or possibly related, to the study results of interest. From a practical perspective, the decision about what specific study characteristics to code will have to reconcile two competing considerations. First, synthesis provides a service by documenting in detail the nature of the body of research bearing on a given issue. This consideration motivates coding a broad range of study characteristics for descriptive purposes. On the other hand, many codable features of studies have limited value for anything but descriptive purposes: they may not be widely reported in the literature or may show little variation from study to study. Finally, of course, some study characteristics will simply not be germane (though it may be difficult to specify in advance what will prove relevant and what will not). Documenting study characteristics that are poorly reported, lacking in variation, or irrelevant to present purposes surely has some value, as the next section will argue. However, given how time-consuming and expensive coding in research synthesis is, the synthesist inevitably must find some balance between coding broadly for descriptive purposes and coding narrowly around the specific target issues of the particular synthesis.
152 CODING THE LITERATURE

8.2 ANALYSIS OPPORTUNITIES

A researcher embarks on a synthesis to answer certain questions on the basis of statistical analysis of data developed by systematically coding the results and characteristics of the studies under consideration. There are, of course, many technical issues involved in such statistical analysis; these are discussed elsewhere in this volume. For present purposes it is more important to obtain an overview of the various types of analysis and the kinds of insights they might yield. With such an overview in mind, the synthesist should be in a better position to know what sorts of questions might be answered by synthesis and what data must be coded to address them.

Four generic forms of analysis are outlined here. The first, descriptive analysis, uses the coded variables to provide an overall picture of the nature of the research studies included in the synthesis. The other three forms examine relationships among coded variables. As described previously, variables coded in a synthesis can be divided into those that describe study results (effect sizes) and those that describe study characteristics (study descriptors). Three general possibilities for analysis of relationships emerge from this scheme: among study descriptors, between study descriptors and effect sizes, and among study effect sizes.

8.2.1 Descriptive Analysis

By its nature, research synthesis involves the collection of information describing key results and various important attributes of the studies under investigation. Descriptive analysis of these data can provide a valuable picture of the nature of a research literature. It can help identify issues that have already been studied adequately and gaps in the literature that need additional study. It can also highlight common methodological practices and provide a basis for assessing the areas in which improvement is warranted. The descriptive information provided by a synthesis deals with either study results or study descriptors. Each of these is considered in turn in the following sections.

8.2.1.1 Descriptive Information About Study Results  Descriptive information about study results is almost universally reported in research synthesis. The central information, of course, is the distribution of effect sizes across studies. As noted earlier, there may be numerous categories of effect sizes; correspondingly, there may be numerous effect size distributions to describe and compare (see chapter 12 and chapter 13, this volume).

A useful illustration of descriptive analysis of effect sizes occurs in a synthesis of sixty-six studies that reported correlations representing the stability of vocational interests across the lifespan (Low et al. 2005). Douglas Low and his colleagues identified and reported descriptive statistics for distributions of effect sizes for eight different age categories ranging from twelve to forty. They also reported effect size means for different birth cohorts ranging from those born before 1930 to those born in the 1980s. These descriptive statistics allow the reader to assess the uniformity of the correlations across age and historical period. Furthermore, the associated numbers of effect sizes and confidence intervals also reported allow the depth and statistical precision of the available evidence for each mean to be appraised.

Description of effect size distributions may be more than descriptive in the merely statistical sense: inferential tests also may be reported regarding such matters as the homogeneity of the effect sizes and the confidence interval for their mean value (see chapter 14, this volume). Reporting other information closely related to effect sizes may also be appropriate. Such supplemental information might include the number of effect sizes per study, the basis for computation or estimation of effect sizes, whether effects were statistically significant, what statistical tests were used in the original studies, and coder confidence in the effect size computation.
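
These summary statistics are straightforward to compute once effect sizes and their sampling variances have been coded. The following is a minimal sketch in Python of the standard inverse-variance formulas for the fixed-effect mean, its confidence interval, and the Q homogeneity statistic; the function name and the effect size data are hypothetical illustrations, not a procedure prescribed by this chapter.

    import numpy as np
    from scipy import stats

    def describe_effect_sizes(d, v):
        """Fixed-effect summary for one category of effect sizes d
        with known sampling variances v (standard inverse-variance
        formulas)."""
        d, v = np.asarray(d, float), np.asarray(v, float)
        w = 1.0 / v                        # inverse-variance weights
        mean = np.sum(w * d) / np.sum(w)   # weighted mean effect size
        se = np.sqrt(1.0 / np.sum(w))      # standard error of the mean
        q = np.sum(w * (d - mean) ** 2)    # Q homogeneity statistic
        k = len(d)
        return {
            "k": k,
            "mean": mean,
            "ci_95": (mean - 1.96 * se, mean + 1.96 * se),
            "Q": q,
            "p_Q": stats.chi2.sf(q, k - 1),  # Q ~ chi-square, k - 1 df
        }

    # Hypothetical standardized mean differences and variances:
    print(describe_effect_sizes([0.30, 0.45, 0.12, 0.52],
                                [0.04, 0.09, 0.05, 0.11]))

Applied separately to each category of effect sizes, a computation like this yields the kind of per-category description and homogeneity test discussed above.
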
8.2.1.2 Study Descriptors  Research synthesists all too often report relatively little information about studies other than their results. It can be quite instructive, however, to provide breakdowns of the coded variables that describe substantive study characteristics, study methods, and extrinsic factors such as publication source. This disclosure accomplishes two things. First, it informs readers about the specific nature of the research that has been chosen for the synthesis so they may judge its comprehensiveness and biases. For example, knowing the proportions of unpublished studies, or studies conducted before a certain date, or studies with a particular type of design might be quite relevant to interpreting the findings of the synthesis. Also, synthesists frequently discover that studies do not adequately report some information that is important to the synthesis. It is informative for readers to know the extent of missing data on various coded variables.

Second, summary statistics for study characteristics provide a general overview of the nature of the research available on the topic addressed. This overview allows readers to ascertain whether researchers have tended to
use restricted designs, measures, samples, or treatments and what the gaps in coverage are. Careful description of the research literature establishes the basis for critique of research practices in a field and helps identify characteristics desirable for future research.

Swaisgood and Shepherdson used this descriptive approach to examine the methodological quality of studies of the effects of environmental enrichment on stereotypic behavior among zoo animals (2005). They provided breakdowns of the characteristics of the studies with regard to sample sizes, observation times, research designs, the specificity of the independent variables, the selection of subject species, and the statistical analyses. This information was then used to assess the adequacy of the available studies and to identify needed improvements in methodology and reporting.

It is evident that much can be learned from careful description of study results and characteristics in research synthesis. Indeed, it can be argued that providing a broad description and appraisal of the nature and quality of the body of research under examination is fundamental to all other analyses that the synthesist might wish to conduct. Proper interpretation of those analyses depends on a clear understanding of the character and limitations of the primary research on which they are based. It follows that conducting and reporting a full descriptive analysis should be routine in research synthesis.

8.2.2 Relationships Among Study Descriptors

The various descriptors that characterize the studies in a synthesis may also have interesting interrelationships among themselves. It is quite unlikely that study characteristics will be randomly and independently distributed over the studies in a given research literature. More likely, they will fall into patterns in which certain characteristics tend to occur together. Many methodological and extrinsic features of studies may be of this sort. We might find, for example, that published studies are more likely to be federally funded and have authors with academic affiliations. Substantive study characteristics may also cluster in interesting ways, such as when certain types of treatments are more frequently applied to certain types of subjects. Analysis of across-study intercorrelations of descriptors that reveals these clusters has not often been explored in research synthesis. Nonetheless, such analysis should be useful both for data reduction (that is, creation of composite moderator variables) and to more fully describe the nature of research practices in the area of inquiry.

Another kind of analysis of relationships among study descriptors focuses on key study features to determine if they are functions of other, temporally or logically prior features. For example, a synthesist might examine whether a particular methodological characteristic, perhaps sample size, is predictable from characteristics of the researcher or the nature of the research setting. Similarly, the synthesist could examine the relationship between the type of subjects in a study and the treatment applied to those subjects.

Analysis of relationships among study descriptors can be illustrated using data from Nana Landenberger and Mark Lipsey's meta-analysis of the effects of cognitive-behavioral therapy for offenders (2005). One important methodological characteristic in this research is whether a study employed random assignment of subjects to experimental conditions. Table 8.1 shows the across-study correlations that proved interesting between method of subject assignment and other variables describing the studies.

Table 8.1 Correlation of Selected Study Characteristics with Method of Subject Assignment (Nonrandom Versus Random) in a Synthesis of Cognitive-Behavioral Therapy for Offenders

                                                  Across-Studies
Variables                                         Correlation (N = 58)

Academic author (no vs. yes)                           .47*
Grant funded (no vs. yes)                              .28*
Publication type (unpublished vs. published)           .26*
Practice versus demonstration program                  .36*
Total sample size                                     -.26*
Manualized program (no vs. yes)                        .03
Brand name program (no vs. yes)                        .11
Implementation monitored (no vs. yes)                  .27*
Mean age of sample                                    -.36*
Sample percent male                                    .08
Sample percent minority                                .20

Source: Author's calculations.
* p < .05
The relationships in table 8.1 show that studies using random assignment were more likely to have authors with an academic affiliation, be grant funded, formally published, and study a demonstration program rather than a real-world practice program. Randomized studies also tended to have smaller and younger samples and to monitor the treatment implementation. On the other hand, they were no more likely to use a manualized or brand name cognitive-behavioral program or to involve male or minority samples.
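
Correlations like those in table 8.1 can be computed directly from a study-level data file. A minimal sketch with hypothetical data and invented variable names follows; with descriptors coded 0/1, the Pearson correlations are point-biserial or phi coefficients.

    import pandas as pd

    # Hypothetical study-level file: one row per study; binary
    # descriptors coded 0/1 (no/yes), continuous descriptors as numbers.
    studies = pd.DataFrame({
        "random_assignment": [1, 0, 1, 0, 1, 0],
        "academic_author":   [1, 0, 1, 0, 1, 1],
        "grant_funded":      [1, 0, 0, 0, 1, 0],
        "total_sample_size": [60, 250, 45, 180, 75, 320],
    })

    # Across-study correlations of each descriptor with assignment
    # method, in the spirit of table 8.1.
    print(studies.corr()["random_assignment"].drop("random_assignment"))
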
A distinctive aspect of analysis of the interrelations of study descriptors is that it does not involve the effect sizes that are central to most other forms of analysis in research synthesis. An important implication of this feature is that the sampling unit for such analyses is the study itself, not the subject within the study. That is, features such as the type of design chosen, nature of treatment applied, publication outlet, and other similar characteristics describe the study, not the subjects, and thus are not influenced by subject-level sampling error. This simplifies much of the statistical analysis when investigating these types of relationships because attention need not be paid to the varying number of subjects represented in different studies.

8.2.3 Relationships Between Effect Sizes and Study Descriptors

The effect sizes that represent study results often show more variability within their respective categories than would be expected on the basis of sampling error alone (see chapters 15 and 16, this volume). A natural and quite common question for the analyst is whether any study descriptors are associated with the magnitude of the effect, that is, whether they are moderators of effect size. It is almost always appropriate for a synthesis to conduct a moderator analysis to identify at least some of the circumstances under which effect sizes are large and small. Moderator analysis is essentially correlational, examining the covariation of selected study descriptors and effect size, though it can be conducted in various statistical formats. In such analysis, the study effect sizes become the dependent variables and the study descriptors become the independent or predictor variables.
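
In its regression format, such a moderator analysis is a weighted least squares problem with inverse-variance weights. The sketch below illustrates a fixed-effect version with hypothetical data; a random-effects analysis would add a between-study variance component to each sampling variance (see chapters 15 and 16, this volume).

    import numpy as np

    def fixed_effect_meta_regression(d, v, X):
        """Regress effect sizes d (sampling variances v) on study
        descriptors X by inverse-variance weighted least squares."""
        d, v = np.asarray(d, float), np.asarray(v, float)
        X = np.column_stack([np.ones(len(d)), np.asarray(X, float)])
        W = np.diag(1.0 / v)                       # weight matrix
        XtWX = X.T @ W @ X
        beta = np.linalg.solve(XtWX, X.T @ W @ d)  # coefficients
        se = np.sqrt(np.diag(np.linalg.inv(XtWX))) # fixed-effect SEs
        return beta, se

    # Hypothetical data: does a binary moderator predict effect size?
    beta, se = fixed_effect_meta_regression(
        d=[0.30, 0.45, 0.12, 0.52], v=[0.04, 0.09, 0.05, 0.11],
        X=[0, 1, 0, 1])
    print(beta, se)  # intercept and moderator coefficient, with SEs
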
The three broad categories of study descriptors identified earlier (those characterizing substantive aspects of the phenomenon under investigation, those describing study methods and procedures, and those dealing with extrinsic matters of research circumstances, reporting, coding, and the like) are all potentially related to study effect sizes. The nature and interpretation of relationships involving descriptors from these three categories, however, is quite different.

8.2.3.1 Extrinsic Variables  Extrinsic variables, as defined earlier, are not generally assumed to directly shape actual study results even though they may differentiate studies that yield different effect sizes. In some instances, they may be marker variables that are associated, in turn, with research practices that do exercise direct influence on study results. It is commonly found in synthesis, for example, that published studies have larger effect sizes than unpublished studies (Rothstein, Sutton, and Borenstein 2005). Publication of a study, of course, does not itself inflate the effect sizes, but may reflect the selection criteria and reporting proclivities of the authors, peer reviewers, and editors who decide if and how a study will be published. Analysis of the relationship of extrinsic study variables to effect size, therefore, may reveal interesting aspects of research and publication practices in a field, but is limited in its ability to reveal why those practices are associated with different effect sizes.

A more dramatic example is a case that appears in the synthesis literature in which the gender of the authors was a significant predictor of the results of studies of sex differences in conformity. Alice Eagly and Linda Carli conducted a research synthesis of this literature and found that having a higher percentage of male authors was associated with findings of greater female conformity (1981). One might speculate that this reflects some bias exercised by male researchers, but it is also possible that their research methods were not the same as those of female researchers. Indeed, Betsy Becker demonstrated that male authorship was confounded with characteristics of the outcome measures and the number of confederates in the study. Both of these method variables, in turn, were correlated with study effect size (1986).

8.2.3.2 Method Variables  The category of moderator variables that has to do with research methods and procedures is particularly important and too often underrepresented in synthesis. Experience has shown that variation in study effect sizes is often associated with methodological variation among the studies (Lipsey 2003). One reason these relationships are interesting is that much can be learned from synthesis about the connection between researchers' choice of methods and the results those methods yield. Such knowledge provides a basis for examining research methods to discover which aspects introduce the greatest bias or distortion into study results.

Investigation of method variables in their meta-analysis of the association of the NRG1 gene with schizophrenia, for example, allowed Marcus Munafò and his colleagues
to determine that there was no difference in the findings of studies that used a family-based design and those that used a case-control design (2006). Similarly, Rebecca Guilbault and her colleagues discovered that the results of studies of hindsight bias depended in part on whether respondents were given general information questions or questions about real-world events and whether they were asked to provide an objective or subjective estimate of probability (2004). On measurement issues, Lösel and Schmucker (2005) found that self-report measures showed larger treatment effects on recidivism for sex offenders than measures drawn from criminal records. Findings such as these provide invaluable methodological information to researchers seeking to design valid research in their respective areas of study.

Another reason to be concerned about the relationship between methodological features of studies and effect sizes is that method differences may be correlated with substantive differences. In such confounded situations, a synthesist must be careful not to attribute effect size differences to substantive factors when those factors are also related to methodological factors. For example, in a meta-analysis of interventions for juvenile offenders, Mark Lipsey found that such substantive characteristics as the gender mix of the sample and the amount of treatment provided were not only related to the recidivism effect sizes but were also significantly related to the type of research design (randomized or nonrandomized) used in the study (2003). Such confounding raises a question about whether the differential effects associated with gender and amount of treatment reflect real differences in treatment outcomes or only spurious effects stemming from the correlated methodological differences.

8.2.3.3 Substantive Variables  Once disentangled from method variables, substantive variables are usually of most interest to the synthesist. Determining that effect size is associated with type of subjects, treatments, or settings often has considerable theoretical or practical importance. A case in point is the meta-analysis of the effects of peer-assisted learning on the achievement of elementary school students that Cynthia Rohrbeck and her colleagues conducted (2003). Extensive moderator analysis revealed that the largest effect sizes were associated with study samples of younger, urban, low-income, and minority students. Moreover, effect sizes were larger for interventions that used rewards based on the efforts of all team members, individualized measures of student accomplishments, and provided students with more autonomy. These findings point research and practice toward program configurations that should produce the greatest gains for the participating students.

Where there are multiple classes of effect size variables as well as of substantive moderator variables, quite a range of analyses of interrelations is possible. For instance, investigation can be made of the different correlates of effects on different outcome constructs. Kenneth Deffenbacher, Brian Bornstein, and Steven Penrod used this approach in a synthesis of the effects of prior exposure to mugshots on the ability of eyewitnesses to identify offenders in police lineups (2006). They found that mugshot commitment, that is, identification of someone in a mugshot prior to the lineup, was a substantial moderator of both the effect on the proportion of correct responses and on the proportion of false positives. Another possibility for differentiated analysis is to explore effect size relationships for different subsamples. This option was illustrated in Sandra Wilson, Mark Lipsey, and Haluk Soydan's comparative analysis of the effects of juvenile delinquency programs separately for racial minority and majority offenders (2003).

The opportunity to conduct systematic analysis of the relationships between study descriptors and study results is one of the most attractive aspects of research synthesis. Certain limitations must be kept in mind, however. First, the synthesis can examine only those variables that can be coded from primary studies in essentially complete and accurate form. Second, this aspect of synthesis is a form of observational study, not experimental study. Many of the moderator variables that show relationships with effect size may also be related to each other. Confounded relationships of that sort are inherently ambiguous with regard to which of the variables are the more influential ones or, indeed, whether the influence stems from other unmeasured and unknown variables correlated with those being examined. As noted, this situation is particularly problematic when method variables are confounded with substantive ones. Even without confounded characteristics, the observational nature of synthesis tempers the confidence with which the analyst can describe the causal influence of study characteristics on their findings (for more discussion of the distinction between study-generated and synthesis-generated evidence, see chapter 2, this volume).

8.2.4 Relationships Among Study Results

Study results in research synthesis are reported most often either as an overall mean effect size or as mean
effect sizes for various categories of results. When multiple classes of effect sizes are available, however, there is potential for informative analysis of the patterns of covariation among the effect sizes themselves. This can be especially fruitful when different results from the same study represent different constructs. In that situation, a synthesist can examine whether the magnitude of effects on one construct is associated, across studies, with the magnitude of effects on another construct. Some studies of educational interventions, for instance, may measure effects on both achievement and student attitude toward instruction. Under such circumstances, a synthesist might wish to ask whether the intervention effects on achievement covary with the effects on attitude. That is, if an intervention increased the achievement scores for a sample of students, did it also improve their attitude scores? The across-studies correlation, though potentially interesting, does not necessarily imply a within-study relationship. Thus, achievement and attitude might covary across studies, but not among the students in the samples within the studies.

Another example comes from the analysis Stacy Najaka, Denise Gottfredson, and David Wilson reported on the relationship between the effects of intervention on youth risk factors and the problem behaviors with which they are presumed to be associated (2001). Problem behaviors (delinquency, alcohol and drug use, dropout and truancy, aggressive and rebellious behavior) were the major outcome variables of interest in their meta-analysis of school-based prevention programs. Many studies, however, also reported intervention effects on academic performance, school bonding (liking school, educational aspirations), and social competency skills. These characteristics are known to be predictive risk factors for problem behaviors. The authors thus reasoned that treatment-induced reductions on these factors should be associated with reductions in problem behaviors; that is, changes in risk should mediate changes in problem behavior. In some of the studies in their synthesis, outcomes were measured on one of these risk factors and also on problem behaviors. This made it possible to ask whether the effect sizes for the risk factor were correlated with the effect sizes for problem behaviors. That is, if intervention produced an increase in academic performance, school bonding, or social skills, was there a corresponding decrease in problem behavior?

Najaka and her colleagues found that the across-study correlations between the effect sizes for the risk factors and those for problem behaviors were in the expected direction (2001). In particular, those correlations showed that increased school bonding was strongly associated with decreased problem behaviors (r = .86), with academic performance associated to a lesser extent (r = .33) and social competency skills even smaller (r = .12). Moreover, those relationships were relatively unchanged when various potentially relevant between-study differences were introduced as control variables in multiple regression analyses. By examining the across-study covariation of effects on different outcome constructs, therefore, we obtain some insight into the patterning of the change brought about by intervention and the plausibility that some changes mediate others (Shadish 1996).

An even more ambitious form of synthesis may be used to construct a correlation (or covariance) matrix that represents the interrelations among many different variables in a research literature. Each study in such a synthesis contributes one or more effect sizes representing the correlation between two of the variables of interest. The resulting synthetic correlation matrix can then be used to test multivariate path or structural equation models. An example of such an analysis is a synthesis on the relationship between workers' perceptions of the organizational climate in their workplace and such work outcomes as employee satisfaction, psychological well-being, motivation, and performance (Parker et al. 2003). Christopher Parker and colleagues found 121 independent samples in ninety-four studies that reported the correlations among at least some of the variables relevant to their question. Those correlational effect sizes were synthesized into a single correlation matrix and corrected for measurement unreliability. Their final matrix contained eleven correlations, six relating to employee perceptions of the organizational climate and five relating to work outcomes, with each synthesized from between two and forty-seven study estimates. This matrix, then, became the basis for a path analysis testing their model. The final model showed not only relatively strong relationships between organizational climate variables and worker outcomes, but also that the relationships between climate and employee motivation and performance were fully mediated by employees' work attitudes.
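
Each cell of such a synthetic correlation matrix is itself a small meta-analysis. A common approach, sketched here with hypothetical data (and not necessarily the exact procedure Parker and colleagues used), pools the available study correlations for a pair of variables through the Fisher z transform.

    import numpy as np

    def pool_correlations(rs, ns):
        """Pool study correlations for one cell of a synthetic
        correlation matrix: Fisher z transform, weight by n - 3,
        back-transform."""
        rs, ns = np.asarray(rs, float), np.asarray(ns, float)
        z = np.arctanh(rs)    # Fisher z transform of each r
        w = ns - 3.0          # inverse of var(z) = 1 / (n - 3)
        return np.tanh(np.sum(w * z) / np.sum(w))

    # Hypothetical studies reporting a climate-satisfaction correlation:
    print(pool_correlations([0.42, 0.35, 0.51], [120, 85, 210]))
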
A potential problem in analyzing multiple categories of study results is that all categories of effects will not be reported by all studies. Thus, each synthesized effect size examining a relationship will most likely draw on a different subset of studies, creating uncertainty about their generality and comparability (for more on this and related issues having to do with analyzing synthesized
correlation matrices, see chapter 20, this volume; Becker 2000).

8.3 CONCLUSION

Research synthesis can be thought of as a form of survey research in which the subjects interviewed are not people but research reports. The synthesist prepares a questionnaire of items of interest, collects a sample of research reports, interacts with those reports to determine the appropriate coding on each item, and analyzes the resulting data. The kinds of questions that can be addressed by synthesis of a given research literature are thus determined by the variables that the synthesist is able to code and the kinds of analyses that are possible given the nature of the resulting data.

This chapter has examined the broad types of variables potentially available from research studies, the kinds of questions that might be addressed using those variables, and the general forms of analysis that investigate those questions. Its goal was to provide an overview that might help the prospective research synthesist select appropriate variables for coding and plan a probing and interesting analysis of the resulting database. Contemporary research synthesis often neglects important variables and analysis opportunities. As a consequence, we learn less from such work than we might. Even worse, what we learn may be erroneous if confounds and alternative analysis models have been insufficiently probed. Current practice has only begun to explore the potential breadth and depth of knowledge that research synthesis might yield.

8.4 REFERENCES

Becker, Betsy J. 1986. "Influence Again: An Examination of Reviews and Studies of Gender Differences in Social Influence." In The Psychology of Gender: Advances Through Meta-Analysis, edited by Janet S. Hyde and Marcia C. Linn. Baltimore, Md.: Johns Hopkins University Press.

Becker, Betsy J. 2000. "Multivariate Meta-Analysis." In Handbook of Applied Multivariate Statistics and Mathematical Modeling, edited by Howard E. A. Tinsley and Steven D. Brown. New York: Academic Press.

Deffenbacher, Kenneth A., Brian H. Bornstein, and Steven D. Penrod. 2006. "Mugshot Exposure Effects: Retroactive Interference, Mugshot Commitment, Source Confusion, and Unconscious Transference." Law and Human Behavior 30(3): 287–307.

Eagly, Alice H., and Linda L. Carli. 1981. "Sex of Researchers and Sex-Typed Communications as Determinants of Sex Differences in Influenceability: A Meta-Analysis of Social Influence Studies." Psychological Bulletin 90(1): 1–20.

Guilbault, Rebecca L., Fred B. Bryant, Jennifer H. Brockway, and Emil J. Posavac. 2004. "A Meta-Analysis of Research on Hindsight Bias." Basic and Applied Social Psychology 26(2–3): 103–17.

Landenberger, Nana A., and Mark W. Lipsey. 2005. "The Positive Effects of Cognitive-Behavioral Programs for Offenders: A Meta-Analysis of Factors Associated with Effective Treatment." Journal of Experimental Criminology 1(4): 451–76.

Lipsey, Mark W. 2003. "Those Confounded Moderators in Meta-Analysis: Good, Bad, and Ugly." Annals of the American Academy of Political and Social Science 587(May): 69–81.

Lösel, Friedrich, and Martin Schmucker. 2005. "The Effectiveness of Treatment for Sexual Offenders: A Comprehensive Meta-Analysis." Journal of Experimental Criminology 1: 117–46.

Low, K. S. Douglas, Mijung Yoon, Brent W. Roberts, and James Rounds. 2005. "The Stability of Vocational Interests from Early Adolescence to Middle Adulthood: A Quantitative Review of Longitudinal Studies." Psychological Bulletin 131(5): 713–37.

Martin, Jose Luis R., Manuel J. Barbanoj, Thomas E. Schlaepfer, Elinor Thompson, Victor Perez, and Jaime Kulisevsky. 2003. "Repetitive Transcranial Magnetic Stimulation for the Treatment of Depression: Systematic Review and Meta-analysis." British Journal of Psychiatry 182(6): 480–91.

Munafò, Marcus R., Dawn L. Thiselton, Taane G. Clark, and Jonathan Flint. 2006. "Association of the NRG1 Gene and Schizophrenia: A Meta-Analysis." Molecular Psychiatry 11(6): 539–46.

Najaka, Stacy S., Denise C. Gottfredson, and David B. Wilson. 2001. "A Meta-Analytic Inquiry into the Relationship Between Selected Risk Factors and Problem Behavior." Prevention Science 2(4): 257–71.

Netz, Yael, Meng-Jia Wu, Betsy J. Becker, and Gershon Tenenbaum. 2005. "Physical Activity and Psychological Well-Being in Advanced Age: A Meta-Analysis of Intervention Studies." Psychology and Aging 20(2): 272–84.

Parker, Christopher, Boris N. Baltes, Scott A. Young, Joseph W. Huff, Robert A. Altmann, Heather A. Lacost, and Joanne E. Roberts. 2003. "Relationships Between Psychological Climate Perceptions and Work Outcomes: A Meta-Analytic Review." Journal of Organizational Behavior 24(4): 389–416.
Rohrbeck, Cynthia A., Marika D. Ginsburg-Block, John W. Fantuzzo, and Traci R. Miller. 2003. "Peer-Assisted Learning Interventions with Elementary School Students: A Meta-Analytic Review." Journal of Educational Psychology 95(2): 240–57.

Rothstein, Hannah R., Alexander J. Sutton, and Michael Borenstein, eds. 2005. Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments. New York: John Wiley & Sons.

Shadish, William R. 1996. "Meta-Analysis and the Exploration of Causal Mediating Processes: A Primer of Examples, Methods, and Issues." Psychological Methods 1(1): 47–65.

Smith, Mary L., and Gene V. Glass. 1977. "Meta-Analysis of Psychotherapy Outcome Studies." American Psychologist 32(9): 752–60.

Swaisgood, Ronald R., and David J. Shepherdson. 2005. "Scientific Approaches to Enrichment and Stereotypes in Zoo Animals: What's Been Done and Where Should We Go Next?" Zoo Biology 24(6): 499–518.

Wilson, Sandra J., Mark W. Lipsey, and Haluk Soydan. 2003. "Are Mainstream Programs for Juvenile Delinquency Less Effective with Minority Youth than Majority Youth? A Meta-Analysis of Outcomes Research." Research on Social Work Practice 13(1): 3–26.
9
SYSTEMATIC CODING

DAVID B. WILSON
George Mason University

CONTENTS
9.1 Introduction 160
9.2 Importance of Transparency and Replicability 160
9.3 Coding Eligibility Criteria 161
9.3.1 Study Features to be Explicitly Defined 161
9.3.1.1 Defining Features 161
9.3.1.2 Eligible Designs and Required Methods 162
9.3.1.3 Key Sample Features 162
9.3.1.4 Required Statistical Data 162
9.3.1.5 Geographical and Linguistic Restrictions 162
9.3.1.6 Time Frame 162
9.3.2 Refining Eligibility Criteria 163
9.4 Developing a Coding Protocol 163
9.4.1 Types of Information to Code 164
9.4.1.1 Report Identification 164
9.4.1.2 Study Setting 164
9.4.1.3 Participants 164
9.4.1.4 Methodology 165
9.4.1.5 Treatment or Experimental Manipulation(s) 165
9.4.1.6 Dependent Measures 165
9.4.1.7 Effect Sizes 166
9.4.1.8 Confidence Ratings 166
9.4.2 Iterative Refinement 166
9.4.3 Structure of the Data 168
9.4.3.1 Flat File Approach 168
9.4.3.2 Hierarchical or Relational File Approach 168
9.4.4 Coding Forms and Coding Manual 170

9.5 Coding Mechanics 171
9.5.1 Paper Forms 171
9.5.2 Coding Directly into a Database 171
9.6 Disadvantages of a Spreadsheet 172
9.7 Training Coders 173
9.7.1 Practice Coding 173
9.7.2 Regular Meetings 173
9.7.3 Using Specialized Coders 174
9.7.4 Assessing Coding Reliability 174
9.7.5 Masking Coders 174
9.8 Common Mistakes 175
9.9 Conclusion 175
9.10 References 175

9.1 INTRODUCTION

A research synthesist has collected a set of studies that address a similar research question and wishes to code the studies to create a dataset suitable for meta-analysis. This task is analogous to interviewing, but a study rather than a person is interviewed. Interrogating might be a more apt description, though it is usually the coder who loses sleep because of the study and not the other way around. The goal of this chapter is to provide synthesists with practical advice on designing and developing a coding protocol suitable to this task. A coding protocol is both the coding forms, whether paper or computerized, and the coding manual providing instructions on how to apply coding form items to studies.

A well-designed coding protocol will describe the characteristics of the studies included in the research synthesis and will capture the pertinent findings in a fashion suitable for comparison and synthesis using meta-analysis. The descriptive aspect typically focuses on ways in which the studies may differ from one another. For example, the research context may vary across studies. In addition, studies may use different operationalizations of one or more of the critical constructs. Capturing variations in settings, participants, methodology, experimental manipulations, and dependent measures is an important objective of the coding protocol, not only for careful description but also for use in the analysis to explain variation in study findings (see chapters 7 and 8, this volume). More fundamentally, the coding protocol serves as the foundation for the transparency and replicability of the synthesis.

9.2 IMPORTANCE OF TRANSPARENCY AND REPLICABILITY

A bedrock of the scientific method is replication. Research synthesis exploits this feature by combining results across a collection of studies that are either pure replications of one another or conceptual replications (that is, studies examining the same basic empirical relationship, albeit with different operationalizations of key constructs and variation in method). A detailed methods section that is part of a research write-up allows others to critically assess the procedures used (transparency) and to conduct an identical or similar study to establish the veracity of the original findings (replicability). Research synthesis should in turn have these features, allowing others to duplicate the synthesis if they question the findings or wish to augment it with additional studies. This should not be interpreted to imply that research synthesis is merely the technical application of a set of procedures. The numerous decisions made in the design and conduct of a research synthesis require thoughtful scholarship. A high-quality research synthesis, therefore, explicates these decisions clearly enough to allow others to fully understand the basis for the findings and to be able to replicate the synthesis.

The coding protocol synthesists develop is a critical aspect of this transparency and replicability. Beyond detailing how studies will be characterized, it documents the procedures used to extract information from studies and gives guidance on how to translate the narrative information of a research report into a structured and (typically) quantitative form.
Coding studies is challenging and requires subjective decisions. A decision that may on the surface seem straightforward, such as whether a study used random assignment to experimental conditions, often requires judgments based on incomplete information in the written reports. The goal of a good coding protocol is to provide guidance to coders on how to handle these ambiguities. Appropriately skilled and adequately trained coders should be able to independently code a study and produce essentially the same coded data. Transparency and replicability are absent if only the author of the coding protocol can divine its nuances. In the absence of explicit coding criteria, a researcher who claims to know a good study when he or she sees one, or to know a good measure of leadership style by looking at the items, or to determine whether an intervention was implemented well by reading the report may be a good scholar, but is not engaged in scientific research synthesis. Because it is impossible to know what criteria were invoked in making these judgments, the decisions lack both transparency and replicability. The expertise of an experienced scholar should be reflected in the coding protocol (the specific coding items and decision rules for how to apply those items to specific studies) and not in opaque coding that others cannot scrutinize.

9.3 CODING ELIGIBILITY CRITERIA

The studies a synthesist identifies through the search and retrieval process should be coded against explicit eligibility criteria, though this is not technically part of the coding protocol. Assessing each retrieved study against the eligibility criteria and recording this information on a screening form or into an eligibility database is an important record of the nature of studies excluded from a research synthesis. With this information, the synthesist can report on the number of studies excluded and the reasons for the exclusions. These data not only serve an internal audit function, but are also useful when a synthesist is responding to a colleague about why a particular study was not included in the synthesis.
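
Almost any tabular structure can serve as such an eligibility database. A minimal sketch of one hypothetical screening record and the exclusion tally it supports follows; all study identifiers, citations, and reasons are invented for illustration.

    from collections import Counter
    from dataclasses import dataclass

    @dataclass
    class ScreeningRecord:
        """One row of a hypothetical eligibility database."""
        study_id: str
        citation: str
        eligible: bool
        failed_criterion: str = ""  # e.g., "design", "sample", "time frame"
        notes: str = ""

    screening_log = [
        ScreeningRecord("S001", "Smith (2001)", True),
        ScreeningRecord("S002", "Jones (1998)", False,
                        failed_criterion="design",
                        notes="No comparison group."),
        ScreeningRecord("S003", "Lee (2003)", False,
                        failed_criterion="sample",
                        notes="Adult offenders only."),
    ]

    # Tallying reasons for exclusion supports the audit function above.
    print(Counter(r.failed_criterion
                  for r in screening_log if not r.eligible))
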
The eligibility criteria should flow naturally from the research question or purpose of the synthesis (see chapter 2, this volume). Your research question may be narrow or broad, and this will be reflected in the eligibility criteria. For example, a research synthesis might focus narrowly on social psychological laboratory studies of the relationship between alcohol and aggressive behavior that use the competitive reaction time paradigm or on the effectiveness of a specific drug relative to a placebo for a clearly specified patient population. At the other end of the continuum would be research syntheses focused on broad research questions, such as the effect of class size on achievement or of juvenile delinquency interventions on criminal behavior. In each case, the eligibility criteria should clearly elucidate the boundaries for the synthesis, indicating the characteristics studies must have to be included and those that would exclude them.

9.3.1 Study Features to Be Explicitly Defined

Eligibility criteria should address several issues, including the defining feature of the studies of interest, the eligible research designs, any sample restrictions, required statistical data, geographical and linguistic restrictions, and any time frame restriction. I briefly elaborate on each of these issues.

9.3.1.1 Defining Features  Research questions typically address a specific empirical relationship or empirical finding. This relationship or finding is the defining feature of the studies included in the synthesis. Eligibility criteria should clearly elaborate on the nature of the empirical finding that defines the parameters for study inclusion. For example, a meta-analysis of the effectiveness of an intervention, such as cognitive-behavioral therapy for childhood anxiety, needs to clearly delineate the characteristics that an intervention must have to be considered cognitive-behavioral. Similarly, a meta-analysis of group comparisons, such as the relative math achievement of boys and girls, needs to define the parameters of the groups being compared and the dependent variables of interest. Earlier examples focused on an empirical relationship, such as the effect of an intervention or social policy on one or more outcomes or the correlation between an employment test and job performance. In some cases, a single statistical parameter may be of interest, such as the prevalence rate of alcohol or illegal drugs in homicide victims, though meta-analyses of this type are less common.

Specifying the defining features of the eligible studies involves specifying both the nature of an independent variable, such as an intervention or employment test, and one or more dependent variables. The level of detail needed to adequately define these will vary across syntheses. The more focused your synthesis, the easier it will be to clearly specify the construct boundaries. Broad syntheses will require more information to establish the guidelines for determining which studies to include and
which to exclude. In some research areas, it is easier to specify the essential features of the independent and dependent variables as separate criteria.

9.3.1.2 Eligible Designs and Required Methods  What is the basic research design that the studies have in common? If multiple design types are allowable, what are they? A research synthesis of intervention studies may be restricted to experimental (randomized) designs or may include quasi-experimental comparison group designs. Similarly, a synthesis of correlational studies may be restricted to cross-sectional designs, or only longitudinal designs, or it may include both types. Some research domains have a plethora of designs examining the same empirical relationship. The eligibility criteria need to specify which of these designs will be included and which will not, and justify both inclusion and exclusion.

It is easy to argue that only the most rigorous research designs should be included in a synthesis. There are trade-offs, however, between rigor and inclusiveness, and these trade-offs need to be evaluated within the context of the research question and the broader goals of the synthesis. For example, external validity may be reduced by including only well-controlled experimental evaluations of a social policy intervention, as the level of control needed to mount such a study may change the nature of the intervention and the context within which the intervention occurs. In this situation, quasi-experimental studies may provide valuable information about the likely effects of the social policy in more natural settings.

9.3.1.3 Key Sample Features  This criterion addresses the critical features of the sample or study participants. For example, are only studies involving humans eligible? In conducting a research synthesis of the effects of alcohol on aggressive behavior, Mark Lipsey and his colleagues found many studies involving fish, intoxicated fighting fish to be more precise (1997). Most syntheses of social science studies will not need to specify that they are restricted to studies of humans, but may well need to specify which age groups are eligible and other characteristics of the sample, such as diagnostic status (high on depression, low achievement in math, involved in drug use, adjudicated delinquent) and gender. A decision rule may need to be developed for handling studies with samples that include a small number of participants outside the target sample. For example, studies of juvenile delinquency programs are not always clearly restricted to youths age twelve to eighteen. Some studies include a small number of nineteen- and twenty-year-olds. The sample criterion should specify how these ambiguous cases are to be handled. The researchers could decide to exclude these studies, to include them only if the results for youth age twelve to eighteen are presented separately, or to include them if the proportion of nineteen- and twenty-year-olds does not exceed some threshold value.

9.3.1.4 Required Statistical Data  A frustrating aspect of conducting a meta-analysis is missing data. It is common to find an otherwise eligible study that does not provide the statistical data needed for computing an effect size. With some work it is often possible to reconfigure the data provided to compute an effect size, or at least a credible estimate. However, there are times when the necessary statistical data are simply not provided.

Eligibility criteria must indicate how to handle studies with inadequate statistical data. There are several possibilities: excluding the study from the synthesis, including it (either in the main analysis or as a sensitivity analysis; see chapter 22, this volume) with some imputed effect size value, or including it in the descriptive aspects of the synthesis but excluding it from the effect size analyses. The advantages and disadvantages of the different options are beyond the scope of this chapter, but missing data are dealt with elsewhere in this volume (see chapter 21).

9.3.1.5 Geographical and Linguistic Restrictions  It is important to specify any geographical or linguistic restrictions for the synthesis. For example, will only English-language publications be considered? Does the locale for the study matter? Juvenile delinquency is a universal problem: there are juvenile delinquents in every country and region of the world. However, the meaning of delinquency and the societal response to it are culturally embedded. As such, a research synthesis of juvenile delinquency programs must consider the relevance of studies conducted outside the synthesist's cultural context. Any geographical or linguistic restriction should be based on theoretical and substantive issues. Many non-English journals provide an English abstract, enabling the identification of potentially relevant studies in these journals by those who are monolingual. Free online translation programs (such as www.freetranslation.com) may also be useful, at least for assessing eligibility.

9.3.1.6 Time Frame  It is important that a synthesist specify the time frame for eligible studies. Furthermore, the basis for this time frame should be theoretical rather than arbitrary. Issues to consider are whether studies before a particular date generalize to the current context and whether constructs with similar names have shifted in meaning sufficiently over time to reduce comparability. Similarly, social problems may evolve in ways that
reduce the meaningfulness of older research to the present. In some research domains, the decision regarding a time frame restriction is easy: research on the topic began at a clearly defined historical point and only studies from that point forward are worth considering. In other research domains, a more theory-based argument may need to be put forth to justify a somewhat arbitrary cut point.

An important issue to consider is whether the year of publication or the year the data were collected is the critical issue. For some research domains, the two can be quite different. Unfortunately, it is not always possible to determine when data were collected, necessitating a decision rule for these cases.

I have often seen published meta-analyses that restrict the time frame to studies published after another meta-analysis or widely cited synthesis. The logic seems to be to establish what the newer literature says on the topic. This makes sense only if the research in this area changed in some meaningful way following the publication of the synthesis. The goal should be to exclude only studies based on some meaningful substantive or methodological basis, not an arbitrary one such as the date of an earlier synthesis.

Note that publication type has been excluded as an eligibility dimension in need of specification. A quality research synthesis will include all studies meeting the explicit eligibility criteria independent of publication status. For a complete discussion of the publication selection bias issue, see chapters 6 and 23, this volume.

9.3.2 Refining Eligibility Criteria

It can be difficult to anticipate all essential dimensions that must be clearly specified in the eligibility criteria. I have had the experience of evaluating a study for inclusion and finding that it clearly does not fit with the literature being examined yet it meets the eligibility criteria. This generally arises if the criteria are not specific enough in defining the essential features of the studies. In these cases, the criteria can be modified and reapplied to studies already evaluated. It is important, however, that any refinement not be motivated by a desire to include or exclude a particular study based on its findings. It is for this reason that the Cochrane Collaboration, an organization devoted to the production of systematic reviews in health care (see www.cochrane.org for more information), requires that eligibility criteria be specified beforehand and not modified once study selection has begun.

9.4 DEVELOPING A CODING PROTOCOL

Developing a good coding protocol is much like developing a good survey questionnaire: it requires a clear delineation of what you wish to measure and a willingness to revise and modify initial drafts. Before beginning to write a coding protocol, the synthesist should list the constructs or study characteristics that are important to measure, much as one would when constructing a questionnaire. Many of these constructs may require multiple items. An overly general item is difficult to code reliably, as the generality allows each coder to interpret the item differently. For example, asking coders to evaluate the internal validity of a study on a five-point scale is a more difficult coding task than asking a series of specific questions related to internal validity, such as the nature of assignment to conditions, the extent of differential attrition, the use of statistical adjustments to improve baseline equivalence, and the like. One can, of course, create composite scales from the individually coded items (but for cautions regarding the use of quality scales, see chapter 7, this volume). After delineating the study characteristics of interest, a synthesist should draft items for the coding protocol and circulate it to other research team members or research colleagues, particularly those with substantive knowledge of the research domain being synthesized, for feedback. The protocol should be refined based on this feedback.

Sharon Brown, Sandra Upchurch, and Gayle Acton proposed an alternative approach that derives the coding categories from the studies themselves (2003). Their method begins by listing the characteristics of the studies in a matrix reflecting the general dimensions of interest, such as participant type, variations in the treatment, and the like. This information is then examined to determine how best to categorize studies on these critical dimensions. That is, rather than start with a priori notions of what to code based on the synthesist's ideas about the literature, Brown and colleagues recommended reading through the studies and extracting descriptive information about the essential characteristics of interest. The actual items and categories of the coding protocol are then based on this information. This approach may be particularly helpful in developing items designed to describe the more complex aspects of studies, such as treatment variations and operationalizations of dependent measures. This approach is also better suited to smaller syntheses.

It is good to think of a coding protocol in much the same way as any measure developed for a primary research
study. Ideally, it will be valid and reliable; that is, it will measure what it is intended to measure and do so consistently across studies. Validity and reliability can be enhanced by developing coding items that require as little inference as possible on the part of the coders, yet allow coders to record the requested information in a way that corresponds to how it is typically described in the literature. The amount of inference that a coder needs to make can be reduced by breaking apart the coding of complex study characteristics into a series of focused and simpler decisions. For closed-ended coding items (that is, items with a fixed set of options), the correspondence between the items and the manner in which they are described in the literature can be increased by developing the items from a sample of studies. An open-ended question can also be used. With this approach, coders record the relevant information as it appears in the study report and specific coding items or categories are created from an examination of these open-ended responses.
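
A closed-ended item thus pairs a specific question with a fixed option set and a decision rule from the coding manual. The items sketched below are hypothetical, loosely following the internal-validity examples given earlier.

    # Hypothetical closed-ended items for a coding protocol; each pairs
    # a question with a fixed option set and a decision rule.
    CODING_ITEMS = {
        "assignment_method": {
            "question": "How were participants assigned to conditions?",
            "options": {1: "random",
                        2: "quasi-random (e.g., alternation)",
                        3: "nonrandom",
                        9: "cannot tell"},
            "rule": "Code 9 rather than guessing when the report is "
                    "ambiguous; flag the study for discussion.",
        },
        "differential_attrition": {
            "question": "Attrition difference between conditions "
                        "(percentage points)",
            "options": "numeric; 999 = not reported",
            "rule": "Compute from the n's reported at assignment and "
                    "at posttest for each condition.",
        },
    }
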
It is important to keep in mind the purpose that each coded construct serves as part of the research synthesis, because it is possible to code too many items. Some constructs may be important to code simply for their descriptive value, providing a basis for effectively summarizing the characteristics of the collection of studies. Other constructs are primarily coded for use as moderators in effect size analyses. The number of possible moderator analyses is often limited unless the meta-analysis has a large number of studies. Thus it is wise to consider carefully the potential value of each item you plan to code. You can always return to the studies and code an additional item if it becomes necessary.

Although the specific items included in the coding protocol of each research synthesis vary, there are several general categories that most address: report identification, study setting, participants, methodology, treatment or experimental manipulation, dependent measures, and effect size data. I briefly discuss the types of information you might consider for each of these categories.

9.4.1 Types of Information to Code

9.4.1.1 Report Identification  Although it is rather mundane, you will want to assign each study an identification code. A complication is that some research studies are reported in multiple manuscripts (such as a journal article and technical report or multiple journal articles), and some written reports present the results from multiple studies or experiments. The primary unit of analysis for a research synthesis will be an independent research study. As such, the synthesist needs to create a system that allows each manuscript and each research study to be tracked. The coding forms should record basic information about the manuscript, such as the authors' names, type of manuscript (journal article, dissertation, technical report, unpublished manuscript, and so on), year of publication, and year or years of data collection.

9.4.1.2 Study Setting  Research is always conducted in a specific context and a synthesist may wish to code aspects of this setting. For example, it may be of value to record the geographic location where the study was conducted, particularly if the synthesis includes studies from different countries or continents. The degree of specificity with regard to geographic location will depend on the research question and the nature of the included studies. Also of interest might be the year or years of data collection. In some research areas, such as intervention studies, the research may vary in the degree of naturalness of the research setting. For example, an experimental test of child psychotherapy may occur entirely in a laboratory, such as the research lab of a clinical psychology professor, and rely on participants recruited through newspaper ads, or be set in a clinic and involve participants seeking mental health services (for an example of a meta-analysis that compared the results from children's psychotherapy studies in the lab versus the mental health clinics, see Weisz, Weiss, and Donenberg 1992). Other potentially interesting contextual issues are the academic affiliation of the researcher (such as disciplinary affiliation, in-house researcher or outside academic, and the like) and the source of funding, if any. Each of these contextual issues may be related in some way to the findings. For example, researchers with a strong vested interest in the success of the program may report more positive findings than other researchers.

9.4.1.3 Participants  Only in a very narrowly focused research synthesis will all of the subject samples be identical on essential characteristics. The coding protocol should therefore include items that address the variations in participant samples that occur across the included studies. A limit on the ability to capture essential subject features is the aggregate nature of the data at the study level and inadequate reporting. For example, age of the study sample is likely to be of interest. Study authors may report this information in a variety of ways, including the age range, the school grade or grades, the mean age, the median age, or they may make no mention of age. The coding protocol needs to be able to accommodate these
variations and have clear guidelines for coders on how to characterize the age of the study sample. For example, a synthesist may develop guidelines that indicate that if the mean age is not reported, it should be estimated from the age range or school grade (such as, a sample of kindergarten students in the United States could be coded as having an average age of five and a half). The specific sample characteristics chosen to code will depend on the research question and the nature of the studies included, but other variables to consider are gender mix, racial mix, risk or severity level, any restrictions placed on subject selection (such as, clinical diagnosis), and motivation for participation in the study (such as, course credit, volunteers, court mandated, paid participants, and so on). It is also important to consider whether summary statistics adequately measure the characteristic of interest and whether they are potentially biased. The mean age for samples of college students, for example, may be skewed by a small number of older students. Knowing how best to capture sample characteristics will depend on substantive knowledge of the research domain being synthesized.
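Such estimation guidelines are easy to make explicit and reproducible by expressing them in code. The short Python sketch below is one illustrative way to encode the convention just described; the function name, the preference ordering, and the assumption that kindergarten corresponds to a mean age of five and a half are choices made for this example rather than part of any published protocol.

def estimate_mean_age(mean_age=None, age_range=None, us_grade=None):
    """Guessing convention for mean age, applied identically across studies.

    Preference order: reported mean age, midpoint of a reported age
    range, then a U.S. grade-based estimate (kindergarten = grade 0,
    assumed mean age 5.5, plus one year per grade).
    """
    if mean_age is not None:
        return mean_age
    if age_range is not None:            # e.g., (9, 12)
        low, high = age_range
        return (low + high) / 2.0
    if us_grade is not None:
        return 5.5 + us_grade
    return None                          # record as "cannot tell"

Writing the rule down this way also documents it for the coding manual, so that every coder applies the same estimate.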
9.4.1.4 Methodology Research studies included in a research synthesis will often vary in certain methodological features. Research synthesists will want to develop coding items to reflect that variability so that they can assess the influence of method variation on observed results. The importance of methodological features in explaining variability in effects across studies has been well established empirically (Lipsey and Wilson 1993; Shadish and Ragsdale 1996; Weisburd, Lum, and Petrosino 2001; Wilson and Lipsey 2001).

The focus should be on the ways in which the studies in the sample differ methodologically. Things to consider coding include the basic research design, nature of assignment to conditions, subject and experimenter blinding, attrition from baseline (both overall and differential), crossovers, dropouts, other changes to assignment, the nature of the control condition, and sampling methods. Each research domain will have unique method features of interest, and as such the features that should be coded will differ. For example, in conducting a meta-analysis of analog studies of the relationship between alcohol consumption and aggressive behavior, it would be important to code the method used to keep the participants unaware of their experimental condition. In other meta-analyses, coding the type of quasi-experimental design or whether the study used a between-subjects or within-subjects design might be essential. It is also critical to code items specifically related to methodological quality. The issues involved in assessing methodological quality are addressed in detail elsewhere in this volume (see chapter 7).

9.4.1.5 Treatment or Experimental Manipulation(s) If studies involve a treatment or experimental manipulation, then the synthesist will want to develop coding items to fully capture any differences between the studies in these features. It is useful to have one or more items that capture the general type or essential nature of the treatment or experimental manipulation in addition to more detailed codes. For example, the studies included in our meta-analysis of experimental studies on the effects of alcohol on aggressive behavior used one of two basic types of experimental manipulations, a competitive-reaction time paradigm or a teacher-learner paradigm (Lipsey et al. 1997). An initial coding item captured this distinction and additional items addressed other nuances within each paradigm. As an additional example, a meta-analysis of school-based drug and delinquency prevention programs first categorized the programs into basic types, such as instructional, cognitive-behavioral or behavioral modeling, and the like (Wilson, Gottfredson, and Najaka 2001). Additional coding items were used to evaluate the presence or absence of specific program elements. Other issues to consider are the dose (length) and intensity of treatment, who delivers the treatment, and the integrity of the treatment or experimental manipulation. In my experience meta-analyzing treatment effectiveness research, integrity of treatment is inadequately reported by study authors. However, it may still be of interest to document what is available and if possible examine whether effect size is related to the integrity of the treatment or other experimental manipulation.

9.4.1.6 Dependent Measures Studies will have one or more dependent measures. Research syntheses of correlational research will have at least two measures (either two dependent variables or a dependent and an independent variable) per study but often many more. Not all measures used in a study will necessarily be relevant to the research question. Therefore, the coding manual needs to clearly elaborate a decision rule for determining which measures to code and which to ignore.

For each dependent measure to be coded, a synthesist will want to develop items that capture the essential construct measured and the specific measure used. Note that different studies may operationalize a common construct differently. For example, the construct of depression may be operationalized as scores on the Hamilton Rating Scale for Depression, the Beck Depression Inventory, or a clinician rating. A coding protocol should be designed
to capture both the characteristics of the construct and its operationalization. It also may include other information about the measure, such as how the data were collected (such as self-report, observer rating, or clinical interview), psychometric properties (such as reliability and validity), whether the measure was a composite scale or based on a single item, and the scale of the measure (that is, dichotomous, discrete ordinal, or continuous). In many situations it is also essential to know the timing of the measurement relative to the experimental manipulation or treatment (such as baseline, three months, six months, twelve months, and so on). For measures of past behavior, the reporting timeframe would also be considered relevant. Participants in drug prevention studies, to take one example, are often asked to self-report their use of drugs over the past week, past month, and past year.

9.4.1.7 Effect Sizes The heart of a meta-analysis is the effect size, because it encodes the essential research findings from the studies of interest. Often, the data reported in a study need to be reconfigured to compute the effect size of interest. For example, a study may only report data separately for boys and girls, but you might be interested in the overall effect across both genders. In such cases, you will need to properly aggregate the reported data to compute your effect size. Similarly, a study may use a complex factorial design, of which you are only interested in a single factor. As such, you may need to compute the marginal means if they are not reported. In treatment effectiveness research, some primary authors will remove treatment dropouts from the analysis. However, you may be interested in an intent-to-treat analysis and therefore may need to compute a weighted mean between the treatment completers and treatment dropouts to obtain the mean for those assigned to the treatment group. The coding protocol cannot anticipate all of the potential statistical manipulations you may need to make. A copy of these computations should be kept for reference, either saved as a computer file, such as a spreadsheet, or on a sheet of paper attached to the completed coding protocol.
will remove treatment dropouts from the analysis. How- in the information coded for key items, such as the data
ever, you may be interested in an intent-to-treat analysis on which the effect size is based. Robert Orwin and David
and therefore may need to compute a weighted mean be- Cordray recommended the use of confidence judgment or
tween the treatment completers and treatment dropouts to ratings (1985). For example, coders could rate on a three-
obtain the mean for those assigned to the treatment group. point scale their confidence in the data on which the ef-
The coding protocol cannot anticipate all of the potential fect size is based, as shown in figure 9.1.
statistical manipulations you may need to make. A copy 9.4.1.9 Iterative Refinement An important compo-
of these computations should be kept for reference, either nent of good questionnaire development is pilot testing.
saved as a computer file, such as a spreadsheet, or on a Similarly, pilot tests of the coding protocol should be
sheet of paper attached to the completed coding protocol. conducted on a sample of studies to identify problem
I have found it helpful to code the data on which the items and lack of fit between the coding categories and
effect size is based as well as on the computed effect size. the characteristics of the studies. Synthesists should not
For example, for the standardized mean difference effect simply start with the studies at the top of the pile or front
size, this would mean recording the means, standard de- section of the filing cabinet. These studies are likely to be
viations or standard errors, and sample sizes for each those that were easy to retrieve or were from the same
condition. Other common data on which this effect size is bibliographic source, such as PsychLit or ERIC. Synthe-
based include the t-test and sample sizes and the exact sists want to ensure that the pilot test of the coding pro-
p-value for the t-test and sample sizes. If the synthesists tocol is conducted on studies that represent the full range
also are coding dichotomous outcomes, they would re- of characteristics of the sample, to the extent possible.
cord the frequencies or proportions of successes or fail- Although there is no need to throw out the data coded as
EFFECT SIZE LEVEL CODING FORM

Code this sheet separately for each eligible effect size.

Identifying Information:

1. Study (document) identifier StudyID |___|___|___|
2. Treatment-Control identifier TxID |___|___|___|
3. Outcome (dependent variable) identifier OutID |___|___|___|
4. Effect size identifier ESID |___|___|___|
5. Coder's initials ESCoder |___|___|___|
6. Date coded ESDate |___ /___ /___|

Effect Size Related Information:

7. Pre-test, post-test, or follow-up (1 = pre-test; 2 = post-test; 3 = follow-up) ES_Type |___|
8. Weeks Post-Treatment Measured (code 999 if cannot tell) ES_Time |___|___|___|
9. Direction of effect (1 = favors treatment; 2 = favors control; 3 = neither) ESDirect |___|

Effect Size Data – All Effect Sizes:

10. Treatment group sample size ES_TxN |___|___|___|___|
11. Control group sample size ES_CgN |___|___|___|___|

Effect Size Data – Continuous Type Measures:

12. Treatment group mean ES_TxM |___|___.___|___|
13. Control group mean ES_CgM |___|___.___|___|
14. Are the above means adjusted (e.g., ANCOVA adjusted)? (1 = yes, 0 = no) ES_MAdj |___|
15. Treatment group standard deviation ES_TxSD |___|___.___|___|
16. Control group standard deviation ES_CgSD |___|___.___|___|
17. t-value ES_t |___.___|___|___|

Effect Size Data – Dichotomous Measures:

18. Treatment group; number of failures (recidivators) ES_TxNf |___|___|___|___|
19. Control group; number of failures (recidivators) ES_CgNf |___|___|___|___|
20. Treatment group; proportion failures ES_TxPf |___.___|___|
21. Control group; proportion failures ES_CgPf |___.___|___|
22. Are the above proportions adjusted for pretest variables? (1 = yes; 0 = no) ES_PAdj |___|
23. Logged odds-ratio ES_LgOdd |___.___|___|___|
24. Standard error of logged odds-ratio ES_SELgO |___.___|___|___|
25. Logged odds-ratio adjusted for covariates? (1 = yes; 0 = no) ES_OAdj |___|

Figure 9.1 Sample Effect-Size Level Coding Form
source: Author's compilation.
Effect Size Data – Hand Calculated:

26. Hand calculated d-type effect size ES_Hand1 |___.___|___|___|
27. Hand calculated standard error of the d-type effect size ES_Hand2 |___.___|___|___|

Effect Size Data Location:

28. Page number where effect size data found ES_Pg |___|___|___|___|

Effect Size Confidence:

29. Confidence in effect size value ES_Conf |___|
1. Highly Estimated – have N and crude p value only (e.g., p < .10), or other limited information
2. Some Estimation – have complex but complete statistics; some uncertainty about precision of effect size or accuracy of information
3. No Estimation – have conventional statistical information and am confident in accuracy of information

Figure 9.1 (Continued)
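The dichotomous-outcome items in the form above carry enough information to compute the logged odds-ratio and its standard error (items 23 and 24) directly from the failure counts and sample sizes (items 10, 11, 18, and 19). A minimal Python sketch of the usual cell-frequency computation follows; the variable names are invented, and the sketch ignores the zero-cell corrections a production version would need.

import math

def log_odds_ratio(n_tx, fail_tx, n_cg, fail_cg):
    a, b = fail_tx, n_tx - fail_tx      # treatment failures, successes
    c, d = fail_cg, n_cg - fail_cg      # control failures, successes
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)
    return log_or, se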

part of the pilot testing, synthesists do need to recode any items that were refined or added as a result of the pilot test.

9.4.3 Structure of the Data

A complexity of meta-analytic data is that it is by definition nested or hierarchical. The coding protocol must allow for this complexity. Two general approaches to handling the nested nature of the data can be adapted to any situation: a flat-file approach and a hierarchical or relational file approach. In the flat-file approach, the synthesists specify or know beforehand the nature and extent of the nesting and repeat the effect size variables the necessary number of times. For example, they will have a set of variables for the first effect size of interest, say construct 1, and another set of variables for the second effect size of interest, say construct 2. This works well for a limited and highly structured nesting. In my experience, however, the number of effect sizes of interest per study is typically not known beforehand. A hierarchical or relational data structure creates separate data tables for each level of the hierarchy. In its simplest form, this would involve a study level data table with one row per study and a related effect size level data table with one row per effect size. A study identifier links the two data tables. I illustrate each approach.

9.4.3.1 Flat File Approach The flat file approach creates a single data table with one row per study and as many columns as variables or coded items. It is suitable for situations in which the synthesists know the maximum number of effect sizes per study that will be coded and can place each effect size in a distinct category. For example, suppose synthesists are interested in meta-analyzing a collection of studies examining a new teaching method. The outcomes of interest are math and verbal achievement as measured by a well-known standardized achievement test. They might develop a coding protocol with one set of effect size items that addresses the math achievement outcome and another set that addresses the verbal achievement outcome. The advantage of this approach is that all the data are contained within one file. Also, with one row per study, by design, only one effect size per study can be included in any single analysis. The disadvantage is that the flat file approach becomes cumbersome as the number of effect sizes per study increases: wide data tables are difficult to work with. Also, any statistical manipulation of effect sizes, such as the application of the small sample size bias correction to standardized mean difference effect sizes, needs to be performed separately for each additional effect size, because each effect size within a study is contained in a different column or variable within the dataset.
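To see the cumbersomeness concretely, consider applying that correction in a flat file. The pandas sketch below assumes hypothetical column names (ES1, TX_N1, and so on, one set per outcome); the correction itself is the usual small-sample adjustment, multiplying d by 1 - 3/(4(n_tx + n_cg - 2) - 1).

import pandas as pd

def unbias(d, n_tx, n_cg):
    # Small-sample bias correction for the standardized mean difference.
    df = n_tx + n_cg - 2
    return d * (1 - 3 / (4 * df - 1))

studies = pd.read_csv("flat_file.csv")    # one row per study (hypothetical)
for i in (1, 2):                          # ES1 = math, ES2 = verbal
    studies[f"ES{i}_adj"] = unbias(studies[f"ES{i}"],
                                   studies[f"TX_N{i}"], studies[f"CG_N{i}"])

With many outcomes per study, the loop grows, and every new statistical manipulation must be repeated across all of the parallel columns.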
9.4.3.2 Hierarchical or Relational File Approach The hierarchical or relational file approach creates separate data tables for each conceptually distinct level of data. Data that can be coded only a single time for a given study, such as the characteristics of the research methods, sampling procedures, study context, and experimental manipulation, are entered into a single study-level data
STUDY_ID  PUBYEAR  MEANAGE  TX_TYPE
001       1992     15.5     2
002       1988     15.4     1
003       2001     14.5     1

STUDY_ID  DV_ID  CONSTRUCT  SCALE
001       01     2          2
001       02     6          1
002       01     2          2
003       01     2          2
003       02     3          1
003       03     6          1

STUDY_ID  ES_ID  DV_ID  FU_MONTHS  TX_N  CG_N  ES
001       01     01     0          44    44    .39
001       02     01     6          42    40    .11
001       03     02     0          44    44    .10
001       04     02     6          42    39    .09
002       01     01     0          30    30    .34
002       02     01     12         30    26    .40
003       01     01     6          52    52    .12
003       02     02     6          51    50    .21
003       03     03     6          52    49    .33

Figure 9.2 Hierarchical Data Structure

source: Author's compilation.
note: Three data tables, one for study level data, one for dependent variable level data, and one for effect size level data. The boxes and arrows show the related data for study 001. The variables are: STUDY_ID = study identifier; PUBYEAR = publication year; MEANAGE = mean age of sample; TX_TYPE = treatment type; DV_ID = dependent variable identifier; CONSTRUCT = construct measured by the dependent variable; SCALE = scale of the dependent variable; ES_ID = effect size identifier; FU_MONTHS = months post-intervention; TX_N = treatment sample size; CG_N = control group sample size; ES = effect size.

table. Data that change with each effect size are entered into a separate data table with one row per effect size. This data structure can accommodate any number of effect sizes per study.

In my experience, additional data tables are often useful: a study level, a dependent variable (measurement) level, and an effect-size level. In many research domains, a single dependent variable may be measured at multiple time points, such as at baseline, posttest, and follow-up. Rather than code the features of the dependent variable multiple times for each effect size, one can separate the dependent variable coding items into separate data tables and link that information to the effect-size data table with a study identifier and a dependent variable identifier. This simplifies coding and eases data cleanup. In a more complex meta-analysis, a separate sample level data table is often required. This allows for the coding of subgroup effects, such as males and females.

This approach is shown in figure 9.2 for a meta-analysis of wilderness challenge programs for juvenile delinquents (Lipsey and Wilson 2001). These studies may each include any number of dependent measures and each may be measured at multiple time points. The coding protocol was divided into three sets of items: those that are at the study level, that is, they do not change for different outcomes; those that describe each dependent variable used in the study; and those that encode each effect size. The effect size data table includes a study
Sample Coding Items from Coding Form

____ ____ . ____ ____ 4. Mean age of sample [MEANAGE]

____ 19. Occur in a wilderness setting? (1 = yes; 0 = no) [WILDNESS]

Sample Coding Items from Coding Manual

4. Mean age of sample. Specify the approximate or exact mean age at the beginning of the intervention. Code the best information available; estimate mean age from grade levels if necessary. If mean age cannot be determined, enter 99.99.

19. Did the program occur in a wilderness setting? Code as yes if the activities took place outdoors, even if the participants were camping in cabins or other buildings. Code as no if the activities took place indoors or used man-made contraptions. (1 = yes; 0 = no)

Figure 9.3 Sample Coding Items

source: Author's compilation.
note: Variable names are shown in brackets. (Extracted from example used in Lipsey and Wilson, 2001, of a meta-analysis of challenge programs for juvenile delinquents.)

identifier (STUDY_ID) and a dependent variable identifier (DV_ID) that link each effect size row with the appropriate data in both the study level and dependent variable level data tables. This approach also works well for meta-analyses of correlational studies. However, rather than each effect size being associated with a single dependent variable, it will be associated with two. In this way, you can efficiently code all correlations in a correlation matrix by first coding the characteristics of each measured variable and then simply indicating which pair of variables is associated with each correlation.

The hierarchical approach has the advantage of flexibility, simplified coding, and easier data cleanup. The disadvantage is the need to manipulate the data tables prior to analysis. The steps involved are merging the individual data tables into a single flat file, selecting a subset of effect sizes for analysis, and confirming that the selection algorithm yields a single effect size per study. Because the effect size is in a single variable (one column of the data table), it is far easier to create a composite effect size within each study to maintain the one effect size per study restriction. For example, you may wish to perform an analysis based on any achievement effect size within a study, averaging multiple achievement effect sizes if they exist. This can easily be accomplished with a hierarchical data structure, as the sketch below illustrates.
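The three steps map directly onto a few relational operations. The pandas sketch below carries them out for the achievement example, assuming hypothetical exports of the three tables in figure 9.2 and, purely for illustration, that construct code 2 flags achievement measures.

import pandas as pd

study = pd.read_csv("study_level.csv")   # one row per study
dv = pd.read_csv("dv_level.csv")         # one row per dependent variable
es = pd.read_csv("es_level.csv")         # one row per effect size

# Step 1: merge the tables into a single flat file.
flat = (es.merge(dv, on=["STUDY_ID", "DV_ID"])
          .merge(study, on="STUDY_ID"))

# Step 2: select the subset of effect sizes for this analysis.
achieve = flat[flat["CONSTRUCT"] == 2]

# Step 3: average within studies so each study contributes one
# (possibly composite) effect size, then confirm independence.
analysis = achieve.groupby("STUDY_ID", as_index=False)["ES"].mean()
assert analysis["STUDY_ID"].is_unique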
9.4.4 Coding Forms and Coding Manual

It is often advantageous to develop both a coding manual and coding forms. The former is a detailed documentation of how to code each item and includes notes about decision rules developed as the synthesis team codes studies. A coding form, however, is designed for efficiently recording coded data. Figure 9.3 provides an example of the contents of a code manual for two items and the corresponding items on the coding form. Note that the coding form is decidedly light on information, providing just enough information for trained coders to know where to record relevant study data.

There are several advantages to this approach. First, the synthesists can annotate the coding manual as the team codes studies and that information does not become trapped in the coding form of a single study. Second, coding is more efficient, because coders only need to consult the manual for cases where they are uncertain as to the correct way to code an item for a given study. Finally, coders can condense the coding forms in a matrix layout, as shown in figure 9.4, for efficient coding of dependent measures and effect size data, or any other data items that may need to be coded multiple times for a single study. Note that in figure 9.4, four effect sizes from a single study can be coded on a single page, simplifying the
Effect Size Data:

1. Study ID |___|___|___|
2. Outcome ID |___|___|___| |___|___|___| |___|___|___| |___|___|___|
3. ES ID |___|___|___| |___|___|___| |___|___|___| |___|___|___|
...
30. Treatment N |___|___|___| |___|___|___| |___|___|___| |___|___|___|
31. Control N |___|___|___| |___|___|___| |___|___|___| |___|___|___|
32. Treatment mean |___|___.___|___| |___|___.___|___| |___|___.___|___| |___|___.___|___|
33. Control mean |___|___.___|___| |___|___.___|___| |___|___.___|___| |___|___.___|___|
34. Adjusted? |___| |___| |___| |___|
35. Treatment SD |___|___.___|___| |___|___.___|___| |___|___.___|___| |___|___.___|___|
36. Control SD |___|___.___|___| |___|___.___|___| |___|___.___|___| |___|___.___|___|
37. t-value |___|___.___|___| |___|___.___|___| |___|___.___|___| |___|___.___|___|

Figure 9.4 Matrix Layout
source: Author's compilation.
note: Layout for a subset of items from an effect-size coding form allowing up to four effect sizes per form.

transfer of data from tables in reports onto the coding form.

9.5 CODING MECHANICS

At a mechanical level, studies can be coded on paper coding forms or directly into the computer. This discussion so far has presumed that paper coding forms are used. I recommend that even if synthesists do develop a database complete with data entry forms, they start with paper coding forms as the template. Paper coding forms are also easier to share with others who request a copy of your coding protocol.

9.5.1 Paper Forms

Paper coding forms, such as the ones shown in figures 9.1 and 9.4, have the advantage of being simple to create and to use. There is also no need to worry about computer backups or database corruption. If the research synthesis involves a small number of studies, such as twenty or fewer, paper coding forms are likely to be a good choice. However, with paper coding forms, coders must enter the data into a computer and manage all the completed paper coding forms. Comparing double-coding for a large synthesis is also more tedious when using paper forms.

When creating paper coding forms, synthesists should always include the variable names that will be used in the statistical software program on the forms. Memory fades quickly and it can be a tedious task reconstructing which variables in a data file corresponded to specific items in a coding protocol. I also recommend that synthesists place the fields for recording the data either down the left or right margin of the coding forms. This has two advantages. First, it allows easy visual inspection of the forms to ensure that each item has been completed. Second, data entry proceeds more smoothly as all data are in a column in either the left or right margin, rather than scattered around the page.
Figure 9.5 Database Coding Form
source: Author's compilation.
9.5.2 Coding Directly into a Database

Databases provide an alternative to paper coding forms and allow direct coding of studies into a computer, eliminating the need to transfer data from paper forms to computer files. Most any modern database program can be used, such as FileMaker Pro and Microsoft Access. There are several advantages to using a database for study coding. First, synthesists can specify allowable values for any given field (coded item), helping to enhance data accuracy. Second, databases save data as coders work, helping reduce the likelihood of data loss. Third, the data in most databases can easily be exported to any number of statistical software programs. Fourth, the synthesist can create data entry forms that look similar to paper coding forms, providing all of the same information about how to code an item that would appear on a paper coding form. A simple example of a coding form in a database program is shown in figure 9.5. Fifth, with a bit more effort, synthesists can have the database compute effect sizes, at least for the more common statistical configurations. Last, data queries can be constructed to assess data consistency, aiding in the process of data cleanup. The disadvantages of using a database are the time, effort, and technical skills needed to create the database and design the data entry forms. For any reasonably large meta-analysis, however, the result is generally well worth the effort.
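The first of these advantages, restricting each field to its allowable values, is available even without a dedicated forms package. The sketch below uses Python's built-in sqlite3 module; the table layout loosely mirrors figure 9.2, and the particular CHECK constraints are illustrative rather than a prescribed schema.

import sqlite3

con = sqlite3.connect("synthesis.db")
con.executescript("""
CREATE TABLE study (
    study_id TEXT PRIMARY KEY,
    pubyear  INTEGER CHECK (pubyear BETWEEN 1900 AND 2010),
    tx_type  INTEGER CHECK (tx_type IN (1, 2))   -- allowable values
);
CREATE TABLE effect_size (
    study_id TEXT REFERENCES study (study_id),
    es_id    TEXT,
    tx_n     INTEGER CHECK (tx_n > 0),
    cg_n     INTEGER CHECK (cg_n > 0),
    es       REAL,
    PRIMARY KEY (study_id, es_id)
);
""")
# An out-of-range entry now raises an error instead of silently
# entering the data file:
# con.execute("INSERT INTO study VALUES ('001', 1992, 9)")  # fails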
9.6 DISADVANTAGES OF A SPREADSHEET

Spreadsheets, such as MS Excel or OpenOffice.org Calc, are extremely useful programs. They are not, however, well suited for systematically coding meta-analytic data. The spreadsheet framework is rows and columns with columns representing different variables and rows representing different studies or effect sizes. This is similar to the typical flat-file data table familiar to anyone accustomed to data analysis, and as such seems a natural fit
with the task at hand. However, coding directly into a spreadsheet is different from basic data entry into a database program, where the data are already neatly organized on a survey form or other data instrument. Even in such a case, data entry using a spreadsheet is typically inefficient compared to using a good text editor to create a simple ASCII data file or using a database program. In coding studies, the order of items on a coding form never corresponds to the order of the information in a research write-up. As such, coders generally bounce around the coding form, recording information as they find it. In a spreadsheet, this means moving back and forth across the columns used for the different coding items, and the number of columns needed typically far exceeds what can be easily displayed on a single screen. Furthermore, the column headings are cryptic, given limited column width, and provide little if any information on how to code an item. For example, it is difficult to display the values associated with a nominal scale in a spreadsheet (such as, 1 = randomized; 2 = quasi-experimental with baseline adjustment; and so on). When using a spreadsheet, it is also easy to inadvertently change rows or columns and overwrite previously coded data.

The disadvantages of the spreadsheet in this context lead me to discourage its use. I recommend that you either use paper coding forms and then enter the data into your favorite statistical or meta-analytic software package (directly or by creating a raw ASCII data file), or that you create a database with data entry forms.

9.7 TRAINING CODERS

Coding studies is a challenging and at times tedious task. Most research syntheses will involve a research team, even if that team consists of only two people. The training of the coders is important to ensure the consistency and accuracy of the information extracted from the studies. Practice coding, regular meetings, coder specialization, and assessment of coder reliability all contribute to coder training and the quality of the final synthesis.

9.7.1 Practice Coding

William Stock recommends a training process that involves eight steps (1994). A central feature of this process is practice coding of a sample of studies, comparing the results among all members of the research team, modifying coding items if necessary, and repeating this process until consistency across coders is achieved. Stock's eight steps are fairly straightforward:

• An overview of the synthesis is given by the principal investigator.
• Each item on a form and its description in the code book is read and discussed.
• The process for using the forms is described. This is the method chosen to organize forms so that the transition from coding to data entry and data management is facilitated, and should take into account studies that include reports for subsamples, or multiple occasions or measures.
• A sample of five-to-ten studies is chosen to test the forms.
• A study is coded by everyone, with each coder recording how long it takes to code each item. These time data are used to estimate how long it will take to code individual items, entire study reports, and the complete synthesis.
• Coded forms are compared and discrepancies are identified and resolved.
• The forms and code book are revised as necessary.
• Another study is coded and so on. Steps four through eight are repeated until apparent consensus is achieved. (134–35)

This process should be adapted to your needs. A small meta-analysis of five to fifteen studies may abbreviate this process by practice coding only two or three studies and then double-coding all studies to ensure coding reliability.

9.7.2 Regular Meetings

During the coding phase of a meta-analysis, coders should meet regularly to discuss coding difficulties. Discussing difficult coding decisions provides an opportunity to develop a normative understanding of the application of coding rules among all coders. Synthesists should also annotate the coding manual during these meetings to ensure that decisions made about how to handle various study ambiguities are not forgotten and similar studies encountered later in the coding process can be handled in a similar fashion.
9.7.3 Using Specialized Coders

Typically, each coder reads through an entire study and codes all the aspects the synthesis requires. This approach presumes that all members of the research team have the same skills and can accurately complete the entire protocol. Colleagues and research assistants working on the meta-analysis may not, however. Someone with exceptional substantive and theoretical knowledge of the relevant research domain, for example, may not be facile with the computation of effect sizes, particularly when it requires heavy manipulation of the statistical data presented. Similarly, a research assistant with strong quantitative abilities may not have the appropriate background training and experience to code substantive issues that require complex judgments. In these situations, it may be advantageous for coders to specialize, each completing a different section of the coding protocol. Coder specialization has the advantage of exploiting coder strengths to improve coding quality. It can also help address the problem of coder drift (that is, a change in the interpretation of coding items over time), given that each coder will proceed more quickly through the studies, focusing on a smaller set of items. Finally, you can more easily keep effect size coders blind to study authorship and institutional affiliation than coders of more general information. The issue of masking specific study features, such as authorship, is dealt with in more detail in section 9.7.5.

9.7.4 Assessing Coding Reliability

The reliability of coding is critical to the quality of the research synthesis. Both intra- and interrater reliability are important. Intrarater reliability is the consistency of a single coder from occasion to occasion. A common problem in a research synthesis is coder drift: As more studies are coded, items are interpreted differently, resulting in coding inconsistencies. This is most likely to affect items that require judgment on the part of coders, such as an assessment of the similarity of groups at baseline. Purely factual items are less likely to suffer from coder drift. You can assess intrarater reliability by having coders recode studies they coded at some earlier point. Items that show low intrarater agreement might need to be recoded or double-coded to improve accuracy.

Interrater reliability is the consistency of coding between different coders. It is assessed by having at least two different members of the synthesis team code a sample of the studies. As such, in a small meta-analysis, all studies are typically double-coded, whereas a random sample of studies is often selected in larger syntheses. Studies should be coded in a different order by the two coders. Once a sample of studies has been double-coded, the two sets of results are compared. Reliability is often computed as the percentage of agreement across the sample of studies. A weakness of this percentage agreement is that it is affected by the base rate, resulting in artificially high agreement rates for items where most studies share a rating. Other reliability statistics, such as kappa or the intraclass correlation, can also be computed and are likely to be more informative (for a fuller discussion, see chapter 10, this volume).
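For a single categorical item, both statistics take only a few lines of code. The Python sketch below compares two coders' ratings using simple percent agreement and Cohen's kappa; the ratings shown are invented.

def percent_agreement(r1, r2):
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    # Chance-corrected agreement for one coded item.
    n = len(r1)
    p_obs = percent_agreement(r1, r2)
    cats = set(r1) | set(r2)
    p_exp = sum((r1.count(c) / n) * (r2.count(c) / n) for c in cats)
    return (p_obs - p_exp) / (1 - p_exp)

coder_a = [1, 1, 2, 1, 3, 1, 1, 2, 1, 1]
coder_b = [1, 1, 2, 2, 3, 1, 1, 2, 1, 2]
print(percent_agreement(coder_a, coder_b))       # 0.80
print(round(cohens_kappa(coder_a, coder_b), 2))  # 0.64

Here the two coders agree 80 percent of the time, but kappa is noticeably lower because much of that agreement would be expected from the base rates alone.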
A complication of determining the reliability of coded items stems from the dependent nature of some questions. How the synthesist answers one question often restricts the options for another. William Yeaton and Paul Wortman recommended that reliability estimates take into account this contingent nature of some items (1993). Ignoring this hierarchy of items may result in underestimating the reliability of these contingent items. They therefore recommended that for contingent items, reliability only be assessed when there is agreement on the higher order or more general item. Although this idea has merit, I am unaware of any meta-analyses that have adopted it.

Synthesists have several options for handling an item that shows low coder agreement. First, they can drop the item from any analysis. They can also rewrite the item, possibly breaking it down into multiple simpler judgments, and then code the new item for all of the studies. Finally, they can double-code that item for all studies and resolve any discrepancies either by the senior researcher or through an examination and discussion of the study among all members of the research team.

9.7.5 Masking Coders

In primary research, it is often advisable to have participants unaware of the conditions of assignment. For example, in a randomized clinical trial of a new drug, it is common for participants to be kept uninformed as to whether they are receiving the new drug or a placebo drug. Similarly, keeping raters unaware of conditions, such as medical personnel evaluating a patient's status, is also often advisable. A study may involve both of these methods (a double-blind study) to protect against both sources of potential bias.

There is an analog in meta-analysis. Might not coders be influenced in making quality judgments or other coding
decisions about a study by knowledge of who conducted the study, the author's institutional affiliation, or the study findings? Well-known research teams might benefit from a positive bias and research teams at obscure or low status institutions may suffer from a negative bias. Alejandro Jadad and his colleagues examined the effect of masking on the ratings of study quality and found that masked assessment was more consistent than open assessment and resulted in lower quality ratings, suggesting that some coders were affected by knowledge of the authors (1996). In another study, the University of Pennsylvania Meta-Analysis Blinding Study Group randomly assigned pairs of coders to one of two conditions: unaware of author and author affiliation information or aware of such information (Berlin 1997). Of interest was the effect of masking on effect size data. The study found no statistically or clinically meaningful difference in computed effect sizes. Masked assessment may thus be beneficial in assessing study quality but does not appear to matter in coding effect size information. The latter is likely to generalize to the coding of other factual information.

At present, there is not enough empirical evidence to provide clear guidance on the value of masking study authorship and affiliation. Unless coders are already familiar with key studies in the research domain of interest, masking may be beneficial and fairly easy to implement. Clearly, more work is needed in this area.

9.8 COMMON MISTAKES

Novice research synthesists tend to make several mistakes in designing their coding protocols beyond the various recommendations I have discussed already. These errors often stem from a failure to fully plan the analysis of the meta-analytic data. The most common and serious is failure to recognize the implications of the hierarchical nature of meta-analytic data. The creation of a large data file with a single row per effect size without creating codes that allow for the selection of independent subsets of effect sizes for analysis leads to an unmanageable data structure. If synthesists are going to code multiple effect sizes per study, they need to think through how they are going to categorize those effect sizes such that only one effect size (or a composite of multiple effect sizes) is selected for any given analysis. If this is not ensured, the dependency in the results yields biased meta-analytic results, such as standard errors that are too small (for a discussion of this issue and a method for analyzing statistically dependent effect sizes, see chapter 19, this volume).
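The safeguard is mechanical rather than statistical: build the selection codes into the effect-size table up front and make the one-effect-size-per-study rule an explicit check rather than a hope. A minimal pandas sketch, with invented column names, follows; analyses that retain the dependent effect sizes themselves are a different matter (again, see chapter 19, this volume).

import pandas as pd

def one_es_per_study(es_table, construct_code):
    # Reduce an effect-size table to one (possibly composite)
    # effect size per study for a single analysis.
    subset = es_table[es_table["CONSTRUCT"] == construct_code]
    result = subset.groupby("STUDY_ID", as_index=False)["ES"].mean()
    assert result["STUDY_ID"].is_unique   # dependency guard
    return result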
Another common error in coding is to underestimate the time, effort, and difficulty of coding studies for meta-analysis. Although it may be possible to code some studies in thirty minutes to an hour, spending up to eight hours coding a single study is not unusual. Dissertations and technical reports are often more time consuming than journal articles. Resolving differences between two versions of a study also adds to the coding burden.

Finally, research synthesists should be able to clearly justify why they are coding specific items. Also, in my experience, individuals new to meta-analysis overestimate the level of detail that will be available in the individual studies. During the development phase of the coding protocol, items for which there is rarely adequate information should be eliminated unless they are of enough theoretical interest that it is important to document that particular reporting inadequacy (for a somewhat different take on this issue, see chapter 8, this volume). Also, the synthesist should keep in mind that their ability to use items for moderator analyses will be limited by the size of the meta-analysis. Complex moderator analyses are simply not possible for small meta-analyses, reducing the utility of a large number of variables coded for that purpose.

9.9 CONCLUSION

The goal of this chapter has been to provide guidance on how to achieve a coding protocol that is both replicable and transparent. No matter how fancy the statistical analyses of effect sizes are or how impressive the forest plots look, the scientific credibility of a research synthesis rests on the ability of others to scrutinize how the data on which the results of the synthesis are based were generated. Science progresses not from an authoritarian proclamation of what is known on a topic but rather from a thoughtful, systematic, and empirical taking of stock of the evidence that is open to debate, discussion, and reanalysis. A detailed coding protocol is a key component of this process.

9.10 REFERENCES

Berlin, Jesse A. 1997. "Does Blinding Readers Affect the Results of Meta-Analysis?" Lancet 350(9072): 185–86.
Brown, Sharon A., Sandra L. Upchurch, and Gayle J. Acton. 2003. "A Framework for Developing a Coding Scheme for Meta-Analysis." Western Journal of Nursing Research 25(2): 205–22.
Jadad, Alejandro R., R. Andrew Moore, Dawn Carroll, Crispin Jenkinson, D. John M. Reynolds, David J. Gavaghan, and Henry J. McQuay. 1996. "Assessing the Quality of Reports of Randomized Clinical Trials: Is Blinding Necessary?" Controlled Clinical Trials 17(1): 1–12.
Lipsey, Mark W., and David B. Wilson. 1993. "The Efficacy of Psychological, Educational, and Behavioral Treatment: Confirmation from Meta-Analysis." American Psychologist 48(12): 1181–1209.
Lipsey, Mark W., and David B. Wilson. 2001. Practical Meta-Analysis. Thousand Oaks, Calif.: Sage.
Lipsey, Mark W., David B. Wilson, Mark A. Cohen, and James H. Derzon. 1997. "Is There a Causal Relationship Between Alcohol Use and Violence? A Synthesis of Evidence." In Recent Developments in Alcoholism, vol. XIII, Alcohol and Violence, edited by Marc Galanter. New York: Plenum.
MacKenzie, Doris L., David B. Wilson, and Susan Kider. 2001. "Effects of Correctional Boot Camps on Offending." Annals of the American Academy of Political and Social Science 578: 126–43.
Orwin, Robert G., and David B. Cordray. 1985. "Effects of Deficient Reporting on Meta-Analysis: A Conceptual Framework and Reanalysis." Psychological Bulletin 97(1): 134–47.
Shadish, William R., and Kevin Ragsdale. 1996. "Random Versus Nonrandom Assignment in Psychotherapy Experiments: Do You Get the Same Answer?" Journal of Consulting and Clinical Psychology 64(6): 1290–1305.
Stock, William A. 1994. "Systematic Coding for Research Synthesis." In The Handbook of Research Synthesis, edited by Harris Cooper and Larry V. Hedges. New York: Russell Sage Foundation.
Weisburd, David, Cynthia Lum, and Anthony Petrosino. 2001. "Does Research Design Affect Study Outcomes in Criminal Justice?" Annals of the American Academy of Social and Political Sciences 578(1): 50–70.
Weisz, John R., B. Weiss, and Geri R. Donenberg. 1992. "The Lab Versus the Clinic: Effects of Child and Adolescent Psychotherapy." American Psychologist 47(12): 1578–85.
Wilson, David B., Denise C. Gottfredson, and Stacy S. Najaka. 2001. "School-Based Prevention of Problem Behaviors: A Meta-Analysis." Journal of Quantitative Criminology 17(3): 247–72.
Wilson, David B., and Mark W. Lipsey. 2001. "The Role of Method in Treatment Effectiveness Research: Evidence from Meta-Analysis." Psychological Methods 6(4): 413–29.
Yeaton, William H., and Paul M. Wortman. 1993. "On the Reliability of Meta-Analytic Reviews: The Role of Intercoder Agreement." Evaluation Review 17(3): 292–309.
10
EVALUATING CODING DECISIONS
ROBERT G. ORWIN AND JACK L. VEVEA
CONTENTS
10.1 Introduction 178
10.2 Sources of Error in Coding Decisions 178
10.2.1 Deficient Reporting in Primary Studies 178
10.2.2 Ambiguities in the Judgment Process 179
10.2.3 Coder Bias 180
10.2.4 Coder Mistakes 181
10.3 Strategies to Reduce Error 181
10.3.1 Contacting Original Investigators 181
10.3.2 Consulting External Literature 182
10.3.3 Training Coders 182
10.3.4 Pilot Testing the Coding Protocol 182
10.3.5 Revising the Coding Protocol 183
10.3.6 Possessing Substantive Expertise 183
10.3.7 Improving Primary Reporting 183
10.3.8 Using Averaged Ratings 183
10.3.9 Using Coder Consensus 184
10.4 Strategies to Control for Error 184
10.4.1 Reliability Assessment 184
10.4.1.1 Rationale 184
10.4.1.2 Across the Board Versus Item-by-Item Agreement 185
10.4.1.3 Hierarchical Approaches 186
10.4.1.4 Specific Indices of Interrater Reliability 186
10.4.1.5 Selecting, Interpreting, and Reporting Interrater Reliability
Indices 191
10.4.1.6 Assessing Coder Drift 192
10.4.2 Confidence Ratings 193
10.4.2.1 Rationale 193
10.4.2.2 Empirical Distinctness from Reliability 194
10.4.2.3 Methods for Assessing Coder Confidence 195

10.4.3 Sensitivity Analysis 196


10.4.3.1 Rationale 196
10.4.3.2 Multiple Ratings of Ambiguous Items 197
10.4.3.3 Multiple Measures of Interrater Agreement 197
10.4.3.4 Isolating Questionable Variables 197
10.5 Suggestions for Further Research 197
10.5.1 Moving from Observation to Explanation 197
10.5.2 Assessing Time and Necessity 199
10.6 Notes 199
10.7 References 200

10.1 INTRODUCTION

Coding is a critical part of research synthesis. It is an attempt to reduce a complex, messy, context-laden, and quantification-resistant reality to a matrix of numbers. Thus it will always remain a challenge to fit the numerical scheme to the reality, and the fit will never be perfect. Systematic strategies for evaluating coding decisions enable the synthesist to control for much of the error inherent in the process. When used in conjunction with other strategies, they can help reduce error as well. This chapter discusses strategies to reduce error as well as those to control for error and suggests further research to advance the theory and practice of this particular aspect of the synthesis process. To set the context, however, it is first useful to describe the sources of error in synthesis coding decisions.

10.2 SOURCES OF ERROR IN CODING DECISIONS

10.2.1 Deficient Reporting in Primary Studies

Reporting deficiencies in original studies presents an obvious problem for the synthesist, to whom the research report is the sole documentation of what was done and what was found. Reporting quality of primary research studies has variously been called "shocking" (Light and Pillemer 1984), "deficient" (Orwin and Cordray 1985), and "appalling" (Oliver 1987).1 In the worst case, deficient reporting can force the abandonment of a synthesis.2 Virtually all write-ups will report some information poorly, but some will be so vague as to obscure what took place entirely. The absence of clear or universally accepted norms undoubtedly contributes to the variation, but there are other factors: different emphases in training, scarcity of journal space, statistical mistakes, and poor writing, for example. The consequences are differences in the completeness, accuracy, and clarity with which empirical research is reported.

Treatment regimens and subject characteristics cannot be accurately transcribed by the coder when inadequately reported by the original author. Similarly, methodological features cannot be coded with certainty when research methods are poorly described. The immediate consequence of coder uncertainty is coder error. One way to address such shortcomings is to agree on conventions for imputing guesses at the values of poorly reported data. For example, in the Mary Smith, Gene Glass, and Thomas Miller analysis, when an investigator was remiss in reporting the length of time the therapist had been practicing, the guessing convention for therapist experience assumed five years (1980). Such a device serves to standardize decisions under uncertainty and therefore to increase intercoder agreement. It is unknown whether it reduces coder error, however, because there is no external way to validate the accuracy of the convention. Furthermore, a guessing convention carries the possibility of bias in addition to error. Unlike pure observational error, which presumably distributes itself around the true value across coders, the convention-generated errors may not balance out, but consistently over- or underestimate the true value. This would happen, for example, if in reality the average PhD therapist had been practicing eight years, rather than five years, as in Smith, Glass, and Miller's guessing convention.

Through its influence on coding accuracy, deficiencies in reporting quality degrade the integrity of later analyses in predictable ways. Errors of observation in dependent variables destabilize parameter estimates and decrease statistical power, whereas errors of observation
in independent variables cause parameter estimates to be biased (Kerlinger and Pedhazur 1973). Guessing conventions artificially deflate true variance in coded variables, thereby diminishing the sensitivity of the analysis to detect relationships with other variables. Moreover, any systematic over- and underestimation of true values by guessing conventions exacerbates matters. Specifically, doing so creates a validity problem (for example, therapist experience as coded would not be a valid indicator of therapist experience).

An alternative to guessing conventions that often provides a better solution is to simply code the information as undetermined. When the variable is a categorical one, this has the effect of adding one more category to the possible levels of the moderator (for example, sex of sample predominantly female, predominantly male, mixed, or unknown). In the somewhat rarer case when the problematic variable is continuous, the situation may be handled by dummy coding the availability of the information (0 = not available, 1 = available) and adding to the explanatory model both the dummy variable and the product of the dummy and the continuously coded variable. This has the effect of allowing the estimation of the variable's impact on effect size when good information about the variable is available. Such practice is in keeping with the principle of preference for low-inference coding (see chapter 9, this volume).
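In model terms, the device just described adds two columns to the moderator matrix. The pandas sketch below builds them; the column names and data are invented, and the weighted regression itself is left to whatever meta-regression routine is otherwise in use.

import pandas as pd

studies = pd.DataFrame({
    "es": [0.35, 0.12, 0.58, 0.20],
    "therapist_exp": [8.0, None, 5.0, None],  # years; None = not reported
})

# Dummy code availability (0 = not available, 1 = available) ...
studies["exp_avail"] = studies["therapist_exp"].notna().astype(int)
# ... and form the product of the dummy and the (zero-filled)
# continuous variable, so its slope is estimated only where the
# information was actually reported.
studies["exp_x_avail"] = (studies["therapist_exp"].fillna(0)
                          * studies["exp_avail"])
# Both exp_avail and exp_x_avail then enter the explanatory model.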
10.2.2 Ambiguities in the Judgment Process

Coding difficulty reflects the complexity of the item to be coded, not simply the clarity with which it is reported. There is some relationship between reporting quality and the need for judgment calls, in that deficient reporting of study characteristics increases the need for judgment in coding. Many other variables, however, intrinsically require judgment regardless of reporting quality. In fact, more judgment can sometimes be necessary when reporting is thorough, because the coder has more information to weigh.

Numerous variables pose judgment problems for the coder. Consider, for example, treatment integrity (that is, the extent to which the delivered treatment measures up to the intended treatment). In the psychotherapy literature that Smith, Glass, and Miller synthesized, attrition from treatment (but not from measurement), spotty attendance, failure to meet advertised theoretical requirements, and assorted other implementation problems all potentially degraded treatment integrity (1980). The authors did not attempt to code treatment integrity per se, though they did code their degree of confidence that the labels placed on therapies by authors described what actually transpired. In a reanalysis, Robert Orwin and David Cordray (1985) attempted to code integrity more globally, but without much success. The following examples (representing actual cases) point up some of the difficulties:

Case 1: The efficacy of ego therapy was tested against a placebo treatment and no-treatment controls. Several participants did not attend every session; only those attending four or more sessions out of seven were posttested.

Case 2: The comparative efficacy of therapist-administered desensitization and self-administered desensitization was tested (relative to various control groups) for reducing public speaking anxiety. The therapists were advanced graduate students in clinical psychology who had been trained in using desensitization but were inexperienced in applying it.

Case 3: The comparative effect of implosive therapy, eclectic verbal therapy, and bibliotherapy was tested for reducing fear of snakes. All eclectic verbal therapy was performed by Jack Vevea, who has published several articles on implosive therapy.

Case 1 exemplifies by far the most common treatment-integrity problem Orwin and Cordray (1985) encountered: the potential dilution of the treatment regimen by nonattendance. Here the coder must judge whether participants with absentee rates up to 43 percent can be said to have received the intended treatment. If not, the coder needs to determine by how much the treatment has degraded, and how this degradation affects the estimated effect size (a question made still more difficult by the author's failure to report the number of participants with nonattendance problems). In case 2, the treatment as advertised is potentially degraded by the use of inexperienced treatment providers. The coder must judge whether the lack of practice made a difference and, if so, by how much. In case 3, the treatment provider's motivation to maintain the integrity of the treatment comes into question. The coder must ascertain whether the apparent conflict of interest degraded the treatment and, if so, by how much (for example, uninspired compliance or outright sabotage).

Theoretically, the need for judgment calls in coding such complex constructs could be eliminated by a coding algorithm that considered every possible contingency and
provided the coder with explicit instructions in the event of each one, singly and in every combination. The Smith, Glass, and Miller algorithm for internal validity suggests an attempt at this (1980, 63–64). The components of this decision rule represent generally accepted internal validity concerns. Yet, in trying to apply it, Orwin and Cordray frequently found that it failed to accommodate several important contingencies and thus put their own sense of the study's internal validity in contradiction to the score yielded by the algorithm (see Orwin 1985). But the fault is not with the failure of the algorithm to include and instruct on all contingencies, for in practice that would not be possible. With a construct as complex as treatment integrity or internal validity, no amount of preliminary work on the coding algorithm will eliminate the need for judgment calls. Indeed, the contingency instructions are judgment calls themselves, so at best the point of judgment has only been moved, not eliminated.3 Nevertheless, when it is practical to modify the coding protocol so as to accommodate exceptional contingencies, the practice of striving for low-inference definitions of complex constructs may be advantageous, if only as an aid to coding reliability.

Other examples abound. Although David Terpstra (1981) reported perfect coding reliability in his synthesis of organization development research, R. J. Bullock and Daniel Svyantek's replication (1985) reported numerous problems with both reliability and validity. Specifically, they noted that problems occurred in the coding of the dependent variable, where coding required a great deal of subjectivity. Similarly, Joanmarie McGuire and her colleagues were unable to achieve adequate intercoder agreement on methodological quality despite having methodologically sophisticated coders, written coding instructions, and clearly reported study methods (1985; see also chapter 7, this volume). As noted previously, coding difficulty reflects the complexity of the item to be coded, not just the clarity with which it is reported. One solution to this problem that has gained favor is using low-inference criteria to assess such issues as study quality or implementation fidelity. To the degree that constructs such as study quality or implementation fidelity can be reduced to unambiguous, directly observable criteria, both the validity and the reliability of these measures may be enhanced (see, for example, Valentine and Cooper 2008).

Inherent ambiguities affect more fundamental coding decisions than what values to code, such as what effect sizes to include. Bert Green and Judith Hall observed that synthesists are divided as to whether to use multiple outcomes per group comparison (1984). As shown by Orwin and Cordray (1985) and Georg Matt (1989), the so-called conceptual redundancy rule – that is, the process used in the original Smith, Glass, and Miller (1980) psychotherapy synthesis for determining which of multiple effect sizes within a given study should be counted in determining overall effect size – can be interpreted quite differently by different coders.4 In recoding a twenty-five-study sample from the original Smith, Glass, and Miller psychotherapy data, four independent coders extracted 81, 159, 172, and 165 effect sizes, respectively (Smith, Glass, and Miller 1980; Orwin and Cordray 1985; Matt 1989). The corresponding average effect sizes were d = .90, .47, .68, and .49. Thus, though the coders were attempting to follow the same decision rule for extracting effect sizes, both the number of effect sizes and the resulting findings varied by a factor of two. Using the same set of printed guidelines on conceptual redundancy, the coders still disagreed substantially. Not surprisingly, coder disagreement on which effect sizes to include led to more discrepant results than coder disagreement on effect size computations once the set of effect sizes had been decided on (Matt 1989). Again, more clarity in the decision rule might have better guided the judgment process, but it is unlikely to have eliminated the need for judgment.

Additional inclusion rules can compensate for the lack of agreement stemming from the use of relatively high inference criteria like the conceptual redundancy rule. In this case, for instance, the synthesis can be restricted to that subset of nonredundant effect sizes on which all or at least most coders agree (Orwin and Cordray 1985). This has the effect of eliminating potentially questionable effect sizes. In still another variation, David Shapiro and Diana Shapiro retained all measures except those permitting "only a relatively imprecise effect size estimate" (1982, 589). These were discarded if, in the coder's view, the study provided more complete data on enough other measures. Matt presents additional rules for selecting effect sizes (1989; see also chapter 8, this volume).

10.2.3 Coder Bias

An additional error source is coder bias (for example, for or against a given therapy). A coder with an agenda is not a good coder, especially for items that require an inference. The ideal coder is totally unbiased and expert in the content area, but such a coder is difficult to find. Some would argue that by definition, it is impossible. Expertise carries the baggage of opinions, and keeping those opinions out of coding decisions – many of which are by
EVALUATING CODING DECISIONS 181

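To make the mechanics of such a rule concrete, the following minimal sketch (in Python, with hypothetical effect-size identifiers and a threshold of our own choosing, not part of any published procedure) retains only those effect sizes that enough coders independently selected:

    from collections import Counter

    def retained_effect_sizes(selections, min_coders):
        """selections: one set of effect-size identifiers per coder."""
        counts = Counter(es for coder in selections for es in coder)
        return {es for es, n in counts.items() if n >= min_coders}

    # Hypothetical extraction decisions by three independent coders:
    coder_a = {"study1-es1", "study1-es2", "study2-es1"}
    coder_b = {"study1-es1", "study2-es1", "study2-es2"}
    coder_c = {"study1-es1", "study1-es2", "study2-es1"}

    print(sorted(retained_effect_sizes([coder_a, coder_b, coder_c], 3)))
    # ['study1-es1', 'study2-es1'], the subset on which all coders agree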
10.2.3 Coder Bias

An additional error source is coder bias (for example, for or against a given therapy). A coder with an agenda is not a good coder, especially for items that require an inference. The ideal coder is totally unbiased and expert in the content area, but such a coder is difficult to find. Some would argue that by definition, it is impossible. Expertise carries the baggage of opinions, and keeping those opinions out of coding decisions (many of which are by definition judgmental) is difficult to do. Ambiguities in the judgment process and coder bias are related in that ambiguity creates a hospitable environment for bias to creep in unnoticed.

Sometimes bias is more blatant. In a synthesis of the effects of school desegregation on black achievement (Wortman and Bryant 1985), a panel of six experts convened by the National Institute of Education independently analyzed the same set of studies, obtaining different results and reaching different conclusions. Panelists excluded studies, substituted alternative control groups, and sought missing information from authors in accordance with their prior beliefs on the effectiveness of desegregation. Even after discussion, the panel disbanded with their initial views intact (315).

One approach to reducing bias is to keep coders selectively unaware of information that triggers bias. Thomas Chalmers and his colleagues had two coders independently examine papers in random order, with the methods sections separated from the results (1987).5 In addition, papers were photocopied in such a way that the coders could not determine their origins. Harold Sacks and his colleagues suggested that this is an ideal way to control for this type of bias, but we note that it is rarely done (1987). In the eighty-six meta-analyses of randomized clinical trials analyzed, none were successfully masked (three showed evidence of attempts). The rationale for such masking is exactly the same as in primary studies: to reduce experimenter expectancy effects and related artifacts. It is consistent with the central theme of this book, that research synthesis is a scientific endeavor subject to the same rigorous standards as primary research.

In practice, such masking procedures are difficult to implement, and hence are rarely followed. Two other approaches can help to minimize the impact of coder bias. First, to the degree possible, it is best to identify how issues such as implementation fidelity and study quality will be defined in advance of data collection; this helps prevent creating definitions in such a way that particular individual studies supporting a particular viewpoint will be favored. Second, there is growing agreement that a preference for low-inference coding is helpful with coder bias. To the degree that issues such as study quality can be objectified by criteria like experimenter masking, psychometric properties of measures, impact rating of journal, Carnegie status of primary authors' institutions, and so on, subjectivity is abated, leaving less opportunity for bias (for more discussion of low-inference coding, see chapters 7 and 9, this volume).

10.2.4 Coder Mistakes

Of course, coders can be unbiased in the sense of holding no views about the likely outcomes of the research being coded, and still make systematic mistakes, systematic in the statistical sense of nonrandom error. The problem is by no means unique to synthesis coding. In their analysis of errors in the extraction of epidemiological data from patient records, Ralph Horwitz and Eunice Yu found that most coding errors occurred because the data extractor simply failed to find information in the medical record (1984). Additional errors were made when information was correctly extracted but the coding criteria were incorrectly applied.

The synthesis coding process is also subject to the same range of simple coder mistakes as any other coding process, including slips of the pencil and keyboard errors. It is particularly vulnerable to the effects of boredom, fatigue, and so on. In a synthesis of any size, many hours are required, often over a period of months, to code a set of studies. The degradation of coder performance under these conditions is discussed further in section 10.4.1.6.

10.3 STRATEGIES TO REDUCE ERROR

Here we discuss nine strategies that potentially reduce error: contacting original investigators, consulting external literature, training coders, pilot testing the coding protocol, revising the coding protocol, possessing substantive expertise, improving primary reporting, using averaged ratings, and seeking coder consensus. Although reducing coding error is distinct from evaluating coding decisions, it is the higher purpose that evaluation of coding decisions serves, and hence merits discussion in this context.

10.3.1 Contacting Original Investigators

An apparent solution to the problem of missing or unclear information is to contact the original investigators in the hope of retrieving or clarifying it. This becomes labor intensive when the number of studies is large, so it is prudent to consider the odds of successful retrieval. The investigators need to be alive; they need to be located; they need to have collected the information in the first place; they need to have kept it; and they need to be willing and able to provide it. Janet Hyde's synthesis of cognitive gender differences is informative here (1981). Rather than trying to estimate effect sizes from incomplete information, she wrote to the authors.
Of the fifty-three studies in her database, eighteen lacked the necessary means and standard deviations. Although all eighteen authors were located, only seven responded, and only two were able to supply the information. Furthermore, the successful contact rate may have been atypically high because the topic of cognitive gender differences at that time was relatively young; studies that predated the synthesis by more than fifteen years were rare. Of course, Hyde's final success rate could have been still worse had more esoteric information than means and standard deviations been sought. As Richard Light and David Pillemer noted, the chance of success probably depends quite idiosyncratically on the field, the investigators, and other factors, such as the dates of the studies (1984). One might suppose that the advent of electronic communication and electronic data storage would improve the general success rate of obtaining information through author contact. However, of the nineteen meta-analytic papers published in Psychological Bulletin in 2006 that reported on such attempts, eight (57 percent) indicated that the technique failed in some cases; reported response rates were sometimes lower than 50 percent.6

10.3.2 Consulting External Literature

A subset of the information of interest to the synthesist is theoretically obtainable from other published sources if omitted or obscured in the reports. For one variable, experimenter affiliation (training), this option was exercised by Smith, Glass, and Miller in their original psychotherapy synthesis, and by others since (1980). When the experimenter's affiliation was not evident from the report, the American Psychological Association directory was consulted. In their reanalysis of the same data, Orwin and Cordray attempted to get the reliability of the outcome measure, when not extractable from the report, from the Mental Measurements Yearbook (Orwin and Cordray 1985; Buros 1978). The strategy was unsuccessful, for a number of reasons. Many measures were not included (for example, less-established personality inventories, experimenter-developed convenience scales); when measures were included, a discussion of reliability was frequently omitted; when measures were included and a discussion of reliability included, the range of estimates was sometimes too wide to be useful; and when measures were included, reliability discussed, and an interpretable range of values provided, they were not always generalizable to the population sampled for the psychotherapy study. Contacting test developers directly would provide more information than consulting the Mental Measurements Yearbook, but reliabilities of experimenter-developed convenience scales are not obtainable this way. Nor could the problem of generalizing to different populations be resolved.

The use of external sources may be more successful with other variables. For example, detailed information needed for effect-size calculations may be available in a dissertation, but not in the final, shorter publication of the research. However, the proportion of variables that are potentially retrievable using this strategy will typically be small. For example, of the more than fifty variables coded in the Smith, Glass, and Miller study, experimenter allegiance appeared to be the only variable other than experimenter affiliation that might be deducible through an external published source.

10.3.3 Training Coders

Given the importance and complexity of the coding task, the need for solid coder training as an error-reduction strategy is self-evident. This topic is covered in detail in chapter 9, this volume. The training should include a phase in which coders are made familiar with the project and the coding protocol. Ideally, all coders who will be involved in the project should independently code a subset of studies selected to exemplify particularly challenging coding problems. The process of jointly examining the results sensitizes the coders to the possibility of coder variability and provides a vehicle for training in how to resolve the issues that have been identified as most likely to present difficulties. It may sometimes be practical to merge this training phase with the process of developing and pilot testing the coding protocol. Moreover, if the protocol is revised (see section 10.3.5), the need for appropriate retraining is evident.

10.3.4 Pilot Testing the Coding Protocol

Piloting the coding protocol for a synthesis is no less essential than piloting any treatment or measurement protocol in primary research. First, it supplies coders with direct experience in applying the conventions and decision rules of the process before coding the main sample, which is necessary to minimize learning effects. Second, it assesses whether a coder's basic interpretations of conventions and decision rules are consistent with the synthesist's intent and with the other coders' intent. This is necessary to preclude underestimating attainable levels of interrater reliability. Third, it can identify inadequacies in the protocol, such as the need for additional categories for particular variables or additional variables to adequately map the studies.
The concept of pilot testing was taken a step further by Bullock and Svyantek in their reanalysis of a prior synthesis of the organization development literature (1985). After dividing the sixteen years of the study period in half, each author independently coded the first half so as to develop reliability estimates as well as resolve preliminary coding problems. After examining their work, along with each instance of disagreement, the authors attempted to improve interrater reliability by developing more explicit decision rules for the same coding scheme. The more explicit rules were then applied to the studies from the second half of the time period. (The first half was also recoded in accordance with the revised protocol.) Agreement was higher in the second half on six of seven variables tested, sometimes dramatically so (for example, agreement on sample size increased from 70 to 94 percent). It is not clear why the authors halved the sample by period rather than randomly, given that using period introduced a potential confound into the explanation of improvement. That is, improvement in the quality of reporting during the second half of the study period could alone account for much of the improvement in interrater reliability.7 The distinction between what these authors did and what is typically done is that the preliminary codings were performed on a large enough subset of studies to yield reliable quantitative estimates of agreement. This enabled an empirical validation that agreement had indeed improved on the second half.

10.3.5 Revising the Coding Protocol

On occasion, inadequacies in the coding protocol will not be identified in the pilot testing phase. Indeed, the Cochrane Handbook of Systematic Reviews states that it is rare that a coding form does not require modification after piloting (Higgins and Green 2005). It may be, for example, that in the course of coding the fiftieth study, a coder becomes aware of an ambiguity in a categorical coding scheme that requires attention. Although this issue is partly the domain of chapter 9, it may come to the attention of those monitoring the coding process as a result of checks on coder reliability (such as a check for coder drift or a consensus discussion). When a coder modifies the way in which the coding protocol is used, it is important to identify that change and assess whether the modification is appropriate. If it is deemed appropriate, then the coding protocol is changed, and one must take steps to ensure the uniform (across coders and studies) and retroactive implementation of the change. Following such a practice will ultimately result in higher quality, more consistent coding than would be obtained if coders continued to use the original coding form.

10.3.6 Possessing Substantive Expertise

Substantive expertise will not reduce the need for judgment calls but should increase their accuracy. Numerous authors have stressed the need for substantive expertise, and with good reason: The synthesist who possesses it makes more informed and thoughtful judgments at all levels. Still, scholars with comparable expertise disagree frequently on matters of judgment in the social sciences as elsewhere, particularly when they bring preexisting biases, as noted earlier. Substantive expertise informs judgment, but will not guarantee that the right call was made. Employing low-inference coding protocols will reduce, but not eliminate, the need for expertise. We also note that low-inference codes often require a great deal of expertise to develop.

10.3.7 Improving Primary Reporting

Although the individual synthesist can do nothing about improving primary reporting, social science and medical publications can do something to reduce error in synthesis coding. There is evidence of progress in this regard. There is a recent effort on the part of the American Psychological Association (APA) to encourage full reporting of descriptive statistics and effect sizes (Journal Article Reporting Standards Working Group 2007). A similar policy has been adopted by a number of prominent journals in the medical field, in the form of the Consolidated Standards of Reporting Trials (CONSORT) (see Begg et al. 1996; Moher, Schulz, and Altman 2001).

10.3.8 Using Averaged Ratings

It is a psychometric truism that averages of multiple independent ratings will improve reliability (and therefore reduce error) relative to individual ratings. In principle, then, it is desirable for the synthesist to use such averages whenever possible. Two practical problems limit the applicability of this principle. First, the resources required to double- or triple-code the entire set of studies can be substantial, particularly when the number of studies is large or the coding form is extensive. Second, many variables of interest in a typical synthesis are categorical, and therefore cannot be readily averaged across raters. For example, therapy modality from Smith, Glass, and Miller could be coded as individual, group, family, mixed, automated, or other (1980). It is not clear how a mean rating would be possible for such a variable (with at least three coders, a median or a modal rating might make sense).
The first problem can be ameliorated by targeting for multiple coding only those variables known in advance to be both important to the analysis and prone to high error rates. Effect size and internal validity, for example, are two variables that would meet both criteria. The foreknowledge to select the variables can frequently be acquired through a targeted examination of existing research (Cordray and Sonnefeld 1985), as well as through the synthesist's own pilot test. The second problem is more structural, being a property of the categorical measure. A commonsense solution might be, again, to target only those variables that are both important to the analysis and error-prone and have each been coded by three raters. Three raters would permit a majority rule in the event that unanimity was not achieved. If the three coders selected three different responses (that is, there was no majority), the synthesist should probably reconsider the use of that variable altogether.
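A minimal sketch of such a majority rule in Python (the category labels here are hypothetical):

    from collections import Counter

    def majority_code(ratings):
        """Return the modal rating if at least two of three coders agree."""
        value, count = Counter(ratings).most_common(1)[0]
        return value if count >= 2 else None

    print(majority_code(["group", "group", "family"]))   # 'group'
    print(majority_code(["group", "family", "mixed"]))   # None: no majority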
In circumstances where it is deemed desirable to use averaged codings, the rate of agreement among coders is of less interest than the reliability of the average. Generalizability theory can provide a framework for estimating that reliability (Shavelson and Webb 1991). In principle, a D study could estimate the number of raters needed to attain a targeted reliability for the average (a sketch follows below). In practice, however, it is not common to employ more than two raters; indeed, of the nineteen meta-analytic studies published in Psychological Bulletin in 2006, only nine used multiple coders for all studies included in the meta-analysis, and only three double coded effect-size calculations.
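As a sketch of the D-study idea under the simplest assumptions (a single-facet design with parallel ratings, where the Spearman-Brown formula applies): the single-rater reliability of .35 below is borrowed from the intraclass correlation example later in this chapter, and the function names are ours.

    import math

    def averaged_reliability(single_rater_rel, k):
        """Spearman-Brown: reliability of the mean of k parallel ratings."""
        return k * single_rater_rel / (1 + (k - 1) * single_rater_rel)

    def raters_needed(single_rater_rel, target):
        """Smallest number of raters whose average reaches the target."""
        r, t = single_rater_rel, target
        return math.ceil(t * (1 - r) / (r * (1 - t)))

    print(round(averaged_reliability(0.35, 2), 2))   # 0.52 with two raters
    print(raters_needed(0.35, 0.80))                 # 8 raters for .80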
10.3.9 Using Coder Consensus

Many meta-analyses are relatively limited in their scope, involving only a few effect sizes and potential explanatory variables. This situation is particularly common for syntheses of medical clinical trials, where the tendency is to focus on narrowly defined questions. Often under such circumstances it is possible to arrange for every study to be coded by two or more independent coders. When that occurs, it is appropriate to have periodic discussions during which consensus is sought if it is not already present. It may be the case that, where there is disagreement, one coder has noticed a detail the others have missed, and that when the detail is described, all will concur with the minority coder. In that respect, consensus discussions may represent a distinct advantage over strategies such as averaging or majority rule, where the correct minority opinion would be obscured or overruled.

The consensus approach is not without pitfalls. It may be impractical or impossible to implement for projects of large scope. The consensus meeting may present opportunities for systematic coder bias that would not otherwise arise (see section 10.2.3). One particularly insidious form of such bias occurs if there is a tendency for coders to defer to the most senior, who may be the most likely to have a conscious or unconscious agenda for the analysis. Nevertheless, coder consensus can be a highly effective approach. Indeed, it is common practice; of the twelve meta-analyses published in Psychological Bulletin in 2006 that reported any measure of rater agreement, half also mentioned that all disagreements were resolved, either by consensus or by appeal to a principal investigator. This makes the meaning of the reported agreement statistics somewhat nebulous, because it does not reflect the coding used in the meta-analyses.

10.4 STRATEGIES TO CONTROL FOR ERROR

It is always better to reduce error than to attempt to control and correct for it after the fact. For example, methods such as the correction for attenuation from measurement error, though useful, tend to overcorrect for error, because the reliability estimates they use in the denominator often are biased downward. Yet strategies to control for error are necessary because, as documented earlier in this chapter, strategies to reduce error can succeed only to a limited degree. Whether because of deficient reporting quality, the limits of coder judgment, or a combination of the two, there will always be a large residual error that cannot be eliminated. The question then becomes how to control for it, and the answer begins with strategies for evaluating coding decisions. Three such strategies are discussed here: interrater reliability assessment, confidence ratings, and sensitivity analysis.

10.4.1 Reliability Assessment

10.4.1.1 Rationale As in primary research applications, the amount of observer error in research synthesis can be at least partially estimated via one or more forms of interrater reliability (IRR).
The reanalysis of the Smith, Glass, and Miller psychotherapy data (Orwin and Cordray 1985) suggested that failing to assess IRR and to consider it in subsequent analyses can yield misleading results.

In the original synthesis of psychotherapy outcomes, a sixteen-variable simultaneous multiple regression analysis was used to predict outcomes. Identical regressions were run within three treatment classes and six subclasses, as defined through multidimensional scaling techniques with the assistance of substantive experts (see Smith, Glass, and Miller 1980). Orwin and Cordray recoded a sample of studies from the original synthesis, and computed IRRs (1985). They then reran the original regression analyses with corrections for attenuation. Approximately twice per class and twice per subclass, on average, corrections based on one or more reliability estimates caused the sign of an uncorrected predictor's coefficient to reverse.8 The implication of a sign reversal (when significant) is that interpretation of the uncorrected coefficient would have led to an incorrect conclusion regarding the direction of the predictor's influence on the effect size. In every run in every class and subclass, the reliability correction altered the ranking of the predictors in their capacity to account for variance in effect size.

At least two of the major conclusions of the original study were brought into question by these findings. The first was that the study disconfirmed allegations by critics of psychotherapy that poor-quality research methods have accounted for observed positive outcomes. This conclusion was based on the trivial amount of observed correlation (r = .03) between internal validity and effect size, which was taken as evidence that design quality had no effect. The reanalyses suggested that unreliability may have so seriously attenuated both the bivariate correlation between the two variables and the contribution of internal validity to the regression equations that only an unrealistically large relationship could have been detected. Orwin and Cordray found that the reliability of internal validity may have been as low as .36 (1985).9

The second Smith, Glass, and Miller conclusion bearing another look in light of these reanalyses was that the outcomes of psychotherapy treatments cannot be very accurately predicted from characteristics of studies. Although the reliability-corrected regressions do not account for enough additional variance in effect size to claim very accurate prediction, they do improve the situation. It would be likely to improve still more with better-specified models, at least to the extent that reporting quality and coder judgment would permit them. The point is that though unreliability alone cannot explain the poor performance of the Smith, Glass, and Miller model, it could be part of a larger process that does.

Unreliability in the primary study's dependent measure attenuates not only relationships with effect size, but the effect-size estimate itself. Under classical test theory assumptions, measurement error has no effect on the means of the treatment and comparison groups, but increases the within-group variance. Hedges and Olkin showed that under that model, the true effect size for a given study is equal to the observed effect size over the square root of the reliability of the dependent measure (1985). If the dependent measure had a reliability of .70, for example, the estimated true effect size would equal the observed effect size times 1/√.70, so the observed effect size would underestimate the true effect size by 16 percent. Error in coding effect-size estimates then exacerbates the problem by increasing the variance of the effect-size distribution at the aggregate level.
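A minimal numerical sketch of that correction in Python (the function name and the observed value are ours, chosen only for illustration):

    import math

    def disattenuate(observed_d, reliability):
        """True effect size = observed effect size / sqrt(reliability)."""
        return observed_d / math.sqrt(reliability)

    print(round(disattenuate(0.50, 0.70), 3))   # 0.598 from an observed 0.50
    # The observed d is sqrt(.70), about 84 percent, of the true d:
    # an underestimate of roughly 16 percent.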
Despite wide recognition by writers on research synthesis of the need to assess IRR, in practice addressing coder reliability may still be the exception rather than the rule. For example, only 29 percent of the meta-analyses published in Psychological Bulletin from 1986 through 1988 reported a measure of coder reliability (Yeaton and Wortman 1991). A survey of nineteen meta-analytic papers that appeared in Psychological Bulletin in 2006 suggests that while the situation has changed for the better, double coding is still far from standard practice. Seven of the nineteen papers presented no information on coding reliability, and only eight reported on the reliability of coding for all variables. Interestingly, only three papers reported that effect-size estimates were double coded. Given the high methodological standards of Psychological Bulletin relative to many other social research journals that publish meta-analyses, the field-wide percentage may be significantly lower.

10.4.1.2 Across the Board Versus Item-by-Item Agreement Smith, Glass, and Miller's reliability assessment consisted of the computation of a simple agreement rate across all variables in an abbreviated coding form, which came to 92 percent (1980). According to William Stock and his colleagues, who examined the practice of synthesists regarding interrater reliabilities, subsequent synthesists did much the same thing, if that much (1982). That is, a single agreement rate, falling somewhere between .7 and 1.0, is the extent of what typically gets reported.

There are at least two major problems with this practice. First, it makes little psychometric sense. The coding form is simply a list of items; it is not a multi-item measure of a trait, such that a total-scale reliability would be meaningful. Some items, such as publication date, will have very high interrater agreement, whereas others, such as internal validity, may not. Particularly if reliabilities are to be meaningfully incorporated into subsequent analyses (for example, by correcting correlation matrices for attenuation), it is this variation that needs to be recognized. In their reanalysis of the Smith, Glass, and Miller psychotherapy data (1980), Orwin and Cordray replicated an equivalently high overall agreement rate (1985). However, agreement across individual variables ranged from .24 to 1.00 (and other indices of IRR showed similar variability).

Second, an across-the-board reliability fails to inform the synthesist of specific variables needing refinement or replacement. Thus, an opportunity to improve the process is lost. In sum, the synthesist should assess IRR on an item-by-item basis.
10.4.1.3 Hierarchical Approaches When synthesists do compute item-by-item IRRs (or variable-by-variable IRRs when variables comprise several items), they typically treat the reliability estimate of each item or variable independently. William Yeaton and Paul Wortman suggested an alternative approach (1991). In addition to reinforcing the need to examine item-level reliabilities rather than simply overall reliability, the approach recognizes that item reliabilities are hierarchical, in that the correctness of coding decisions at a given level of the hierarchy is contingent on decisions made at higher levels. In such cases, the code given to a particular variable may restrict the possible codes for other variables. In the hierarchical structure, the IRRs of different variables, as well as the codings themselves, are nonindependent. Existing approaches to IRR in research synthesis do not take this into account.

Using three variables from the original Smith, Glass, and Miller synthesis (1980), Yeaton and Wortman illustrated the hierarchical approach (1991). Each variable can be conceived at a different hierarchical level:

Level 1 (condition): Are these treatment or control group data?
Level 2 (type): What type of treatment or control group is used?
Level 3 (measure): For the type of condition, what measure is reported?

Level 1 identifies the group as psychotherapy or control; level 2 involves a choice of ten general types of therapy and control groups; level 3 consists of four general types of outcome measures used in the calculation of effect size measures. The levels of the hierarchy reflect the contingent nature of the coding decisions; that is, agreement at lower levels is conditional on agreement at higher levels. If the variables are strictly hierarchical, and this is an empirical question in each synthesis, then incorrect coding at one level ensures incorrect coding at all lower levels, and, conversely, correct coding at the lowest level indicates that coding was correct at each higher level. That errors can be made at each level adds to the variability of IRR. Without a hierarchical approach, this variability is obscured and results can be misleading. For example, what does it mean to correctly code type of control group as placebo (level 2), if in fact that group was actually a treatment group (level 1)?

Yeaton and Wortman examined IRRs by level in several syntheses and found that agreement did indeed decrease as level increased, as predicted by their hierarchical model (1991). The results provide yet another argument against an across-the-board agreement rate: that it overestimates true agreement because of the uneven distribution of items across levels (level 1 items tending to be the most numerous).

Although more work is needed on the details of how best to apply a hierarchical approach, the approach does clearly present a more realistic model of the coding process than one that assumes that IRR for each variable is independent. At minimum, synthesists should understand the hierarchical relationships between their own variables prior to interpreting the IRRs calculated on those variables.
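One way to operationalize this contingent structure, shown here as a minimal Python sketch, is to assess agreement at each level only among the cases whose higher levels already agree. This is an illustration of the idea, not Yeaton and Wortman's own algorithm, and the codes are hypothetical:

    def hierarchical_agreement(pairs, n_levels):
        """pairs: (coder1_codes, coder2_codes) tuples of level codes."""
        rates, eligible = [], list(pairs)
        for level in range(n_levels):
            agree = [p for p in eligible if p[0][level] == p[1][level]]
            rates.append(len(agree) / len(eligible) if eligible else None)
            eligible = agree   # only agreeing cases count at lower levels
        return rates

    data = [
        (("treatment", "behavioral", "anxiety"),
         ("treatment", "behavioral", "anxiety")),
        (("treatment", "behavioral", "anxiety"),
         ("treatment", "cognitive", "anxiety")),
        (("control", "placebo", "global"),
         ("treatment", "placebo", "global")),
        (("treatment", "cognitive", "self-esteem"),
         ("treatment", "cognitive", "global")),
    ]
    print(hierarchical_agreement(data, 3))   # roughly [0.75, 0.667, 0.5]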
10.4.1.4 Specific Indices of Interrater Reliability Laurel Oliver and others have noted that authorities do not agree on the best index of IRR to use in coding syntheses (1987). This presentation does not attempt to resolve all the controversies, but will, it is hoped, provide enough of a foundation for the synthesist who is not a statistician to make informed choices. Five indices are presented: agreement rate, kappa and weighted kappa, delta, intercoder correlation, and intraclass correlation. The discussion of each will include a description (including formulas when possible), a computational illustration, and a discussion of strengths and limitations in the context of research synthesis. The following section covers the selection, interpretation, and reporting of IRR indices.
Agreement Rate. Percentage agreement, alternately called agreement rate (AR), has been the most widely used index of IRR in research synthesis. The formula for AR is as follows:

AR = (number of observations agreed upon) / (total number of observations).  (10.1)

Table 10.1 presents a hypothetical data set that might have been created had two coders independently rated twenty-five studies on a three-point study characteristic (for example, the 1980 Smith, Glass, and Miller internal validity scale). On this particular characteristic, a rating of 1 = low, 2 = medium, and 3 = high.

Table 10.1 Illustrative Data: Ratings of Studies

                Coder
Study        1        2
 1           3        2
 2           3        1
 3           2        2
 4           3        2
 5           1        1
 6           3        1
 7           2        2
 8           1        1
 9           2        2
10           2        1
11           2        2
12           3        3
13           3        1
14           2        1
15           1        1
16           1        1
17           3        3
18           2        2
19           2        2
20           3        1
21           2        1
22           1        1
23           3        2
24           3        3
25           2        2

source: Authors' compilation.

As shown, the coders agreed in fifteen cases out of twenty-five, so AR = .60.
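The computation is easily scripted. A minimal sketch in Python, entering the table 10.1 ratings directly:

    coder1 = [3, 3, 2, 3, 1, 3, 2, 1, 2, 2, 2, 3, 3, 2, 1,
              1, 3, 2, 2, 3, 2, 1, 3, 3, 2]
    coder2 = [2, 1, 2, 2, 1, 1, 2, 1, 2, 1, 2, 3, 1, 1, 1,
              1, 3, 2, 2, 1, 1, 1, 2, 3, 2]

    # Equation 10.1: agreements divided by total observations.
    agreements = sum(a == b for a, b in zip(coder1, coder2))
    print(agreements, agreements / len(coder1))   # 15 0.6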
AR is computationally simple and intuitively interpretable, being basically a batting average. Yet numerous writers on observational measurement have discussed the pitfalls of using it (Cohen 1960; Hartmann 1977; Light 1971; Scott 1955). When variables are categorical (the usual application), the main problem is chance agreement, particularly when response marginals are extreme. For example, the expected agreement between two raters on a yes-no item in which each rater's marginal response rate is 10–90 percent would be 82 percent by chance alone (Hartmann 1977). In other words, these raters could post a respectable (by most standards) interrater reliability simply by guessing without ever having observed an actual case. Extreme marginal response rates are commonplace in many contexts (for example, psychiatric diagnosis of low-prevalence disorders), including research synthesis (see, for example, historical effects in Kulik, Kulik, and Cohen 1979; design validity in Smith 1980). Additional problems arise when marginal response rates differ across raters (Cohen 1960).

When applied to ordinal (as opposed to nominal) categorical variables, AR has an additional drawback: the inability to discriminate between degrees of disagreement. In Smith, Glass, and Miller's three-point IQ scale, for example, a low-high interrater pattern indicates greater disagreement than does a low-average or average-high pattern, yet a simple AR registers identical disagreement for all three patterns (1980). The situation is taken to the extreme with quantitative variables.

Cohen's Kappa and Weighted Kappa. Various statistics have been proposed for categorical data to improve on AR, particularly with regard to removing chance agreement (see Light 1971). Of these, Cohen's kappa (κ) has frequently received high marks (Cohen 1960; Fleiss, Cohen, and Everitt 1969; Hartmann 1977; Light 1971; Shrout, Spitzer, and Fleiss 1987). The parameter is defined as the proportion of the best possible improvement over chance that is actually obtained by the raters (Shrout, Spitzer, and Fleiss 1987). The formula for the estimate K of kappa computed from a sample is as follows:

K = (Po − Pe) / (1 − Pe),  (10.2)

where Po and Pe are the observed and expected agreement rates, respectively. The observed agreement rate is the proportion of the total count for which there is perfect agreement between raters (that is, the sum of the diagonals in the contingency table divided by the total count). The expected agreement rate is the sum of the expected agreement cell probabilities, which are computed exactly as in a chi-square test of association. That is,

Pe = (1/n²) Σ(i=1 to C) ni· n·i,  (10.3)

where n is the number of observations, C is the number of response categories, and ni· and n·i are the observed row and column marginals for response i for raters 1 and 2, respectively.

The formula for the estimate of weighted kappa (Kw) is as follows:

Kw = 1 − (Σi wi poi) / (Σi wi pei),  (10.4)

where wi is the disagreement level assigned to cell i, and poi and pei are the observed and expected proportions, respectively, in cell i. If regular (unweighted) K is re-expressed as 1 − Do/De, where Do and De are observed and expected proportion disagreement, it can be seen that K is a special case of Kw, in which all disagreement weights equal 1.

Table 10.2 shows the cell counts and marginals from the illustrative data from table 10.1.

Table 10.2 Illustrative Data: Cell Counts and Marginals

                         Coder 1
            Value      1      2      3    Sum
            1          5      3      4     12
Coder 2     2          0      7      3     10
            3          0      0      3      3
            Sum        5     10     10     25

source: Authors' compilation.

Plugging in the values, Pe = [(12)(5) + (10)(10) + (3)(10)] / (25)² = .304. The proportion of observed agreement, Po, is (5 + 7 + 3) / 25 = .6. K = (.6 − .304) / (1 − .304) = .43. Thus, chance-corrected agreement in this example is slightly less than half of what it could have been. To compute Kw, the weights (wi) must be assigned first. Agreement cells are assigned weights of 0. With a three-point scale such as this one, some logical weights would be 2 for the low-high interrater pattern and 1 for the low-medium and medium-high patterns, although other weights are possible of course. Summing across the wi poi products for each cell i yields Σ wi poi = .56. Similarly, summing across the wi pei products yields Σ wi pei = .912. Then, Kw = 1 − .56 / .912 = .39. In this case, a relatively modest percentage of disagreements (four of ten) were by two scale points, as one would hope to be the case with a three-point scale. Consequently, Kw is only slightly lower than K. With more scale points, the difference will generally be larger.
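These kappa computations can be reproduced in a few lines of Python from the same ratings; the absolute scale distance implements the 1-and-2 disagreement weights just described:

    pairs = [(3,2),(3,1),(2,2),(3,2),(1,1),(3,1),(2,2),(1,1),(2,2),(2,1),
             (2,2),(3,3),(3,1),(2,1),(1,1),(1,1),(3,3),(2,2),(2,2),(3,1),
             (2,1),(1,1),(3,2),(3,3),(2,2)]
    coder1 = [a for a, b in pairs]
    coder2 = [b for a, b in pairs]
    n, cats = len(pairs), (1, 2, 3)

    p_o = sum(a == b for a, b in pairs) / n                            # .600
    p_e = sum(coder1.count(c) * coder2.count(c) for c in cats) / n**2  # .304
    print(round((p_o - p_e) / (1 - p_e), 2))                           # K = 0.43

    w_po = sum(abs(a - b) for a, b in pairs) / n                       # .56
    w_pe = sum(abs(i - j) * coder1.count(i) * coder2.count(j)
               for i in cats for j in cats) / n**2                     # .912
    print(round(1 - w_po / w_pe, 2))                                   # Kw = 0.39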
As is evident from the formula for K, chance agreement is directly removed from both numerator and denominator. Kappa has other desirable properties:

• It is a true reliability statistic, which in large samples is equivalent to the intraclass correlation coefficient (discussed later in this section; see also Fleiss and Cohen 1973).
• Because Pe is computed from observed marginals, no assumption of identical marginals across raters (required with certain earlier statistics) is needed.
• It can be generalized to multiple (that is, more than two) raters (Fleiss 1971; Light 1971).
• It can be weighted to reflect varying degrees of disagreement (Cohen 1968), for example, for the Smith, Glass, and Miller IQ variable mentioned earlier.
• Large-sample standard errors have been derived for both K and Kw (Fleiss, Cohen, and Everitt 1969), thus permitting the use of significance tests and confidence intervals.
• It takes on negative values when agreement is less than chance (range is −1 to +1), thus indicating the presence of systematic disagreement as well as agreement.
• It can be adapted to stratified reliability designs (Shrout, Spitzer, and Fleiss 1987).

Thus, K is not a single index, but rather a family of indices that can be adapted to various circumstances.

One issue that the analyst must keep in mind is that indices such as K may not be well estimated with the sample sizes available for some meta-analyses. Asymptotic standard errors for K and weighted K are available (Fleiss, Cohen, and Everitt 1969); these may be relatively large unless the number of studies coded is high. It is impossible to provide a simple rule of thumb for a minimum sample size because the standard errors depend partly on the cell proportions. However, it is noteworthy that for the data from table 10.1 (which produced an estimate of K = .43), the standard error is 0.12, leading to a 95 percent confidence interval from .18 to .67. For the example provided in the original paper on asymptotic standard errors (Fleiss, Cohen, and Everitt 1969), a sample size of 60 would be required for the standard error of unweighted K to be as low as .10. Hence, for the numbers of studies present in many meta-analyses, there may be very little information about coding reliability.
Although K and Kw resolve most of the problems of AR, potential problems remain if observations are concentrated in only a few cells (Jones et al. 1983). This circumstance increases the probability that a high proportion of scores will be assigned to one or two rating categories while other categories receive small or nonexistent proportions. Allan Jones and his colleagues and others, such as Nancy Burton (1981), have argued that this distribution violates the assumption on which K and Kw are based, reducing the usefulness of the information. In a similar vein, Grover Whitehurst noted that K is sensitive to the degree of disagreement between raters, which is desirable, but also remarkably sensitive to the distribution of ratings, which he argued was not desirable (1984). A highly skewed distribution of ratings will yield high estimates of chance agreement in just those cells in which obtained agreement is also high. This makes for low estimates of true agreement in the face of high levels of obtained agreement and, consequently, low estimates of IRR. As noted earlier, the phenomenon is common in psychiatric diagnosis when prevalence is low and has been termed the base rate problem with K (Carey and Gottesman 1978).

However, others view this as a case of shooting the messenger (see, for example, Shrout, Spitzer, and Fleiss 1987). As they see it, the observation that low base rates decrease K is not an indictment of K, but rather represents the real problem of making distinctions in increasingly homogeneous populations. Similarly, Rebecca Zwick questioned whether the sensitivity of K to the shape of the marginal distributions is necessarily undesirable: if cases are concentrated into a small number of categories, it is less demonstrable that coders can reliably discriminate among all C categories, and the IRR coefficient should reflect this (1988). Finally, there is no mathematical necessity for small K values with low base rates (the maximum value of K remains at 1.0 even when the base rate is low) and high Ks have in fact been demonstrated empirically with base rates as low as 2 percent (American Psychiatric Association 1980).

Andrés and Marzo's Delta. One measure of agreement that addresses the base-rate problem of kappa (if, indeed, that is a problem) adapts theory from guessing corrections in multiple-choice testing (Andrés and Marzo 2004). The proposed index (δ) has one somewhat more relaxed assumption than K: nonconcordance rather than independence of raters. In addition to allowing an assessment of the concordance of multiple raters, the index can be computed specifying one rater as a gold standard. It depends, however, on an assumed probability model that describes how raters choose among the options. Moreover, the index lacks a closed-form solution, relying instead on iterative maximum-likelihood estimation (for which software is readily available). Nonetheless, the approach does have the advantage of tending to agree with K in the absence of the base-rate problem, and tending to produce higher values that may be more representative of actual coder agreement when the base-rate problem is present. For the example of table 10.2 (for which unweighted K was .43), the value of delta is .41.

Intercoder Correlation. K and its alternatives were designed for categorical data and are not appropriate for continuous variables, to which we now turn. Numerous statistics have been suggested to assess IRR on continuous variables as well. One of the more popular is the common Pearson correlation coefficient (r), sometimes called the intercoder correlation in this context. Pearson r is as follows:

r = [Σ(i=1 to n) (Xi − X̄)(Yi − Ȳ)] / (n sX sY),  (10.5)

where (Xi, Yi) are the n pairs of values, and sX and sY are the standard deviations of the two variables computed using n in the denominator. With more than two coders, this index is generally obtained by calculating r across all pairs of coders rating the phenomenon (Jones et al. 1983). The resulting correlations can then be averaged to determine an overall value.

In our example, X̄ and Ȳ are 2.2 and 1.64, sX and sY are .75 and .69, respectively, and n is 25. Plugging the values into equation 10.5, r = 5.8 / 12.94 = .45. In practice, the synthesist will only rarely need to calculate r by hand, as all standard statistical packages and many hand calculators compute it.
among all C categories, and the IRR coefficient should r by hand, as all standard statistical packages and many
reflect this (1988). Finally, there is no mathematical ne- hand calculators compute it.
cessity for small K values with low base rates (the maxi- The use of Pearson r with observational data to esti-
mum value of K remains at 1.0 even when the base rate is mate IRRs is analogous to its use in education to estimate
low) and high Ks have in fact been demonstrated empiri- test reliabilities when parallel test forms are available (see
cally with base rates as low as 2 percent (American Psy- Stanley 1971). As such, it bears the same relationship to
chiatric Association 1980). formal reliability theory.10 The Pearson r also has some
Andrs and Marzos Delta. One measure of agree- drawbacks. First, although it describes the degree to which
ment that addresses the base-rate problem of kappa (if, the scores produced by each coder covary, it says nothing
indeed, that is a problem) adapts theory from guessing about the degree to which the scores themselves are iden-
corrections in multiple-choice testing (Andrs and Marzo tical. In principle, this means that coders can produce high
2004). The proposed index () has one somewhat more rs without actually agreeing on any scores. Conversely,
relaxed assumption than Knonconcordance rather than an increase in absolute agreement could actually reduce r
190 CODING THE LITERATURE

if the disagreement had been part of a consistent disagree- Designs 1 and 2 are common in research synthesis. Al-
ment pattern (for example, one rater rating targets consis- though actual random selection of coders is rare, the
tently lower than the other).11 If the IRR estimate will be synthesists usual intent is that those selected represent, at
used only to adjust subsequent analyses, this may not be least in principle, a larger population of coders who might
particularly important. If it will be used to diagnose coder have been selected. Whether explicitly stated or not, syn-
consistency in interpreting instructions (for example, in- thesists typically present their substantive findings as
structions for extracting effect sizes), it can be quite im- generalizable across that population. The findings will
portant. Second, the between-coders variance is always hold little scientific interest if, say, only a particular sub-
removed in computing the product-moment formula. set of graduate students at a particular university can re-
Robert Ebel noted that this is especially problematic produce them. Design 3 will hold in rare instances in
when comparisons are made among single raw scores as- which the specific coders selected are the population of
signed to different subjects by different codersin re- interest and can therefore be modeled as a fixed effect.
search synthesis, of course, the subject is the individual This could happen if, for example, the coders are a spe-
study (1951). cially convened panel of all known experts in the content
Intraclass Correlation. Analogous to the concepts of area (for example, the Wortman and Bryant school deseg-
true score and error in classical reliability theory, the in- regation synthesis described in section 2 might have met
traclass correlation (rI) is computed as a ratio of the vari- this criterion).
ance of interest over the sum of the variance of interest Each design requires a different form of rI, based on a
plus error. Ebel and others suggested that rI was prefera- different analysis of variance (ANOVA) structure. These
ble to Pearson r as an index of IRR with continuous vari- forms for the running example are estimated as follows
ables because it permits the researcher to choose whether (for a fuller discussion of the statistical models underly-
to include the between-raters variance in the error term ing the estimates, see Shrout and Fleiss 1979):
(1951).
BMS  WMS ,
Like K, rI is not a single index but a family of indices r1(design 1)  ___________
BMS  WMS
(10.6)
that permit great flexibility in matching the form of the
BMS  EMS
reliability index to the reliability design used by the syn- r1(design 2)  ____________________________
BMS  EMS  2(CMS  EMS)/25
, (10.7)
thesist; here we mean design in the sense used in Cron-
BMS  EMS ,
bachs generalizability (G) theory (see Cronbach et al. r1(design 3)  ___________
BMS  EMS
(10.8)
1972). Generalizability theory enables the analyst to iso-
late different sources of variation in the measurement, for where BMS and WMS are the between-studies and within-
example, forms or occasions, and to estimate their mag- study mean squares, respectively, and CMS and EMS are
nitude using the analysis of variance (Shavelson and the between-coders and residual (error) mean square
Webb 1991). The reliability design as discussed here is a components, resulting from the partitioning of the within-
special case of the one-facet G study, in which coders are study sum of squares. Note that this partitioning is not
the facet (Shrout and Fleiss 1979). possible in design 1; the component representing the coder
So far, the running example has been discussed in effect is estimable only when the same pair of coders
terms of twenty-five studies being coded by two coders, rated all twenty-five studies.12
but in fact at least three different reliability designs share The next step is the computation of sums of squares
this description. and mean squares from the coder data (for computational
Design 1: Each study is rated by a pair of coders, ran- details, see any standard statistics text, for example, Hays
domly selected from a larger population of coders 1994). Table 10.3 shows the various mean squares and
(one-way random effects model). their associated degrees of freedom for the data from
table 10.1. Plugging the values into equations 10.6, 10.7,
Design 2: A random pair of coders is selected from a and 10.8 yields the following results:
larger population, and each coder rates all twenty-
five studies (two-way random effects model). rI (design 1)  .28,
Design 3: All twenty-five studies are rated by each of
rI (design 2)  .35,
the same pair of coders, who are the only coders of
interest (two-way mixed effects model). rI (design 3)  .44.
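A minimal sketch of equations 10.6 through 10.8 in Python, taking the mean squares from table 10.3 as given:

    n, k = 25, 2
    bms, wms, cms, ems = 0.78, 0.44, 3.92, 0.30   # from table 10.3

    icc1 = (bms - wms) / (bms + wms)                          # design 1
    icc2 = (bms - ems) / (bms + ems + k * (cms - ems) / n)    # design 2
    icc3 = (bms - ems) / (bms + ems)                          # design 3
    print(round(icc1, 2), round(icc2, 2), round(icc3, 2))     # 0.28 0.35 0.44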
On average, rI (design 1) will give smaller values than rI (design 2) or rI (design 3) for the same set of data (Shrout and Fleiss 1979).

When its assumptions are met, the rI family has been mathematically shown to provide appropriate estimates of classical reliability for the IRR case (Lord and Novick 1968). Its flexibility and linkage into G theory also argue in its favor.13 Perhaps most important, its different variants reinforce the idea that a good IRR assessment requires the synthesist to think through the appropriate IRR design, not just calculate reliability statistics. However, because the rI is ANOVA-based, it fails to produce coefficients that reflect the consistency of ratings when assumptions of normality are grossly violated (Selvage 1976). Furthermore, like K, it requires substantial between-items variance to show a significant indication of agreement. As with K, some writers consider rI less useful as an index of IRR when the distributions are concentrated in a small range (for example, Jones et al. 1983). The arguments and counterarguments on this issue are essentially as described for K.

10.4.1.5 Selecting, Interpreting, and Reporting Interrater Reliability Indices With the exception of AR, the indices presented in section 10.4.1.4 have appealing psychometric properties as reliability measures and, when properly applied, can be recommended for use in evaluating coding decisions. The AR was included because of its simplicity and widespread use by research synthesis practitioners, but, for the reasons discussed earlier, it cannot be recommended as highly. These five indices by no means exhaust all those proposed in the literature for assessing IRR. Most of these, such as Finn's r (Whitehurst 1984), Yule's Y (Spitznagel and Helzer 1985), Maxwell's RE (Janes 1979), and Scott's π (Zwick 1988), were proposed to improve on perceived shortcomings of the indices presented here, in particular the rI and K families. Interested readers can explore that literature and draw their own conclusions. They should keep in mind, however, that the prevailing view among statisticians is that, to date, alternatives to rI and K are not improvements, but in fact propose misleading solutions to misunderstood problems (see, for example, Cicchetti 1985 on Whitehurst 1984; Shrout, Spitzer, and Fleiss 1987 on Spitznagel and Helzer 1985). The synthesist who sticks to the appropriate form of K for categorical variables, and the appropriate model of rI for continuous variables, will be on solid ground. Either AR or r can be used to supplement K and rI for the purpose of computing multiple indices for sensitivity analysis (see section 10.4.3.3). The synthesist who chooses to rely on one of the less common alternatives (such as delta) would be wise to first become intimately familiar with its strengths and weaknesses and should probably anticipate defending the choice. The defense is likely to be viewed more favorably if the alternative approach is used to supplement one of the more usual indexes.

Once an index is selected and computed, the obvious question is how large it should be. The answer is less obvious. For r, .80 is considered adequate by many psychometricians (for example, Nunnally 1978), and at that level correlations between variables are attenuated very little by measurement error. Rules of thumb have also been suggested and used for K and rI (Fleiss 1981; Cicchetti and Sparrow 1981). These benchmarks, however, were suggested for evaluating IRR of psychiatric diagnoses, and their appropriateness for synthesis coding decisions is not clear. Indeed, general caution is usually appropriate about statistical rules of thumb that are not tied to a specific context; for that reason, we believe it inappropriate to provide them here (for a similar take on the inappropriateness of context-independent rules of thumb, see chapter 7, this volume).

The issue of how large is large enough is further complicated by distributional variation. The running example showed how the different indices varied for a particular data set, but not how they vary over different conditions. Allan Jones and his colleagues systematically compared the results obtained when these indices are applied to a common set of ratings under various distributional assumptions (1983). Four raters rated job descriptions using a questionnaire designed and widely used (and validated) for that purpose. The majority of ratings were on six-point Likert-type scales with specific anchors for each point; a minority were dichotomous. The K, Kw, AR, r (in
the form of average pairwise correlations), and rI were computed on each item. Aggregated across items, K and Kw yielded the lowest estimates of IRR, producing median values of .19 and .22, respectively. The AR yielded the highest, with a median value of .63; r and rI occupied an intermediate position, with median values of .51 and .39, respectively. For the reasons noted earlier, an overall aggregate agreement index is not particularly meaningful for evaluating coding decisions, but is still useful for illustrating the wide variation among indices.

To examine the effect of the distributions of individual items on the variation across indices, sample items representing three conditions were analyzed: high variation across ratings and high agreement across raters, high variation across ratings and low agreement across raters, and low variation across ratings and high agreement across raters.

Table 10.4 Estimates of Interrater Agreement for Different Types of Data Distributions

Distributional Conditions                  Kappa   Weighted   Agreement   Average       Intraclass
                                                   Kappa      Rate        Correlation   Correlation
Variations in ratings across jobs and       .43      .45        .88          .79           .74
high agreement among raters
Variations in ratings across jobs and       .01      .04        .16          .13           .05
low agreement among raters
Little variation in ratings across jobs     .04      .04        .77          .01           .03
and high agreement among raters

source: Jones et al. 1983.

As shown in table 10.4, the effects can be substantial, in particular under the third condition, where AR registered 77 percent agreement, but the other indices suggested no agreement beyond chance. With moderate to high variance, it makes far less difference which index is used. Therefore, Jones and his colleagues argued, a proper interpretation of the different indices, including whether they are large enough, requires an understanding of the actual distributions of the data (1983).

It should be evident by now that the indices are not directly comparable, even though all but AR have the same range of possible values. It is therefore essential that synthesists report not only their IRR values, but the indices used to compute them. To give the reader the full picture, reporting the marginal distributions of the ratings has also been suggested, particularly when they are very low or very high (1981).

10.4.1.6 Assessing Coder Drift Whatever index of IRR is chosen, the synthesist cannot assume that IRR will remain stable throughout the coding period, particularly when the number of studies to code is large. As Gregg Jackson noted, coder instability arises because many hours are required, often over a period of months, to code a set of studies (1980). When the coding is lengthy, IRR may change over time, and a single assessment may not be adequate.

Orwin and Cordray assessed coder drift in their reanalysis of the 1980 Smith, Glass, and Miller psychotherapy synthesis (1985). Before coding, the twenty-five reports in the main sample were randomly ordered, with coders instructed to adhere to the resulting sequence. It was assumed, given the random ordering, that any trend in IRRs over reports would be attributable to changes in coders over time (for example, practice or boredom).14 Equating for order permits this trend to be detected and, if desired, removed.
used. Therefore, Jones and his colleagues argued, a proper Following the completion of coding, a per case agree-
interpretation of the different indicesincluding whether ment rate was computed, consisting of the number of
they are large enoughrequires an understanding of the variables agreed on divided by the number of variables
actual distributions of the data (1983). coded.15 In itself, per case agreement rate is not particu-
It should be evident by now that the indices are not larly meaningful because it weights observations on all
directly comparable, even though all but AR have the variables equally and is sensitive to the particular subset
same range of possible values. It is therefore essential of variables (considered here as sampled from a larger
that synthesists report not only their IRR values, but the domain) constituting the coding form. Change in this in-
indices used to compute them. To give the reader the full dicator over time is meaningful, however, because it
picture, it would also be wise to include information about speaks to the stability of agreement. Per case agreement
the raters base rates, as William Grove and his colleagues rate was plotted against case sequence, and little if any
EVALUATING CODING DECISIONS 193

trend was observed. The absence of significant correla- they coded whether in their view the information was sig-
tion between the two variables substantiated this conclu- nificantly distorted by the patients misrepresentation or
sion. In regression terms, increasing the case sequence by the patients inability to understand.
one resulted in a decrease of the agreement rate by less Two early meta-analyses, one that integrates the litera-
than one-tenth of 1 percent (b  .09). From this exer- ture on the effects of psychotherapy (Smith and Glass
cise it was concluded that further consideration of de- 1977; Smith, Glass, and Miller 1980) and another de-
trending IRRs was unnecessary in our synthesis. We do voted to the influence of class size on achievement (Glass
not presume, though, to generalize this finding to other and Smith 1979), are used to illustrate how some of the
syntheses. Coder drift will be more of a concern in some preceding issues can be assessed. For both, the data files
syntheses than in others and needs to be assessed on a were obtained and subjected to reanalysis (Cordray and
case-by-case basis. Orwin 1981; Orwin and Cordray 1985). In examining the
The drift assessment could also be integrated into the primary studies, these authors were careful to note (for
main IRR assessment with a G theory approach. The one- some variables) how the data were obtained. For exam-
facet G study with coders as the facet could be expanded ple, in the class size study, the actual number of students
to a two-facet G study with time (if measured continu- enrolled in each class was not always reported directly, so
ously) or occasions (if measured discretely) as the second the values recorded in the synthesis were not always
facet. This would permit the computation of a single IRR based on equally accurate information. In response to
coefficient that captured the variance contributed by each this, Glass and Smith included a second variable that
facet, as well as by their interaction. scored the perceived accuracy of the numbers recorded.
Unfortunately, the accuracy scales were never used in the
10.4.2 Confidence Ratings analysis (Gene Glass, personal communication, August
1981). Similarly, in the psychotherapy study, source of
10.4.2.1 Rationale As a strategy for evaluating cod- IQ and confidence of treatment classifications were also
ing decisions, IRR is indirect. Directly, it assesses only coded and not used.
coder disagreement, which in turn is an indicator of coder The synthesis of class size and achievement compared
uncertainty, but a flawed one. It is desirable, therefore, to smaller class and larger class on achievement (Glass and
seek more direct methods of assessing the accuracy of Smith 1979). After assigning each comparison an effect
the numbers gleaned from reports. Tagging each number size, the authors regressed effect size on three variables:
with a confidence rating is one such method. It is based the size of the smaller class (S ), the size of the smaller
on the premise that questionable information should not class squared (S2 ), and the difference between the larger
be discarded, and should not be allowed to freely mingle class and smaller class (L S ). This procedure was done
with less questionable information, as is usually done. for the total database and for several subdivisions of the
The former procedure wastes information, which, though data, one of which was well controlled versus poorly con-
flawed, may be the best available on a given variable, trolled studies. Cordray and Orwin reran the regressions
whereas the latter injects noise at best and bias at worst as originally reported, except that a dummy variable for
into a system already beset by both problems. With con- overall accuracy was added to the equations (1981). The
fidence ratings, questionable information can be de- results are presented in table 10.5. Most remarkable is that
scribed as such, and both data point and descriptor en- accuracy made its only nontrivial contribution in well-
tered into the database. controlled studies. A possible explanation is that differen-
In other contexts, confidence ratings and facsimiles tial accuracy in the reporting of class sizes is only one
(for example, adequacy ratings, certainty ratings) have of many method factors operating in poorly controlled
been around for some time. For example, in their interna- studies (and in the entire sample, as well-controlled stud-
tional comparative political parties project, Kenneth ies are a minority) and that these mask the influence of
Janda constructed an adequacy-confidence scale to evalu- accuracy by interacting with it in unknown ways or by
ate data quality (1970). Similarly, it is not unusual to find inflating error variance. In any event, the considerable in-
interviewer confidence ratings embedded in question- fluence of accuracy in well-controlled studies suggested
naires. In the Addiction Severity Index, for example, in- that it does matter, at least under some circumstances.
terviewers provided confidence ratings of the accuracy of In the synthesis of psychotherapy outcomes described
patient responses (McLellan et al. 1988). Specifically, earlier, an identical simultaneous multiple regression
194 CODING THE LITERATURE

Table 10.5 Comparison of R 2 for Original Glass and Smith (1979) Class Size
Regression (R12) and Original with Accuracy Added (R22)

R12 R22 R22  R12

Total sample (n  699) .1799 .1845 .0046, F(1,694)  3.94*


Well-controlled studies (n  110) .3797 .4273 .0476, F(1,105)  8.73*
Poorly controlled studies (n  338) .0363 .0369 .0006, F(1,333)  0.65

source: Cordray and Orwin 1981.


* p  .05

analysis was run within each treatment class and subclass, individual variables. For example, the outcome measures
but with a different orientation. Rather than specifying that are easiest to classify could also be the most reliable.
the treatment in the model and crosscutting by nontreat- In such circumstance, the outcome type-effect size rela-
ment characteristics, as with class size, the nontreatment tionship should be stronger for high confidence observa-
characteristics (diagnosis, method of client solicitation, tions simply because the less reliable outcome types have
and so on) were included in the model. Orwin and Cor- been weeded out. An explanation like this can be plausible
dray recoded twenty variables and found that across for individual variables, but cannot account for the gener-
items, high confidence proportions ranged from 1.00 for ally consistent and nontrivial strengthening of observed
comparison type and location down to .09 for therapist relationships throughout the system. Each study in the
experience (1985; for selection criteria, see Orwin 1983b). sample has its own pattern of well-reported and poorly
Across studies, the proportion of variables coded with reported variables; there was not a particular subset of
certainty or near certainty ranged from 52 to 83 percent. studies, presumably different from the rest, that chroni-
At the opposite pole, the proportion of guesses ranged cally fell out when high confidence observations were iso-
from 25 to 0 percent. lated. Consequently, any alternative explanation must be
The effect of confidence on reliabilities was dramatic; individually tailored to a particular relationship. Postulat-
for example, the mean AR for high-confidence observa- ing a separate ad hoc explanation for each relationship
tions more than doubled that for low-confidence observa- would be unparsimonious to say the least.
tions (.92 vs. .44). As to the effect of confidence on rela- 10.4.2.2 Empirical Distinctness from Reliability It
tionships between variables, 82 percent of the correlations was argued previously that confidence judgments provide
increased in absolute value when observations rated with a more direct method of evaluating coding decisions than
low and medium confidence were removed. Moreover, all does IRR. It is therefore useful to ask whether this con-
correlations with effect size increased. The special im- ceptual distinctness is supported by an empirical distinct-
portance of these should be self-evident; the relationships ness. If the information provided by confidence judgments
between study characteristics and effect size were a major simply duplicated that provided by reliabilities, the case
focus of the original study (and are in research synthesis for adopting both strategies in the same synthesis would
generally). The absolute values of correlations increased be weakened.
by an average of .15 (.21  .36), or in relative terms, 71 Table 10.6 presents agreement rates by level of confi-
percent. dence for the complete set of variables (K  25) to which
Questions posed by these findings include how credible confidence judgments were applied. The table indicates
the lack of observed relationship between therapist expe- that interrater agreement is neither guaranteed by high
rience and effect size can be when therapist experience is confidence nor precluded by low confidence. Yet it also
extracted from only one out of ten reports with confidence. shows that confidence and agreement are associated.
Even for variables averaging considerably higher confi- Whereas table 10.6 shows the nonduplicativeness of con-
dence ratings, the depression of IRRs and consequent at- fidence and agreement on a per variable basis, table 10.7
tenuation of observed relationships between variables is shows this nonduplicativeness across variables. To enable
evident. There are plausible alternative explanations for this, variables were first rank ordered by the proportion of
EVALUATING CODING DECISIONS 195

Table 10.6 Agreement Rate by Level of Confidence

Low Medium High

Experimenter affiliation 1.00 (14) 1.00 (37) .95 (75)


Blinding .83 (6) .91 (66) .93 (44)
Diagnosis 1.00 (12) .99 (114)
Client IQ 1.00 (11) .18 (74) 1.00 (41)
Client age 1.00 (1) .68 (50) .83 (75)
Client source 1.00 (10) .89 (116)
Client assessment .53 (17) .98 (103)
Therapist assessment .00 (15) .67 (18) .75 (93)
Internal validity .52 (27) .88 (93)
Treatment mortality .00 (4) 1.00 (1) .94 (121)
Comparison mortality .00 (4) 1.00 (1) .93 (121)
Comparison type 1.00 (126)
Control group type .00 (9) .69 (114)
Experimenter allegiance .10 (10) .64 (36) 1.00 (80)
Modality .33 (4) 1.00 (122)
Location 1.00 (126)
Therapist experience .55 (55) .56 (57) 1.00 (11)
Outcome type .00 (1) .00 (11) .92 (114)
Follow-up .00 (2) .92 (47) .83 (75)
Reactivity .83 (6) .28 (65) .94 (47)
Client participation 1.00 (1) .67 (3) 1.00 (122)
Setting type .40 (15) .99 (111)
Treatment integrity .60 (10) .83 (60) 1.00 (56)
Comparison group contamination .30 (37) .32 (50) 1.00 (39)
Outcome Rxx .13 (150) .56 (70) .92 (38)

source: Orwin 1983b.


note: Selected Variables from the Smith, Glass, and Miller (1980) Psychotherapy Meta-Analysis (n  126)
(sample sizes in parenthesis)

observations in which confidence was judged as high. As Alternative schemes are of course possible. One might
is clear in the table, the correlation between the two sets use two confidence judgments per data pointone rating
of rankings fell far short of perfect, regardless of the reli- confidence in the accuracy or completeness of the infor-
ability estimate selected. In sum, the conceptual distinct- mation as reported and the other rating confidence in the
ness between reliability and confidence is supported by coding interpretation applied to that information. The use
an empirical distinctness. of two or more such ratings explicitly recognizes multiple
10.4.2.3 Methods for Assessing Coder Confidence sources of error in the coding process (as described in sec-
There is no set method for assessing confidence. Orwin tion 10.2) and makes some attempt to isolate them. More
and Cordray used a single confidence item for each vari- involved schemes, such as Jandas scheme might also
able (1985). This is not to imply that confidence is unidi- be attempted, particularly if information is being sought
mensional; no doubt it is a complex composite of numer- from multiple sources (1970). As noted earlier, Janda con-
ous factors. These factors can interact in various ways structed an adequacy-confidence scale to evaluate data
to affect the coders ultimate confidence judgment, but quality. The scale was designed to reflect four factors con-
Orwin and Cordray did not attempt to spell out rules for sidered important in determining the researchers belief in
handling the many contingencies that arise. Confidence the accuracy of the coded variable values: the number of
judgments reflected the overall pattern of information as sources providing relevant information for the coding de-
deemed appropriate by the coder. cision, the proportion of agreement to disagreement in the
196 CODING THE LITERATURE

Table 10.7 Spearman Rank Order Correlations 10.4.3 Sensitivity Analysis


rRHO 10.4.3.1 Rationale Sensitivity analysis can assess ro-
bustness and bound uncertainty. It has been a part of re-
All Variables (K  25) search synthesis at least since the original Smith and
Agreement rate .71
Glass psychotherapy study (1977). Glasss position of in-
a
Variables for Which Kappa was Computed cluding methodologically flawed studies was attacked by
(K  20) his critics as implicitly advocating the abandonment of
Agreement rate .71 critical judgment (for example, Eyesenck 1978). He re-
Kappa .62
butted these and related charges on multiple grounds, but
Variables for Which Intercoder Correlation was the most enduring was the argument that meta-analysis
Computed (K  15) does not ignore methodological quality, but rather pres-
Agreement rate .81 ents a way of determining empirically whether particular
Intercoder rate .67
methodological threats systematically influence outcomes.
Variables for Which All Three Estimates Were Glasss indicators of quality (for example, the three-point
Computed (K  10) internal validity scale) were crude then and appear even
Agreement rate .73 cruder in retrospect, and may not have been at all suc-
Kappa .79
cessful in what they were attempting to do.16 The princi-
Intercoder correlation .66
ple, howeverthat of empirically assessing covariance
source: Authors compilation. of quality and research findings rather than assuming it a
note: Between Confidence and Interrater Agreement for Selected prioriwas perfectly sensible and consistent with norms
Variables from the Smith, Glass, and Miller (1980) Psychotherapy of scientific inquiry. In essence, the question was whether
Meta-Analysis
research findings were sensitive to variations in method-
ological quality. If not, lesser-quality studies can be ana-
lyzed along with high-quality studies, with consequent
information reported by different sources, the degree of increase in power, generalizability, and so on. The worst-
discrepancy among sources when disagreement exists, case scenariothat lesser-quality studies produce sys-
and the credibility attached to the various sources of in- tematically different results and cannot be usedis no
formation. An adequacy-confidence value was then as- worse than had the synthesist followed the advice of the
signed to each recorded variable value. critics and excluded those studies a priori.
Along with the question of what specific indicators of The solutions of Rosenthal and Orwin to the so-called
confidence to use is the question of how to scale them. file-drawer problem (that is, publication bias), as well
Orwin and Cordray pilot tested a five-point scale (1  as more recent approaches are in essence types of sensi-
low, . . . , 5  high) for each confidence rating, fashioned tivity analysis as well (Rosenthal 1978; Orwin 1983a;
after the Smith, Glass, and Miller confidence of treat- see chapter 23, this volume; Rothstein, Sutton, and Bo-
ment classification variable (1985). Analysis of the pilot renstein 2005). They test the datas sensitivity to the
results revealed that five levels of confidence were not violation of the assumption that the obtained sample of
being discriminated. Specifically, the choices of 2 versus studies is an unbiased sample of the target population.
3 and 3 versus 4 seemed to be made arbitrarily. The five While some level of sensitivity analysis has always been
categories were then collapsed into three for the main part of research synthesis, much more could have been
study. In addition, each category was labeled with a ver- done. In their synthesis of eighty-six meta-analyses of
bal descriptor to concretize it and minimize coder drift: randomized clinical trials, Harold Sacks and his col-
3  certain or almost certain, 2  more likely than not, leagues found that only fourteen presented any sensitivity
1  guess (see Kazdin 1977). Discrepancies were re- analyses, defined as data showing how the results vary
solved through discussion. The three-point confidence through the use of different assumptions, tests, and crite-
scale was simple yet adequate for its intended purpose ria (1987, 452). In the corpus of nineteen meta-analyses
to establish a mechanism for discerning high-quality published in Psychological Bulletin in 2006, fewer than
from lesser-quality information in the conduct of subse- half (eight) discussed sensitivity analyses related to pub-
quent analyses. lication bias.
EVALUATING CODING DECISIONS 197

The general issue of sensitivity analysis in research multiple estimates of IRR for each variable (1985). For
synthesis is discussed in chapter 22, this volume. This continuous variables, AR and r were computed (rI was
chapter focuses on applying the logic of sensitivity analy- not computed, but should have been). For ordinal cate-
sis to the evaluation of coding decisions. gorical variables, AR, r, and Kw were computed. For nom-
10.4.3.2 Multiple Ratings of Ambiguous Items The inal categorical variables, AR and K were computed; when
sources of error identified earlier (for example, deficient nominal variables were dichotomous, r (in the form of
reporting, ambiguities of judgment) are not randomly phi) was also computed. Four regressions were then run,
dispersed across all variables. The synthesist frequently each using a different set of reliability estimates. The first
knows at the outset which variables will be problematic. used the highest estimate available for each variable. Its
If not, a well-designed pilot test will identify them. For purpose was to provide a lower bound on the amount of
those variables, multiple ratings should be considered. change produced by disattenuation. The second and third
Like multiple measures in other contexts, multiple ratings runs were successively more liberal (for details, see
help guard against biases stemming from a particular way Orwin and Cordray 1985). A final run, intended as the
of assessing the phenomenon (monomethod bias) and can most liberal credible analysis, disattenuated the criterion
set up sensitivity analyses to determine if the choice of variable (effect size) as well as the predictors. The reli-
rating makes any difference. Mary Smiths synthesis of ability estimates for the four runs are shown in table 10.8.
sex bias is an example (1980). When an investigator re- 10.4.3.4 Isolating Questionable Variables Both IRRs
ported only their significant effect, Smith entered an ef- and confidence ratings are useful tools for flagging vari-
fect size of 0 for the remaining effects. Aware of the pit- ables that may be inappropriate for use as meta-analytic
falls of this, she alternately used a different procedure and predictors of effect size. In the Orwin and Cordray study,
deleted these cases. Neither changed the overall mean ef- for example, the therapist experience variable was coded
fect size by more than a small fraction. In their synthesis with high confidence in only 9 percent of the studies, was
of sex differences, Alice Eagly and Linda Carli took the guessed in 45 percent, and had an IRR (using Pearson
problem of unreported nonsignificant effects a step fur- r) of .56 (1985). Such numbers would clearly suggest the
ther (1981). Noting that the majority of these (fifteen of exercise of caution in further use of that variable. Con-
sixteen) tended in the positive (female) direction, they ducting analyses with and without it, and comparing the
reasoned that setting effect size at 0 would lead to an un- results, would be a logical first step.
derestimation of the true mean effect size, whereas delet- With regard to both cases and variables, the finding of
ing them would lead to overestimation. They therefore significant differences between inclusion and exclusion
did both to be confident of at least bracketing the true does not automatically argue for outright exclusion;
value. They also used two indicators of researchers sex rather, it alerts the analyst that mindless inclusion is not
(an important variable in this area), overall percentage of warranted. As in primary research, the common practices
male authors and sex of first author. These were found to of dropping cases, dropping variables, and working from
differ only slightly in their correlations with outcome. a missing-data correlation matrix result in loss of infor-
Each of these examples illustrates sensitivity analysis di- mation, loss of statistical power and precision, and biased
rected at specific rival hypotheses. It should be evident estimates when, as is frequently the case, the occurrence
that many opportunities exist for thoughtful sensitivity of missing data is nonrandom (Cohen and Cohen 1975).
analyses in the evaluation of coding decisions and that Therefore, more sophisticated approaches are preferable
these do not require specialized technical expertise; con- when feasible (for more on the treatment of missing data
scientiousness and common sense will frequently suffice. in research synthesis, see chapter 21, this volume).
10.4.3.3 Multiple Measures of Interrater Agree-
ment As described earlier, the choice of IRR index can 10.5 SUGGESTIONS FOR FURTHER RESEARCH
have a significant effect on the reliability estimate ob-
tained. Although the guidelines in section 10.4.1.5 should 10.5.1 Moving from Observation to Explanation
narrow the range of choices, they may not uniquely iden-
tify the best one. Computing multiple indices is therefore As Matt concluded, synthesists need to become more
warranted. aware, and take deliberate account, of the multiple sources
In their reanalysis of the Smith, Glass, and Miller psy- of uncertainty in reading, understanding, and interpreting
chotherapy data (1980), Orwin and Cordray computed research reports (1989). But, in keeping with the scientific
198 CODING THE LITERATURE

Table 10.8 Reliability Estimates

Variable Run 1 Run 2 Run 3 Run 4

Diagnosis: neurotic, phobic, or depressive .98 .98 .89 .89


Diagnosis: delinquent, felon, or habitue 1.00 1.00 1.00 1.00
Diagnosis: psychotic 1.00 1.00 1.00 1.00
Clients self-presented .97 .57 .71 .71
Clients solicited .93 .86 .81 .81
Individual therapy 1.00 1.00 .85 .85
Group therapy .98 .96 .94 .94
Client IQ .69 .69 .60 .60
Client agea .99 .99 .91 .91
Therapist experience  neurotic diagnosis .76 .75 .70 .70
Therapist experience  delinquent diagnosis 1.00 1.00 1.00 1.00
Internal validity .76 .71 .42 .42
Follow-up timeb .99 .99 .95 .95
Outcome typec .87 .70 .76 .76
Reactivityd .57 .56 .57 .57
ES 1.00 1.00 1.00 .78

source: Orwin and Cordray 1985.


notes: Reliability-Corrected Regression Runs on the Smith, Glass, and Miller (1980) Psychotherapy Data
aTransformed age  (age  25)( | age  25|)1/2.
bTransformed follow-up  (follow-up)1/2.
c Other category removed for purpose of dichotomization.
dTransformed reactivity  (reactivity)2.25.

endeavor principle, they also need to begin to move be- like effect size, interim codes should be recorded. One
yond observation of these phenomena and toward expla- code might simply indicate what was reported in the orig-
nation. For example, further research on evaluating cod- inal study. A second code might indicate what decision
ing decisions in research synthesis should look beyond rule was invoked in further interpretation of the data
how to assess IRR toward better classification and under- point. A third coderepresenting the endpoint and the
standing of why coders disagree. Evidence from other variable of interest to the synthesis (effect size, for ex-
fields could probably inform some initial hypotheses. ample)would be the number that results from the cod-
Horwitz and Yu conducted detailed analyses of errors in ers application of that interpretation. Coder disagreement
the extraction of epidemiological data from patient re- on the third code would no longer be a black box, but
cords and suggested a rough taxonomy of six error cate- could be readily traced to its source. Coder disagreement
gories (1984). Of the six, four were errors in the actual could then be broken down to its component parts (that is,
data extraction, and two were errors in the interpretation the errors could be partitioned by source), facilitating di-
of the extracted data. The two interpretation errors were agnosis much like a failure analysis in engineering or
incorrect interpretation of criteria and correct interpreta- computer programming. The detailed documentation of
tion of criteria, but inconsistent application. Note the coding decision rules could be made for examination and
similarity between these types of errors and the types of reanalysis by other researchers (much like the list of in-
judgment errors described in section 10.2 with regard to cluded studies is currently), including any estimation
coding in research synthesis. procedures used for missing or incomplete data (for an
When the coder must make judgment calls, the ratio- example, see Bullock and Svyantek 1985). Theoretical
nale for each judgment should be documented. In this models could also be tested and, most important, more
way, we can begin to formulate a general theory of the informed strategies for reducing and controlling for error
coder judgment process. For complex yet critical variables rates would begin to emerge.
EVALUATING CODING DECISIONS 199

10.5.2 Assessing Time and Necessity will be universally present or absent, but rather will inter-
act with the topic area (for example, coder gender is more
The coding stage of research synthesis is time-consuming likely to matter when the topic is sex differences than
and tedious, and in many instances additional efforts to when it is acid rain) or with other factors. Research could
evaluate coding decisions further lengthen and compli- shed more light on these interactions as well. In sum, ad-
cate it. The time and resources of practicing synthesists is ditional efforts to assess both the time and need for spe-
limited, and they would likely want answers to two ques- cific methods of evaluating coding decisions could sig-
tions before selecting a given strategy: How much time nificantly help the practicing synthesist make cost-effective
will it take? How necessary is it? choices in the synthesis design.
The time issue has not been systematically studied.
Orwin and Cordray included a small time-of-task side 10.6 NOTES
study and concluded that the marginal cost in time of
augmenting the task as they did (double codings plus 1. It should be noted that significant reporting deficien-
confidence ratings) was quite small given the number of cies, and their deleterious effect on research synthesis,
additions involved and the distinct impact they had on are not confined to social research areas (see, for ex-
the analytic results (1985). In part, this is because though ample, synthesis work on the effectiveness of coronary
double coding a study approximately doubles the time artery bypass surgery in Wortman and Yeaton 1985).
required to code it, in an actual synthesis only a sample 2. For example, Oliver (1987) reports on a colleague who
of studies may be double coded. In a synthesis the size of attempted to apply a meta-analytic approach to re-
Smith, Glass, and Millers, the double coding of twenty- search on senior leadership, but gave up after finding
five representative studies would increase the total num- that only 10 percent of the eligible studies contained
ber of studies coded by only 5 percent. When a particular sufficient data to sustain the synthesis.
approach to reliability assessment leads to low occur- 3. That detailed criteria for decisions fail to eliminate
rence rates or restricted range on key variablesas can coder disagreement should not be surprising, even if
happen with Yeaton and Wortmans hierarchical ap- all known contingencies are incorporated (for an ana-
proach (1991)oversampling techniques can be used. log from coding patient records in epidemiology, see
Research that estimated how often oversampling was re- Horwitz and Yu 1984).
quired would in itself be useful. Other possibilities are 4. Specifically, this rule aims to exclude effect sizes
easy to envision. In short, this is an area of considerable based on redundant outcome measures. A measure
practical importance that has not been seriously exam- was judged to be redundant if it matched another in
ined to date. outcome type, reactivity, follow-up time, and magni-
The question of necessity could be studied the same tude of effect.
way that methodological artifacts of primary research 5. The Chalmers et al. (1987) review was actually a re-
have been studied. For example, a handful of syntheses view of replicate meta-analyses rather than a meta-
have used coder masking techniques, as described in sec- analysis per se (that is, the individual studies were
tion 10.2. Keeping coders unaware of hypotheses or other meta-analyses), but the principle is equally applicable
information is an additional complication that adds to the to meta-analysis proper.
cost of the effort and may have unintended side effects. 6. Here, and elsewhere in the chapter, we rely on meta-
An empirical test of the effect of selective masking on analytic work in the most recent complete year of
research synthesis coding would be a useful contribution, Psychological Bulletin as a barometer of current prac-
as such studies have been in primary research, to deter- tice.
mine whether the particular artifact that masking is in- 7. Such improvement is quite plausible. For example,
tended to guard against (coder bias) is real or imaginary, Orwin (1983b) found that measures of reporting quality
large or trivial, and so forth. As was the case with pretest were positively correlated with publication date in the
sensitization and Hawthorne effects in the evaluation lit- Smith, Glass, and Miller (1980) psychotherapy data.
erature (Cook and Campbell 1979), research may show 8. The rationale for using multiple estimates is discussed
that some suspected artifacts in synthesis coding are only in section 10.4.3.3.
thatperfectly plausible yet not much in evidence when 9. There is nothing anomalous about the present coders
subjected to empirical scrutiny. In reality, few artifacts relative lack of agreement on internal validity; lack of
200 CODING THE LITERATURE

consensus on ratings of research quality is common- Rennie, Kennteh F. Schulz, David Simel, and Donna F.
place (for an example within meta-analysis, see Stock Stroup. 1996. Improving the Quality of Reporting of
et al. 1982). Randomised Controlled Trials. The CONSORT State-
10. This desirable characteristic has led some writers ment. Journal of the American Medical Association
(for example, Hartmann 1977) to advocate extend- 276(8): 63739.
ing the use of r, in the form of the phi coefficient Bullock, R. J., and Daniel J. Svyantek. 1985. Analyzing
(), to reliability estimation of dichotomous items: Meta-Analysis: Potential Problems, an Unsuccessful Re-
 (BC  AD) / ((A  B)(C  D)(A  C)(B  D))1/2, plication, and Evaluation Criteria. Journal of Applied Psy-
where A, B, C, and D represent the frequencies in the chology 70(1): 10815.
first through fourth quadrants of the resulting 2  2 Buros, Oscar K. 1978. Mental Measurements Yearbook,
table. 8th ed. Highland Park, N.J.: Gryphon Press. Available at
11. This phenomenon is easily demonstrated with the http://www.unl.edu/buros/bimm/html/catalog.html.
illustrative data; the details are left to the reader as an Burton, Nancy W. 1981. Estimating Scorer Agreement for
exercise. Nominal Categorization Systems. Educational and Psy-
12. The general forms of the equations are chological Measurement 41(4): 95361.
Carey, Gregory, and Irving I. Gottesman. 1978. Reliability
BMS  WMS
r1(design 1)  ________________
BMS  (k  1)WMS
, and Validity in Binary Ratings: Areas of Common Misun-
BMS  EMS derstanding in Diagnosis and Symptom Ratings. Archives
r1(design 1)  ________________________________
BMS  (k  1)EMS  k(CMS  EMS)/n
,
of General Psychiatry 35(12): 145459.
BMS  EMS Chalmers, Thomas C., Jayne Berrier, Henry S. Sacks,
r1(design 1)  ________________
BMS  (k  1)EMS
,
Howard Levin, Dinah Reitman, and Raguraman Nagalin-
where k is the number of coders rating each study gham. 1987. Meta-Analysis of Clinical Trials as a Scienti-
and n is the number of studies. fic Discipline, II. Replicate Variability and Comparison of
13. rI is closely related to the coefficient of generalizabil- Studies that Agree and Disagree. Statistics in Medicine
ity and index of dependability used in G theory (Shav- 6(7): 73344.
elson and Webb 1991). Cicchetti, Domenic V. 1985. A Critique of Whitehursts In-
14. Training and piloting were completed before the terrater Agreement for Journal Manuscript Reviews: De
rater drift assessment began, to avoid picking up ob- Omnibus Disputandem Est. American Psychologist 40(5):
vious training effects. 56368.
15. The term case is used rather than study because a Cicchetti, Dominic V., and Sara S. Sparrow. 1981. Deve-
study can have multiple cases (one for each effect loping Criteria for Establishing the Interrater Reliability of
size). The term observation refers to the value as- Specific Items in a Given Inventory. American Journal of
signed to a particular variable within a case. Mental Deficiency 86: 12737.
16. Chapter 10 shows how the state of the art has evolved Cohen, Jacob. 1960. A Coefficient of Agreement for Nomi-
(see also U.S. General Accounting Office 1989, nal Scales. Educational and Psychological Measurement
appendix II). 20(1): 3746.
Cohen, Jacob. 1968. Weighted Kappa: Nominal Scale
10.7 REFERENCES Agreement with Provision for Scaled Disagreement or Par-
tial Credit. Psychological Bulletin 70(4): 21320.
American Psychiatric Association, Committee on Nomen- Cohen, Jacob, and Patricia Cohen. 1975. Applied Multiple
clature and Statistics. 1980. Diagnostic and Statistical Regression / Correlation Analysis for the Behavioral Scien-
Manual of Mental Disorders, 3rd ed. Washington, D.C.: ces. Hillsdale, N.J.: Lawrence Erlbaum.
American Psychiatric Association. Cook, Thomas D., and Donald T. Campbell. 1979. Quasi-
Andrs, A.Martn, and P. Fernina Marzo. 2004. Delta: A Experimentation: Design and Analysis Issues for Field Set-
New Measure of Agreement Between Two Raters. British tings. Chicago: Rand McNally.
Journal of Mathematical and Statistical Psychology 57(1): Cordray, David S., and Robert G. Orwin. 1981. Technical
119. Evidence Necessary for Quantitative Integration of Research
Begg, Colin., Mildred Cho, Susan Eastwood, Richard Hor- and Evaluation. Paper presented at the joint conference of
ton, David Moher, Ingram Olkin, Roy Pitkin, Drummond the International Association of Social Science Information
EVALUATING CODING DECISIONS 201

Services and Technology and the International Federation Higgins, Julian P.T., and Sally Green, eds. 2005. Cochrane
of Data Organizations. Grenoble, France (September 1981). Handbook for Systematic Reviews of Interventions, ver.
Cordray, David S., and L. Joseph Sonnefeld. 1985. Quanti- 4.2.5. Chichester: John Wiley & Sons. http://www.cochrane
tative synthesis: An actuarial base for planning impact .org/resources/handbook/hbook.html.
evaluations. In Utilizing Prior Research in Evaluation Horwitz, Ralph I., and Eunice C. Yu 1984. Assessing the
Planning, edited by David S. Cordray. New Directions for Reliability of Epidemiological Data Obtained from Medi-
Program Evaluation No. 27. San Francisco: Jossey-Bass. cal Records. Journal of Chronic Disease 37(11): 82531.
Cronbach, Lee J., G. C. Gleser, H. Nanda, and N. Rajarat- Hyde, Janet S. 1981. How Large Are Cognitive Gender Dif-
nam. 1972. Dependability of Behavioral Measurements. ferences? A Meta-Analysis Using W and D. American
New York: John Wiley & Sons. Psychologist 36(8): 892901.
Eagly, Alice H., and Carli, Linda L. 1981. Sex of Resear- Jackson, Gregg B. 1980. Methods for Integrative Reviews.
chers and Sex-Typed Communications as Determinants of Review of Educational Research 50(3): 43860.
Sex Differences in Influenceability: A Meta-Analysis of Janda, Kenneth 1970. Data Quality Control and Library
Social Influence Studies. Psychological Bulletin 90(1): Research on Political Parties. In A Handbook of Method in
120. Cultural Anthropology, edited by R. Narall and R. Cohen.
Ebel, Robert L. 1951. Estimation of the Reliability of New York: Doubleday.
Ratings. Psychometrika 16(4): 40724. Janes, Cynthia L. 1979. Agreement Measurement and the
Eyesenck, Hans. J. 1978. An Exercise in Mega-Silliness. Judgment Process. Journal of Nervous and Mental Disor-
American Psychologist 33(5): 517. ders 167(6): 34347.
Fleiss, Jospeh L. 1971. Measuring Nominal Scale Agree- Jones, Allan P., Lee A. Johnson, Mark C. Butler, and
ment Among Many Raters. Psychological Bulletin 76(5): Deborah S. Main 1983. Apples and Oranges: An Empiri-
37882. cal Comparison of Commonly Used Indices of Interrater
Fleiss, Joseph L. 1981. Statistical Methods for Rates and Agreement. Academy of Management Journal 26(3):
Proportions, 2nd ed. New York: John Wiley & Sons. 50719.
Fleiss, Joseph L., and Jacob Cohen. 1973. The Equivalence Journal Article Reporting Standards Working Group. 2007.
of Weighted Kappa and the Intraclass Correlation Coeffi- Reporting Standards for Research in Psychology: Why Do
cient as Measures of Reliability. Educational and Psycho- WE Need Them? What Might They Be? Report to the
logical Measurement 33(3): 61319. American Psychological Association Publications & Com-
Fleiss, Joseph L., Jacob Cohen, and B. S. Everitt. 1969. munications Board. Washington, D.C.: American Psycho-
Large Sample Standard Errors of Kappa and Weighted logical Association.
Kappa. Psychological Bulletin 72(5): 32327. Kazdin, Alan E. 1977. Artifacts, Bias, and Complexity of
Glass, Gene V., and Mary L. Smith. 1979. Meta-Analysis of Assessment: The ABCs of research. Journal of Applied
Research on the Relationship of class Size and Achieve- Behavior Analysis 10(1): 14150.
ment. Educational Evaluation and Policy Analysis 1(1): Kerlinger, Fred N., and Elazar E. Pedhazur. 1973. Multiple
216. Regression in Behavioral Research. New York: Holt, Rine-
Green, Bert F., and Judith A. Hall. 1984. Quantitative hart and Winston.
Methods for Literature Reviews. Annual Review of Psy- Kulik, James A., Chen-lin C. Kulik, and Peter A. Cohen.
chology 35: 3753. 1979. Meta-Analysis of Outcome Studies of Kellers Per-
Grove, William M., Nancy C. Andreasen, Patricia Mc- sonalized System of Instruction. American Psychologist
Donald-Scott, Martin B. Keller, and Robert W. Shapiro 1981. 34(4): 30718.
Reliability Studies of Psychiatric Diagnosis: Theory and Light, Richard J. 1971. Measures of Response Agreement
Practice. Archives of General Psychiatry 38(4): 40813. for Qualitative Data: Some Generalizations and Alternati-
Hartmann, Donald P. 1977. Considerations in the Choice of ves. Psychological Bulletin 76: 36577.
Interobserver Reliability Estimates. Journal of Applied Light, Richard J., and David B. Pillemer. 1984. Summing
Behavior Analysis 10(1): 10316. Up: The Science of Reviewing Research. Cambridge, Mass.:
Hays, William L. 1994. Statistics, 5th ed. Fort Worth, Tex.: Harvard University Press.
Harcourt Brace College. Lord, Frederic M., and Melvin R. Novick. 1968. Statistical
Hedges, Larry V., and Ingram Olkin. 1985. Statistical Methods Theories of Mental Test Scores. Reading, Mass.: Addison-
for Meta-Analysis. Orlando, Fl.: Academic Press. Wesley.
202 CODING THE LITERATURE

Matt, Georg E. 1989. Decision Rules for Selecting Effect Scott, William A. 1955. Reliability of Content Analysis:
Sizes in Meta-Analysis: A Review and Reanalysis of Psy- The Case of Nominal Scale Coding. Public Opinion Quar-
chotherapy Outcome Studies. Psychological Bulletin terly 19(3): 32125.
105(1): 10615. Selvage, Rob. 1976. Comments on the Analysis of Variance
McGuire, Joanmarie., Gary W. Bates, Beverly J. Dretzke, Strategy for the Computation of Intraclass Reliability. Edu-
Julia E. McGivern, Karen L. Rembold, Daniel R. Seobold, cational and Psychological Measurement 36(3): 6059.
Betty Ann M. Turpin, and Joel R. Levin. 1985. Methodo- Shapiro, David A., and Diana Shapiro. 1982. Meta-Analysis
logical Quality as a Component of Meta-Analysis. Educa- of Comparative Therapy Outcome Studies: A Replication
tional Psychologist 20(1): 15. and Refinement. Psychological Bulletin 92(3): 581604.
McLellan, A. Thomas, Lester Luborsky, John Cacciola, Shavelson, Richard J., and Noreen M. Webb 1991. Gene-
Jason Griffith, Peggy McGahan, and Charles OBrien. ralizability Theory: A Primer. Newbury Park, Calif.: Sage
1988. Guide to the Addiction Severity Index: Background, Publications.
Administration, and Field Testing Results. Philadelphia, Shrout, Patrick E., and Joseph L. Fleiss. 1979. Intraclass
Pa.: Veterans Administration Medical Center. Correlations: Uses in Assessing Rater Reliability. Psycho-
Moher, David, Kenneth F. Schulz, Douglas G. Altman, and logical Bulletin 86(2): 42028.
CONSORT Group 2001. The CONSORT Statement: Shrout, Patrick E., Spitzer, R. L., and Joseph L. Fleiss. 1987.
Revised Recommendations for Improving the Quality of Quantification of agreement in psychiatric diagnosis revisi-
Reports of Parallel-Group Randomised Trials. Lancet ted. Archives of General Psychiatry 44(2): 17277.
357(9263): 119194. Smith, Mary L. 1980. Sex Bias in Counseling and Psycho-
Nunnally. Jum C. 1978. Psychometric Theory 2nd ed. New therapy. Psychological Bulletin 87: 392407.
York: McGraw-Hill. Smith, Mary L., and Gene V. Glass. 1977. Meta-Analysis of
Oliver, Laurel W. 1987. Research Integration for Psycholo- Psychotherapy Outcome Studies. American Psychologist
gists: An Overview of Approaches. Journal of Applied 32: 75260.
Social Psychology 17(10): 86074. Smith, Mary L., Gene V. Glass, and Thomas I. Miller. 1980.
Orwin, Robert G. 1983a. A Fail-Safe N for Effect Size in The Benefits of Psychotherapy. Baltimore, Md.: Johns Ho-
Meta-Analysis. Journal of Educational Statistics 8(2): pkins University Press.
15759. Spitznagel, Edward L., and John E. Helzer 1985. A Pro-
. 1983b. The Influence of Reporting Quality in Pri- posed Solution to the Base Rate Problem in the Kappa Sta-
mary Studies on Meta-Analytic Outcomes: A Conceptual tistic. Archives of General Psychiatry 42(7): 72528.
Framework and Reanalysis. PhD diss. Northwestern Uni- Stanley, Julian C. 1971. Reliability. In Educational Measu-
versity. rement, 2d ed., edited by R. L. Thorndike. Washington,
. 1985. Obstacles to Using Prior Research and Eva- D.C.: American Council on Education.
luations. In Utilizing Prior Research in Evaluation Plan- Stock, William A., Morris A. Okun, Marilyn J. Haring,
ning, edited by D. S. Cordray. New Directions for Program Wendy Miller, Clifford Kenney, and Robert C. Ceurvost.
Evaluation No. 27. San Francisco: Jossey-Bass. 1982. Rigor in Data Synthesis: A Case Study of Reliability
Orwin, Robert, and David S. Cordray. 1985. Effects of De- in Meta-Analysis. Educational Researcher 11(6): 1020.
ficient Reporting on Meta-Analysis: A Conceptual Fra- Terpstra. David E. 1981. Relationship Between Methodo-
mework and Reanalysis. Psychological Bulletin, 97(1): logical Rigor and Reported Outcomes in Organization
13447. Development Evaluation Research. Journal of Applied
Rothstein, Hannah R., Alexander J. Sutton, and Michael Bo- Psychology 66: 54143.
renstein, eds. 2005. Publication Bias in Meta-Analysis: U.S. General Accounting Office. 1989. Prospective Evalua-
Prevention, Assessment and Adjustments. Chichester: John tion Methods: The Prospective Evaluation Synthesis. GAO/
Wiley & Sons. PEMD Transfer Paper 10-1-10. Washington, D.C.: Govern-
Rosenthal, Robert 1978. Combining Results of Indepen- ment Printing Office.
dent Studies. Psychological Bulletin 85: 18593. Valentine, Jeffrey C., and Harris Cooper. 2008. A Systema-
Sacks, Harold S., Jayne Berrier, Dinah Reitman, Vivette A. tic and Transparent Approach for Assessing the Methodo-
Ancona-Berk, and Thomas C. Chalmers 1987. Meta-Ana- logical Quality of Intervention Effectiveness Research: The
lyses of Randomized Controlled Trials. New England Study Design and Implementation Assessment Device
Journal of Medicine 316(8): 45055. Study DIAD. Psychological Methods 13(2): 13049.
EVALUATING CODING DECISIONS 203

Whitehurst, Grover J. 1984. Interrater Agreement for Jour- Artery Bypass Graft Surgery. Controlled Clinical Trials
nal Manuscript Reviews. American Psychologist 39(1): 6(4): 289305.
2228. Yeaton, William H., and Paul M. Wortman. 1991. On the
Wortman, Paul M., and Fred B. Bryant. 1985. School Dese- Reliability of Meta-Analytic Reviews: The Role of Interco-
gregation and Black Achievement: An Integrative Review. der Agreement. Unpublished manuscript.
Sociological Methods and Research 13(3): 289324. Zwick. Rebecca. 1988. Another Look at Interrater Agree-
Wortman. Paul M., and William H. Yeaton. 1985. Cumula- ment. Psychological Bulletin 103(3): 37478.
ting Quality of Life Results in Controlled Trials of Coronary
11
VOTE-COUNTING PROCEDURES IN META-ANALYSIS

BRAD J. BUSHMAN MORGAN C. WANG


University of Michigan and VU University University of Central Florida

CONTENTS
11.1 Introduction 208
11.2 Counting Positive and Significant Positive Results 208
11.3 Vote-Counting Procedures 208
11.3.1 The Conventional Vote-Counting Procedure 208
11.3.2 The Sign Test 209
11.3.3 Confidence Intervals Based on Vote-Counting Procedures 210
11.3.3.1 Population Standardized Mean Difference  211
11.3.3.2 Population Correlation Coefficient  212
11.4 Combining Vote Counts and Effect Size Estimates 214
11.4.1 Population Standardized Mean Difference  214
11.4.2 Population Correlation Coefficient  215
11.5 Applications of Vote-Counting Procedures 216
11.5.1 Publication Bias 216
11.5.2 Results All in the Same Direction 217
11.5.3 Missing Data 219
11.6 Conclusions 219
11.7 Notes 220
11.8 References 220

207
208 STATISTICALLY DESCRIBING STUDY OUTCOMES

11.1 INTRODUCTION These types are rank ordered from most to least in terms
of the amount of information they contain (Hedges 1986).2
As the number of scientific studies continues to grow, it Effect size procedures should be used for the studies that
becomes increasingly important to integrate the results contain enough information to compute an effect size es-
from these studies. One simple approach involves count- timate (see chapters 12 and 13, this volume). Vote-count-
ing votes. In the conventional vote-counting procedure, ing procedures should be used for the studies that do not
one simply divides studies into three categories: those contain enough information to compute an effect size es-
with significant positive results, those with significant timate but do contain information about the direction and
negative results, and those with nonsignificant results. the statistical significance of results, or that contain just
The category containing the most studies is declared the the direction of results. We recommend that vote-count-
winner. For example, if the majority of studies examining ing procedures never be used alone unless none of the
a treatment found significant positive results, then the studies contain enough information to compute an effect
treatment is considered to have a positive effect. size estimate. Rather, vote-counting procedures should
Many authors consider the conventional vote-counting be used in conjunction with effect size procedures. As we
procedure to be crude, flawed, and worthless (see Fried- describe in section 11.4, effect size estimates and vote-
man 2001; Jewell and McCourt 2000; Lee and Bryk count estimates can be combined to obtain a more precise
1989; Mann 1994; Rafaeli-Mor and Steinberg 2002; Sa- overall estimate of the population effect size.
roglou 2002; Warner 2001). Take, for example, the title of
one article: Why Vote-Count Reviews Don't Count 11.2 COUNTING POSITIVE AND SIGNIFICANT
(Friedman 2001). We agree that the conventional vote- POSITIVE RESULTS
counting procedure can be described in these ways. But
all vote-counting procedures are not created equal. The Although some studies do not contain enough informa-
vote-counting procedures described in this chapter are far tion to compute an effect size estimate, they may still pro-
more sophisticated than the conventional procedure. vide information about the magnitude of the relationship
These more sophisticated procedures can have an impor- between the independent and dependent variables.3 Often
tant place in the meta-analysts toolbox. this information is in the form of a report of the decision
When authors use both vote-counting procedures and yielded by a significance test (for example, a significant
effect size procedures with the same data set, they quickly positive mean difference) or in the form of the direction
discover that vote-counting procedures are less powerful of the effect without regard to its statistical significance
(see Dochy et al. 2003; Jewell and McCourt 2000; Saro- (for example, a positive mean difference). The first form
glou 2002). However, vote-counting procedures should of information corresponds to whether the test statistic
never be used as a substitute for effect size procedures. exceeds a conventional critical value at a given signifi-
Research synthesists generally have access to four types cance level, such as   .05. The second form of informa-
of information from studies: tion corresponds to whether the tests statistic exceeds the
rather unconventional level   .5. For the vote-counting
a reported effect size, procedures described in this chapter, we consider both
the   .05 and   .5 significance levels.
information that can be used to compute an effect
size estimate (for example, raw data, means and stan-
dard deviations, test statistic values), 11.3 VOTE-COUNTING PROCEDURES
information about whether the hypothesis test found 11.3.1 The Conventional Vote-Counting
a statistically significant relationship between the in- Procedure
dependent and dependent variables, and the direction
of that relationship (for example, a significant posi- Richard Light and Paul Smith were among the first to
tive mean difference), and describe a formal procedure for taking a vote of study
results:
information about only the direction of the relation-
ship between the independent and dependent vari- All studies which have data on a dependent variable and a
ables (for example, a positive1 mean difference). specific independent variable of interest are examined.
VOTE COUNTING PROCEDURES IN META-ANALYSIS 209

Three possible outcomes are defined. The relationship between the independent variable and the dependent variable is either significantly positive, significantly negative, or there is no specific relationship in either direction. The number of studies falling into each of these three categories is then simply tallied. If a plurality of studies falls into any one of these three categories, with fewer falling into the other two, the modal category is declared the winner. This modal category is then assumed to give the best estimate of the direction of the true relationship between the independent and dependent variable. (1971, 433)

The conventional vote-counting procedure Light and Smith (1971) described has been criticized on several grounds. First, the conventional vote-counting procedure does not incorporate sample size into the vote. As sample size increases, the probability of finding a statistically significant relationship between the independent and dependent variables also increases. Thus, a small study with a nonsignificant result receives as much weight as a large study with a significant result. Second, although the conventional vote-counting procedure allows one to determine which modal category is the winner, it does not allow one to determine the margin of victory. To compare it to a horse race: did the winner barely win by a nose, or did the winner win impressively by a length or more? In other words, the traditional vote-counting procedure does not provide an effect size estimate. Third, the conventional vote-counting procedure has very low power for small to medium effect sizes and sample sizes (Hedges and Olkin 1980). Moreover, for such effect sizes, the power of the conventional vote-counting procedure decreases to zero as the number of studies to be integrated increases (Hedges and Olkin 1980). Unfortunately, many effects and samples in the social sciences are small to medium in size. For example, in the field of social psychology a synthesis of 322 meta-analyses of more than 25,000 studies involving more than 8 million participants found an average correlation of 0.21 (Richard, Bond, and Stokes-Zoota 2003). According to Jacob Cohen (1988), a small correlation is 0.1, a medium correlation is 0.3, and a large correlation is 0.5. Thus, the 0.21 correlation is between small and medium in size.

11.3.2 The Sign Test

The sign test can be used to test the hypothesis that the effect sizes from a collection of k independent studies are all zero. The sign test, the oldest nonparametric test (dating back to 1710), is simply the binomial test with probability π = .5 (Conover 1980, 122). If the best estimate for each effect size is zero, the probability of getting a positive result from a binomial test of sampled effect size estimates equals .5. If the effect size is expected to be greater than zero (as when a treatment has a positive effect), the probability of getting a positive result is greater than .5. Thus, the null (H0) and alternative (HA) hypotheses are

    H_0: \pi = .5, \quad H_A: \pi > .5,    (11.1)

where π is the proportion of positive results in the population. If one is counting significant positive results rather than positive results, use π = .05 rather than π = .5 in (11.1). If U equals the number of positive results in k independent studies, then an estimate of π is p = U/k.

A (1 − α) × 100 percent confidence interval for π is (Brown, Cai, and DasGupta 2001)

    \frac{U + \eta^2/2}{k + \eta^2} - \frac{\eta\sqrt{k}}{k + \eta^2}\sqrt{pq + \frac{\eta^2}{4k}}
    \;\le\; \pi \;\le\;
    \frac{U + \eta^2/2}{k + \eta^2} + \frac{\eta\sqrt{k}}{k + \eta^2}\sqrt{pq + \frac{\eta^2}{4k}},    (11.2)

where η = z_{α/2} = Φ⁻¹(1 − α/2), p = U/k, and q = 1 − p. For a 95 percent confidence interval, η = 1.96.

Example: Sign Test. In data set II, there are eleven positive results in the nineteen studies (58 percent). The estimate of π is therefore 11/19 = .58. When π = .5, it is found from a binomial distribution table that the corresponding area is .0593 (that is, .0417 + .0139 + .0032 + .0005 + .0000). Thus, we would fail to reject the null hypothesis that π = .5 at the .05 significance level (because the area .0593 is greater than .05). The corresponding 95 percent Wilson confidence interval is (0.456, 0.676). Because the confidence interval includes the value .5, we fail to reject the null hypothesis that π = .5 at the .05 significance level.

The sign test is based on the simple direction of each study result. Like the conventional vote-counting procedure, it does not incorporate sample size in the vote. It also does not provide an effect size estimate. The procedures described in the next section overcome both of these weaknesses.
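The sign test and the Wilson interval are easy to compute directly. The following is a minimal sketch (our illustration, not part of the chapter's procedures; the function names are ours), assuming only the Python standard library:

    from math import comb, sqrt

    def sign_test_p(u, k):
        # One-sided binomial tail: Pr(U >= u) under H0: pi = .5.
        return sum(comb(k, j) for j in range(u, k + 1)) / 2 ** k

    def wilson_ci(u, k, eta=1.96):
        # Wilson confidence interval for pi (equation 11.2).
        p, q = u / k, 1 - u / k
        center = (u + eta ** 2 / 2) / (k + eta ** 2)
        half = eta * sqrt(k) / (k + eta ** 2) * sqrt(p * q + eta ** 2 / (4 * k))
        return center - half, center + half

    # Data set II: eleven positive results in nineteen studies.
    p_value = sign_test_p(11, 19)
    lower, upper = wilson_ci(11, 19)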
11.3.3 Confidence Intervals Based on Vote-Counting Procedures

Suppose that the synthesist has a collection of k independent studies that test whether a relationship exists between the independent and dependent variables. The synthesist wants to test whether the effect size for each study, assumed to be the same for each study, is zero. Let T1, . . . , Tk be the independent effect size estimators for parameters θ1, . . . , θk. If the effect sizes are assumed to be homogeneous, the null and alternative hypotheses are

    H_0: \theta_1 = \cdots = \theta_k = 0, \quad H_A: \theta_1 = \cdots = \theta_k = \theta > 0.    (11.3)

The null hypothesis in (11.3) is rejected if the estimator T of the common effect size θ exceeds the one-sided⁴ critical value Cα. The 100(1 − α) percent confidence interval for θ is given by

    T - C_{\alpha/2} S_T \le \theta \le T + C_{\alpha/2} S_T,    (11.4)

where Cα/2 is the two-sided critical value from the distribution of T at significance level α, and S_T is the standard error of T.

The essential feature of vote-counting procedures is that the values T1, . . . , Tk are not observed. If we do not observe Ti, we simply count the number of times Ti exceeded the one-sided critical value Cα. For α = .5, we count the number of studies with positive results, and use the null and alternative hypotheses in (11.1). For α = .05, we count the number of studies with significant positive results, and use the null and alternative hypotheses given by

    H_0: \pi = .05, \quad H_A: \pi > .05.    (11.5)

(If θ = 0, the probability of getting a significant positive result equals .05.)

To test the hypotheses in (11.1) and (11.5) we need an estimator of π. The method of maximum likelihood can be used to obtain such an estimator. For each study there are two possible outcomes: a success if Ti > Cα, or a failure if Ti ≤ Cα. Let Xi be an indicator variable such that

    X_i = 1 \text{ if } T_i > C_\alpha, \quad X_i = 0 \text{ if } T_i \le C_\alpha.    (11.6)

If one counts the number of positive results, the probability πi is

    \pi_i = \Pr(X_i = 1) = \Pr(T_i > C_{.5,\nu_i}),    (11.7)

where νi are the degrees of freedom for the ith study at the α = .5 significance level. If one counts the number of significant positive results, the probability πi is

    \pi_i = \Pr(X_i = 1) = \Pr(T_i > C_{.05,\nu_i}).    (11.8)

If T is approximately normally distributed with mean θ and variance vi, then the probability π can be expressed as an explicit function of θ,

    \pi_i = \Pr(T_i > C_\alpha) = \Pr\!\left[\frac{T_i - \theta}{\sqrt{v_i}} > \frac{C_\alpha - \theta}{\sqrt{v_i}}\right] = 1 - \Phi\!\left[\frac{C_\alpha - \theta}{\sqrt{v_i}}\right],    (11.9)

where Φ(x) is the standard normal distribution function. Solving (11.9) for θ gives

    \theta = C_\alpha - \sqrt{v_i}\,\Phi^{-1}(1 - \pi_i).    (11.10)

Equation 11.10 provides a general large-sample approximate relationship between a proportion and an effect size. It applies to effect sizes that are normally distributed in large samples.

The method of maximum likelihood can be used to obtain an estimator of θ by observing whether Ti > Cα in each of the k studies. The log-likelihood function is

    L(\theta) = \sum_{i=1}^{k} \left[ x_i \log(\pi_i) + (1 - x_i)\log(1 - \pi_i) \right],    (11.11)

where xi is the observed value of Xi (that is, 1 or 0) and log(x) is the natural logarithmic function. The log-likelihood function L(θ) can be maximized over θ to obtain the maximum likelihood estimator (MLE) θ̂. Generally, however, there is no closed form expression for the MLE, and θ̂ must be obtained numerically. One way to obtain θ̂ is to compute the log-likelihood function L(θ) for a coarse grid of θ values and use the maximum value of L(θ) to determine the region where θ̂ lies. Then a finer grid of θ values can be used to obtain a more precise estimate of θ̂.

When the number of studies combined is large, θ̂ is approximately normally distributed, and the large-sample variance of θ̂ is given by (Hedges and Olkin 1985, 70)

    \mathrm{Var}(\hat\theta) = \left\{ \sum_{i=1}^{k} \frac{[D_i^{(1)}]^2}{\pi_i(1 - \pi_i)} \right\}^{-1},    (11.12)

where D_i^{(1)} is the first-order derivative of πi with respect to θ, evaluated at θ = θ̂:

    D_i^{(1)} = \frac{\partial \pi_i}{\partial \theta} = \frac{S_i(\theta) + (C_\alpha - \theta)\,S_i'(\theta)}{\sqrt{2\pi}\,S_i(\theta)^2}\, \exp\!\left( -\frac{1}{2}\,\frac{(C_\alpha - \theta)^2}{S_i(\theta)^2} \right),    (11.13)

where S_i'(θ) = ∂S_i(θ)/∂θ, evaluated at θ = θ̂, and the variance vi of the effect size estimate can be expressed as a function of the effect size parameter θ via vi = [S_i(θ)]².

The upper and lower bounds of a 100(1 − α) percent confidence interval for the population effect size θ are given by

    \hat\theta - z_{\alpha/2}\sqrt{\mathrm{Var}(\hat\theta)} \le \theta \le \hat\theta + z_{\alpha/2}\sqrt{\mathrm{Var}(\hat\theta)},    (11.14)

where z_{α/2} is the two-sided critical value from the standard normal distribution at significance level α.

Section 11.3.3.1 describes how to obtain an estimate and a confidence interval for the population standardized mean difference δ. Section 11.3.3.2 illustrates how to obtain an estimate and a confidence interval for the population correlation coefficient ρ.
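Because θ̂ has no closed form, the two-stage grid search described above is easy to automate. The following generic sketch (ours, not the chapter's; Python standard library) takes the indicator values xi and, for each study, a function giving πi as a function of θ, as in equation 11.9, plus its derivative for the variance in equation 11.12:

    from math import log

    def grid_mle(x, pi_funcs, lo=-0.5, hi=0.5, step=0.10):
        # Two-stage grid search for the MLE of theta (equation 11.11).
        def loglik(theta):
            return sum(xi * log(p(theta)) + (1 - xi) * log(1 - p(theta))
                       for xi, p in zip(x, pi_funcs))
        coarse = [lo + i * step for i in range(int(round((hi - lo) / step)) + 1)]
        best = max(coarse, key=loglik)
        fine = [best + i * step / 10 for i in range(-10, 11)]  # refine by one decimal
        return max(fine, key=loglik)

    def large_sample_variance(theta, pi_funcs, dpi_funcs):
        # Equation 11.12: reciprocal of the summed information terms.
        info = sum(dp(theta) ** 2 / (p(theta) * (1 - p(theta)))
                   for p, dp in zip(pi_funcs, dpi_funcs))
        return 1 / info

Sections 11.3.3.1 and 11.3.3.2 supply the πi functions and derivatives for standardized mean differences and correlations, respectively.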
11.3.3.1 Population Standardized Mean Difference If the independent variable is dichotomous (for example, experimental vs. control group), one common effect size for the ith study is the population standardized mean difference

    \delta_i = \frac{\mu_i^E - \mu_i^C}{\sigma_i}, \quad i = 1, \ldots, k,    (11.15)

where μiE and μiC are the respective population means for the experimental and control groups in the ith study, and σi is the common population standard deviation in the ith study. The sample estimator of δi is

    d_i = \frac{\bar Y_i^E - \bar Y_i^C}{S_i}, \quad i = 1, \ldots, k,    (11.16)

where ȲiE and ȲiC are the respective sample means for the experimental and control groups in the ith study, and Si is the pooled within-group sample standard deviation in the ith study. If we assume the effect sizes are homogeneous, we test the hypotheses

    H_0: \delta_1 = \cdots = \delta_k = 0, \quad H_A: \delta_1 = \cdots = \delta_k = \delta > 0.    (11.17)

Suppose that we observe whether the mean difference ȲiE − ȲiC is positive for each study in a collection of k independent studies. If we assume that the effect sizes are homogeneous, then the mean difference ȲiE − ȲiC is approximately normally distributed with mean δσi and variance σi²/ñi, where ñi = niE niC/(niE + niC). The probability of a positive result is

    \pi_i = \Pr(T_i > C_{.5,\nu_i}) = \Pr(\bar Y_i^E - \bar Y_i^C > 0) = 1 - \Phi(-\delta\sqrt{\tilde n_i}).    (11.18)

Note that πi is a function of both δ and ñi. The log-likelihood function is

    L(\delta) = \sum_{i=1}^{k} \left[ x_i \log[1 - \Phi(-\delta\sqrt{\tilde n_i})] + (1 - x_i)\log \Phi(-\delta\sqrt{\tilde n_i}) \right],    (11.19)

and the derivative is

    D_i^{(1)} = \frac{\partial \pi_i}{\partial \delta} = \sqrt{\frac{\tilde n_i}{2\pi}}\, \exp\!\left( -\frac{1}{2}\,\delta^2 \tilde n_i \right).    (11.20)

The large sample variance of δ̂ is obtained by substituting equation 11.18 for πi and equation 11.20 for D_i^{(1)} in equation 11.12.

Example: Estimating δ from the Proportion of Positive Results. Suppose we wish to estimate δ based on the proportion of positive results in data set II. Table 11.1 gives the values of nE, nC, ñ, d, and X for the nineteen studies in data set II. Inserting these values into equation 11.19 for values of δ ranging from −0.50 to 0.50 gives the coarse log-likelihood function values on the left side of table 11.2. The coarse grid of δ values reveals that the maximum value of L(δ) lies between −0.10 and 0.10. To obtain an estimate of δ to two decimal places, we compute the log-likelihood function values between −0.10 and 0.10 in steps of 0.01. The fine grid of δ values is given on the right side of table 11.2. Because the log-likelihood function is largest at δ = 0.02, the MLE is δ̂ = 0.02.

The computations for obtaining the large sample variance of δ̂ are given in table 11.3. Substituting these values in equation 11.12 yields Var(δ̂) = 1/473.734 = 0.00211.
The 95 percent confidence interval for δ given in equation 11.14 is

    0.02 - 1.96\sqrt{0.00211} \le \delta \le 0.02 + 1.96\sqrt{0.00211},

or, simplifying, [−0.070, 0.110]. Because the confidence interval includes the value 0, the effect does not differ significantly from 0.

Table 11.1 Experimental and Control Group Sample Sizes, Standardized Mean Differences, and Indicator Variable Values

Study   nE    nC    nE nC /(nE + nC)   d       X
1       77    339   62.748             0.03    1
2       60    198   46.047             0.12    1
3       72    72    36.000            -0.14    0
4       11    22    7.333              1.18    1
5       11    22    7.333              0.26    1
6       129   348   94.113            -0.06    0
7       110   636   93.780            -0.02    0
8       26    99    20.592            -0.32    0
9       75    74    37.248             0.27    1
10      32    32    16.000             0.80    1
11      22    22    11.000             0.54    1
12      43    38    20.173             0.18    1
13      24    24    12.000            -0.02    0
14      19    32    11.922             0.23    1
15      80    79    39.748            -0.18    0
16      72    72    36.000            -0.06    0
17      65    255   51.797             0.30    1
18      233   224   114.206            0.07    1
19      65    67    32.992            -0.07    0

source: Authors' compilation.
note: nE = experimental group sample size; nC = control group sample size; d = sample standardized mean difference given in equation 11.16; X = indicator variable given in equation 11.6.

Table 11.2 Log-Likelihood Function Values for δ

Coarse Grid             Fine Grid
δ        L(δ)           δ        L(δ)
-0.50    -52.430        -0.10    -15.259
-0.40    -39.135        -0.09    -14.912
-0.30    -28.409        -0.08    -14.596
-0.20    -20.387        -0.07    -14.311
-0.10    -15.259        -0.06    -14.056
 0.00    -13.170        -0.05    -13.831
 0.10    -14.082        -0.04    -13.638
 0.20    -17.733        -0.03    -13.475
 0.30    -23.813        -0.02    -13.343
 0.40    -32.149        -0.01    -13.241
 0.50    -42.657         0.00    -13.170
                         0.01    -13.129
                         0.02    -13.118
                         0.03    -13.136
                         0.04    -13.185
                         0.05    -13.263
                         0.06    -13.370
                         0.07    -13.505
                         0.08    -13.669
                         0.09    -13.862
                         0.10    -14.082

source: Authors' compilation.
note: Based on data from nineteen studies of the effects of teacher expectancy on pupil IQ.
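The grid search and variance computation for this example are easy to reproduce. A minimal sketch (ours; Python 3.8+ standard library), using the ñi and X values from table 11.1:

    from math import exp, log, pi as PI, sqrt
    from statistics import NormalDist

    Phi = NormalDist().cdf
    n_tilde = [62.748, 46.047, 36.000, 7.333, 7.333, 94.113, 93.780, 20.592,
               37.248, 16.000, 11.000, 20.173, 12.000, 11.922, 39.748, 36.000,
               51.797, 114.206, 32.992]
    x = [1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0]

    def loglik(d):
        # Equation 11.19, using 1 - Phi(-z) = Phi(z).
        return sum(xi * log(Phi(d * sqrt(n))) + (1 - xi) * log(Phi(-d * sqrt(n)))
                   for xi, n in zip(x, n_tilde))

    coarse = max((i / 10 for i in range(-5, 6)), key=loglik)           # 0.00
    mle = max((coarse + i / 100 for i in range(-10, 11)), key=loglik)  # 0.02

    # Equations 11.20 and 11.12 for the large-sample variance.
    info = sum((n / (2 * PI)) * exp(-mle ** 2 * n)
               / (Phi(mle * sqrt(n)) * Phi(-mle * sqrt(n))) for n in n_tilde)
    var = 1 / info  # about 0.0021, matching table 11.3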
11.3.3.2 Population Correlation Coefficient If both the independent and dependent variables are continuous, the effect size is the population correlation coefficient

    \rho_i = \frac{\mathrm{Cov}(X_i, Y_i)}{\sigma_{X_i}\,\sigma_{Y_i}}, \quad i = 1, \ldots, k,    (11.21)

where Cov(Xi, Yi) is the population covariance for variables X and Y in the ith study, and σXi and σYi are the respective population standard deviations for variables X and Y in the ith study. The sample estimator of ρi is the Pearson product moment correlation coefficient

    r_i = \frac{\sum_{j=1}^{n_i}(X_j - \bar X)(Y_j - \bar Y)}{\sqrt{\sum_{j=1}^{n_i}(X_j - \bar X)^2 \sum_{j=1}^{n_i}(Y_j - \bar Y)^2}}, \quad i = 1, \ldots, k.    (11.22)

If we assume the effect sizes are homogeneous, the hypotheses in (11.3) can be written as

    H_0: \rho_1 = \cdots = \rho_k = 0, \quad H_A: \rho_1 = \cdots = \rho_k = \rho > 0.    (11.23)

Suppose that we observe whether the correlation ri is positive for each study in a collection of k independent studies. If we assume that the effect sizes are homogeneous, then ri is approximately normally distributed with mean ρ and variance (1 − ρ²)²/ni. The probability of a positive result is

    \pi_i = \Pr(T_i > C_{.5,\nu_i}) = \Pr(r_i > 0) = 1 - \Phi\!\left( \frac{-\sqrt{n_i}\,\rho}{1 - \rho^2} \right).    (11.24)
Note that πi is a function of both ρ and ni. The log-likelihood function is

    L(\rho) = \sum_{i=1}^{k} \left[ x_i \log\!\left[1 - \Phi\!\left(\frac{-\sqrt{n_i}\,\rho}{1-\rho^2}\right)\right] + (1 - x_i)\log \Phi\!\left(\frac{-\sqrt{n_i}\,\rho}{1-\rho^2}\right) \right],    (11.25)

and the derivative is

    D_i^{(1)} = \frac{\partial \pi_i}{\partial \rho} = \sqrt{\frac{n_i}{2\pi}}\; \frac{1 + \rho^2}{(1-\rho^2)^2}\, \exp\!\left( -\frac{1}{2}\,\frac{n_i\,\rho^2}{(1-\rho^2)^2} \right).    (11.26)

The large sample variance of ρ̂ is obtained by substituting equation 11.24 for πi and equation 11.26 for D_i^{(1)} in equation 11.12.

Example: Estimating ρ from the Proportion of Positive Results. Suppose we wish to estimate ρ based on the proportion of positive results in data set III. Table 11.4 gives the values of n, r, and X for the twenty studies in data set III. Inserting these values into equation 11.25 for values of ρ ranging from −0.50 to 0.50 gives the coarse log-likelihood function values on the left side of table 11.5. The coarse grid of ρ values reveals that the maximum value of L(ρ) lies between 0.10 and 0.30. To obtain an estimate of ρ to two decimal places, we compute the log-likelihood function values between 0.10 and 0.30 in steps of 0.01. The fine grid of ρ values is given on the right side of table 11.5. Because the log-likelihood function is largest at ρ = 0.24, the MLE is ρ̂ = 0.24.

The computations for obtaining the large sample variance of ρ̂ are given in table 11.6. Substituting these values in equation 11.12 yields Var(ρ̂) = 1/2026.900 = 0.000493. The 95 percent confidence interval for ρ given in equation 11.14 is

    0.24 - 1.96\sqrt{0.000493} \le \rho \le 0.24 + 1.96\sqrt{0.000493},

or, simplifying, [0.196, 0.283]. Because the confidence interval excludes the value 0, the correlation is significantly different from 0.

Table 11.3 Computations for Obtaining the Large-Sample Variance of δ̂

Study   πi      Di(1)   [Di(1)]²/[πi(1 − πi)]
1       0.563   3.121   39.583
2       0.554   2.682   29.119
3       0.548   2.376   22.799
4       0.522   1.079   4.664
5       0.522   1.079   4.664
6       0.577   3.798   59.100
7       0.577   3.792   58.893
8       0.536   1.803   13.070
9       0.549   2.417   23.585
10      0.532   1.591   10.162
11      0.526   1.320   6.992
12      0.536   1.785   12.805
13      0.528   1.379   7.626
14      0.528   1.374   7.576
15      0.550   2.495   25.159
16      0.548   2.376   22.799
17      0.557   2.842   32.728
18      0.585   4.167   71.507
19      0.546   2.276   20.903
Sum                     473.734

source: Authors' compilation.
note: From nineteen studies of the effects of teacher expectancy on pupil IQ.

Table 11.4 Sample Sizes, Correlations, and Indicator Variable Values

Study   n     r      X
1       10    .68    1
2       20    .56    1
3       13    .23    1
4       22    .64    1
5       28    .49    1
6       12    -.04   0
7       12    .49    1
8       36    .33    1
9       19    .58    1
10      12    .18    1
11      36    -.11   0
12      75    .27    1
13      33    .26    1
14      121   .40    1
15      37    .49    1
16      14    .51    1
17      40    .40    1
18      16    .34    1
19      14    .42    1
20      20    .16    1

source: Authors' compilation.
note: From twenty studies of the relation between student ratings of the instructor and student achievement. n = sample size; r = Pearson product moment correlation coefficient given in equation 11.22; X = indicator variable given in equation 11.6.
Table 11.5 Log-Likelihood Function Values for ρ

Coarse Grid              Fine Grid
ρ        L(ρ)            ρ       L(ρ)
-0.50    -159.677        0.10    -8.961
-0.40    -95.780         0.11    -8.640
-0.30    -59.089         0.12    -8.346
-0.20    -36.659         0.13    -8.079
-0.10    -22.568         0.14    -7.837
 0.00    -13.863         0.15    -7.621
 0.10    -8.961          0.16    -7.430
 0.20    -6.900          0.17    -7.263
 0.30    -7.108          0.18    -7.119
 0.40    -9.552          0.19    -6.998
 0.50    -14.990         0.20    -6.900
                         0.21    -6.824
                         0.22    -6.770
                         0.23    -6.738
                         0.24    -6.727
                         0.25    -6.737
                         0.26    -6.769
                         0.27    -6.821
                         0.28    -6.895
                         0.29    -6.991
                         0.30    -7.108

source: Authors' compilation.
note: Based on data from twenty studies of the relation between student ratings of the instructor and student achievement.

Table 11.6 Computations for Obtaining the Large-Sample Variance of ρ̂

Study   πi      Di(1)   [Di(1)]²/[πi(1 − πi)]
1       0.790   3.413   70.118
2       0.873   3.489   109.552
3       0.821   3.530   84.708
4       0.884   3.430   114.607
5       0.911   3.185   125.273
6       0.811   3.503   80.133
7       0.811   3.503   80.133
8       0.937   2.787   131.041
9       0.867   3.513   106.710
10      0.811   3.503   80.133
11      0.937   2.787   131.041
12      0.986   1.136   95.356
13      0.928   2.940   129.840
14      0.997   0.325   41.491
15      0.939   2.735   131.219
16      0.830   3.547   89.008
17      0.946   2.580   131.152
18      0.846   3.553   96.821
19      0.830   3.547   89.008
20      0.873   3.489   109.552
Sum                     2026.900

source: Authors' compilation.
note: From twenty studies of the relation between student ratings of the instructor and student achievement.
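The correlation version of the grid search differs only in how πi depends on the parameter. A minimal sketch (ours; Python 3.8+ standard library), reproducing the grids of table 11.5 from the data in table 11.4:

    from math import log, sqrt
    from statistics import NormalDist

    Phi = NormalDist().cdf
    n = [10, 20, 13, 22, 28, 12, 12, 36, 19, 12,
         36, 75, 33, 121, 37, 14, 40, 16, 14, 20]
    x = [1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
         0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

    def loglik(r):
        # Equation 11.25, using 1 - Phi(-z) = Phi(z).
        z = r / (1 - r * r)
        return sum(xi * log(Phi(sqrt(ni) * z)) + (1 - xi) * log(Phi(-sqrt(ni) * z))
                   for xi, ni in zip(x, n))

    coarse = max((i / 10 for i in range(-5, 6)), key=loglik)           # 0.20
    mle = max((coarse + i / 100 for i in range(-10, 11)), key=loglik)  # 0.24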

11.4 COMBINING VOTE COUNTS AND EFFECT SIZE ESTIMATES

Effect size procedures should be used for studies that include enough information to compute an effect size estimate. Vote-counting procedures should be used for studies that do not provide enough information to compute an effect size estimate, but do provide information about the direction and/or statistical significance of effects. We have proposed a procedure, called the combined procedure, that combines estimates based on effect size and vote-counting procedures to obtain an overall estimate of the population effect size (see Bushman and Wang 1996).

11.4.1 Population Standardized Mean Difference

Out of k independent studies, suppose that the first m1 studies (m1 ≤ k) report enough information to compute standardized mean differences, that the next m2 studies (m2 ≤ k) only report the direction and statistical significance of results, and that the last m3 studies (m3 ≤ k) only report the direction of results, where m1 + m2 + m3 = k. The synthesist can use effect size procedures to obtain an estimator d̄ of the population standardized mean difference δ using the first m1 studies (see chapter 14, this volume). The synthesist can also use vote-counting procedures based on the proportion of significant positive results to obtain a maximum likelihood estimator δ̂S of δ using the second m2 studies. Finally, vote-counting procedures based on the proportion of positive results can be used to obtain the maximum likelihood estimator δ̂D of δ using the last m3 studies. (We use the subscript S for the estimate based on studies that provide information about the significance and direction of results, and we use the subscript D for the estimate based on studies that provide information only about the direction of results.) The respective variances for d̄, δ̂S, and δ̂D are Var(d̄), Var(δ̂S), and Var(δ̂D). The combined weighted estimator of δ is given by

    \hat\delta_C = \frac{\bar d/\mathrm{Var}(\bar d) + \hat\delta_S/\mathrm{Var}(\hat\delta_S) + \hat\delta_D/\mathrm{Var}(\hat\delta_D)}{1/\mathrm{Var}(\bar d) + 1/\mathrm{Var}(\hat\delta_S) + 1/\mathrm{Var}(\hat\delta_D)}.    (11.27)

Because d̄, δ̂S, and δ̂D are consistent estimators of δ, the combined estimator δ̂C will also be consistent if the studies containing each type of information (that is, information used to compute effect size estimates, information about the direction and significance of results, information about only the direction of results) can be considered a random sample from the population of studies containing the same type of information, and m1, m2, and m3 are all large enough and all converge to infinity at the same rate. The variance of the combined estimator is given by

    \mathrm{Var}(\hat\delta_C) = \frac{1}{1/\mathrm{Var}(\bar d) + 1/\mathrm{Var}(\hat\delta_S) + 1/\mathrm{Var}(\hat\delta_D)}.    (11.28)

Because 1/Var(d̄) ≥ 0, 1/Var(δ̂S) ≥ 0, and 1/Var(δ̂D) ≥ 0, the combined estimator δ̂C is more efficient (in other words, has smaller variance) than d̄, δ̂S, and δ̂D.

The upper and lower bounds of a 100(1 − α) percent confidence interval for the population standardized mean difference δ are given by

    \hat\delta_C - z_{\alpha/2}\sqrt{\mathrm{Var}(\hat\delta_C)} \le \delta \le \hat\delta_C + z_{\alpha/2}\sqrt{\mathrm{Var}(\hat\delta_C)}.    (11.29)

The confidence interval based on the combined estimator is narrower than the confidence intervals based on the other estimators.

Example: Using the Combined Procedure to Obtain an Estimate and a 95 Percent Confidence Interval for the Population Standardized Mean Difference δ. Suppose we wish to estimate δ based on the results in data set II. Because all studies in data set II actually report effect size estimates, for purposes of illustration we assumed that studies 4, 5, 11, 12, and 15 report only the direction of the study outcome. The effect size estimates are d̄ = 0.048 and δ̂D = 0.259. The variances for the estimates are Var(d̄) = 0.0015 and Var(δ̂D) = 0.0077. Inserting the estimates and variances into equation 11.27 gives the combined estimate

    \hat\delta_C = \frac{\bar d/\mathrm{Var}(\bar d) + \hat\delta_D/\mathrm{Var}(\hat\delta_D)}{1/\mathrm{Var}(\bar d) + 1/\mathrm{Var}(\hat\delta_D)} = \frac{0.048/0.0015 + 0.259/0.0077}{1/0.0015 + 1/0.0077} = 0.082.

From equation 11.28, the variance of the combined estimate is

    \mathrm{Var}(\hat\delta_C) = \frac{1}{1/\mathrm{Var}(\bar d) + 1/\mathrm{Var}(\hat\delta_D)} = \frac{1}{1/0.0015 + 1/0.0077} = 0.00126.

From equation 11.29, the 95 percent confidence interval for the combined estimate is (0.012, 0.152). Note that the confidence interval based on the combined estimate is narrower (0.152 − 0.012 = 0.140) than the confidence interval based on effect sizes [0.124 − (−0.029) = 0.153] or the confidence interval based on the proportion of positive results (0.431 − 0.086 = 0.345).

11.4.2 Population Correlation Coefficient

Out of k independent studies, suppose that the first m1 studies (m1 ≤ k) report enough information to compute sample correlation coefficients, that the next m2 studies (m2 ≤ k) only report the direction and statistical significance of results, and that the last m3 studies (m3 ≤ k) only report the direction of results, where m1 + m2 + m3 = k. The synthesist can use effect size procedures to obtain an estimator r̄ of the population correlation coefficient ρ using the first m1 studies (see chapter 14, this volume). The synthesist can use vote-counting procedures based on the proportion of significant positive results to obtain a maximum likelihood estimator ρ̂S of ρ using the second m2 studies. Finally, vote-counting procedures based on the proportion of positive results can be used to obtain the maximum likelihood estimator ρ̂D of ρ using the last m3 studies. The respective variances for r̄, ρ̂S, and ρ̂D are Var(r̄), Var(ρ̂S), and Var(ρ̂D). The combined weighted estimator of ρ is given by

    \hat\rho_C = \frac{\bar r/\mathrm{Var}(\bar r) + \hat\rho_S/\mathrm{Var}(\hat\rho_S) + \hat\rho_D/\mathrm{Var}(\hat\rho_D)}{1/\mathrm{Var}(\bar r) + 1/\mathrm{Var}(\hat\rho_S) + 1/\mathrm{Var}(\hat\rho_D)}.    (11.30)

As with δ̂C, the combined estimate ρ̂C will also be consistent if the studies containing each type of information can be considered a random sample from the population of studies containing the same type of information, and m1, m2, and m3 are all large enough and all converge to infinity at the same rate. The variance of the combined estimator is given by

    \mathrm{Var}(\hat\rho_C) = \frac{1}{1/\mathrm{Var}(\bar r) + 1/\mathrm{Var}(\hat\rho_S) + 1/\mathrm{Var}(\hat\rho_D)}.    (11.31)
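Equations 11.27 through 11.31 are simple precision-weighted averages, so the combined procedure reduces to a few lines of arithmetic. A minimal sketch (ours; Python standard library), using the data set II illustration above:

    from math import sqrt

    def combine(estimates, variances, z=1.96):
        # Equations 11.27-11.29: inverse-variance weighted combination.
        weights = [1 / v for v in variances]
        est = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
        var = 1 / sum(weights)
        return est, var, (est - z * sqrt(var), est + z * sqrt(var))

    # d-bar = 0.048 with Var = 0.0015; delta-hat-D = 0.259 with Var = 0.0077.
    est, var, ci = combine([0.048, 0.259], [0.0015, 0.0077])
    # est is about 0.082, var about 0.00126, ci about (0.012, 0.152),
    # matching the worked example up to rounding.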

Because 1/Var(r̄) ≥ 0, 1/Var(ρ̂S) ≥ 0, and 1/Var(ρ̂D) ≥ 0, the combined estimator is more efficient (that is, has smaller variance) than r̄, ρ̂S, and ρ̂D.

The upper and lower bounds of a 100(1 − α) percent confidence interval for the population correlation coefficient ρ are given by

    \hat\rho_C - z_{\alpha/2}\sqrt{\mathrm{Var}(\hat\rho_C)} \le \rho \le \hat\rho_C + z_{\alpha/2}\sqrt{\mathrm{Var}(\hat\rho_C)}.    (11.32)

The confidence interval based on the combined estimator is narrower than the confidence intervals based on the other estimators.

Example: Using the Combined Procedure to Obtain an Estimate and a 95 Percent Confidence Interval for the Population Correlation Coefficient ρ. Suppose we wish to estimate ρ based on the results in data set III. The effect size estimates are r̄ = 0.362 and ρ̂D = 0.318. The variances for the estimates are Var(r̄) = 0.0017 and Var(ρ̂D) = 0.0127. Inserting the estimates and variances into equation 11.30 gives the combined estimate

    \hat\rho_C = \frac{\bar r/\mathrm{Var}(\bar r) + \hat\rho_D/\mathrm{Var}(\hat\rho_D)}{1/\mathrm{Var}(\bar r) + 1/\mathrm{Var}(\hat\rho_D)} = \frac{0.362/0.0017 + 0.318/0.0127}{1/0.0017 + 1/0.0127} = 0.357.

From equation 11.31, the variance of the combined estimate is

    \mathrm{Var}(\hat\rho_C) = \frac{1}{1/\mathrm{Var}(\bar r) + 1/\mathrm{Var}(\hat\rho_D)} = \frac{1}{1/0.0017 + 1/0.0127} = 0.0015.

From equation 11.32, the 95 percent confidence interval for the combined estimate is (0.281, 0.432). Note that the confidence interval based on the combined estimate is narrower (0.432 − 0.281 = 0.151) than the confidence interval based on effect sizes (0.442 − 0.281 = 0.161) or the confidence interval based on the proportion of positive results (0.539 − 0.097 = 0.442).

11.5 APPLICATIONS OF VOTE-COUNTING PROCEDURES

11.5.1 Publication Bias

It is important to note that vote-counting procedures provide meta-analysts another tool to deal with the problem of publication bias (for an in-depth discussion, see chapter 23, this volume). If studies with significant results are more likely to be published than studies with nonsignificant results, then published studies constitute a biased, unrepresentative sample of the studies in the population. This bias can be reduced if meta-analysts count both positive and negative significant results, as Larry Hedges and Ingram Olkin suggested:

When both positive and negative significant results are counted, it is possible to dispense with the requirement that the sample available is representative of all studies conducted. Instead, the requirement is that the sample of positive and negative significant results is representative of the population of positive and negative significant results. If only statistically significant results tend to be published, this requirement is probably more realistic. (1980, 366; italics in original)

Suppose that a research synthesist wishes to obtain an estimate and confidence interval for θ using the k estimators T1, . . . , Tk. In this case, because the values T1, . . . , Tk are not observed, we count the number of times that Ti > Cα/2 and the number of times Ti < −Cα/2. That is, we count the number of two-sided tests that found significant positive or significant negative results.

To find a confidence interval for θ, we find the values [θL, θU] that correspond to the lower and upper confidence limits [λL, λU] for λ, where λ denotes the proportion of the total number of significant results that were positive. That is,

    \lambda = \frac{\pi^+}{\pi^+ + \pi^-},    (11.33)

where π⁺ = Pr(Ti > Cα/2) and π⁻ = Pr(Ti < −Cα/2). The MLE of λ is

    \hat\lambda = \frac{p^+}{p^+ + p^-},    (11.34)

where p⁺ is the proportion of significant positive results and p⁻ is the proportion of significant negative results in the k independent studies.

Larry Hedges and Ingram Olkin (1980) produced a table (see table 11.7) for obtaining an estimate and confidence interval for δ given the proportion of the total number of significant standardized mean differences that were positive. Brad Bushman (1994) produced a table (see table 11.8) for obtaining an estimate and confidence interval for ρ given the proportion of the total number of significant correlations that were positive. Examples 1 and 2 illustrate how to use tables 11.7 and 11.8, respectively.
Table 11.7 Conditional Probability λ(δ, n) = Pr{t > 0 | |t| > C_t}

Population Standardized Mean Difference δ

n     0.00   0.02   0.04   0.06   0.08   0.10   0.15   0.20   0.25   0.30   0.40   0.50   0.70
2     0.500  0.516  0.531  0.547  0.562  0.577  0.615  0.651  0.685  0.718  0.777  0.827  0.900
4     0.500  0.528  0.556  0.584  0.611  0.638  0.700  0.756  0.805  0.845  0.906  0.945  0.982
6     0.500  0.537  0.573  0.609  0.643  0.676  0.751  0.814  0.863  0.901  0.950  0.976  0.994
8     0.500  0.544  0.586  0.628  0.668  0.706  0.788  0.852  0.899  0.932  0.971  0.988  0.998
10    0.500  0.549  0.598  0.644  0.689  0.729  0.816  0.879  0.923  0.952  0.982  0.993  0.999
12    0.500  0.555  0.608  0.659  0.706  0.750  0.838  0.900  0.940  0.964  0.988  0.996  1.000
14    0.500  0.559  0.617  0.672  0.722  0.767  0.857  0.916  0.952  0.973  0.992  0.998  1.000
16    0.500  0.564  0.625  0.683  0.736  0.783  0.872  0.929  0.961  0.979  0.994  0.998  1.000
18    0.500  0.568  0.633  0.694  0.749  0.796  0.886  0.939  0.968  0.984  0.996  0.999  1.000
20    0.500  0.572  0.640  0.704  0.760  0.809  0.897  0.947  0.974  0.987  0.997  0.999  1.000
22    0.500  0.575  0.647  0.713  0.771  0.820  0.907  0.954  0.978  0.990  0.998  1.000
24    0.500  0.579  0.654  0.721  0.781  0.830  0.915  0.960  0.982  0.992  0.998
50    0.500  0.614  0.716  0.800  0.864  0.910  0.970  0.991  0.997  0.999  1.000
100   0.500  0.659  0.789  0.878  0.933  0.964  0.993  0.999  1.000  1.000

source: Hedges and Olkin 1980.
notes: Proportions less than .500 correspond to negative values of δ. The table gives the probability that a two-sample t statistic with n subjects per group is positive given that the absolute value of the t statistic exceeds the α = .05 critical value.

Example 1: Obtaining an Estimate and Confidence Interval for the Population Standardized Mean Difference δ Based on the Proportion of the Total Number of Significant Mean Differences That Were Positive. Suppose that all the results in data set II are significant; thus we have eleven significant positive results and eight significant negative results. One needs an average sample size for table 11.7, and there are several averages to choose from (the arithmetic mean, for example). Jean Gibbons, Ingram Olkin, and Milton Sobel (1977) recommended using the square mean root (nSMR) as the average sample size because it is not as influenced by extreme values as the arithmetic mean is:

    n_{SMR} = \left( \frac{\sqrt{n_1} + \cdots + \sqrt{n_k}}{k} \right)^2.    (11.35)

The square mean root for the nineteen studies was calculated by substituting the mean of the experimental and control group sample sizes in equation 11.35,

    n_{SMR} = \left[ \left( \sqrt{(77+339)/2} + \cdots + \sqrt{(65+67)/2} \right) \big/ 19 \right]^2 = 83.515, \text{ or } 84.

Our estimate of λ given by equation 11.34 is λ̂ = 11/(11 + 8) = 0.579. Linear interpolation in table 11.7 yields the estimate δ̂ = 0.011 and the 95 percent confidence interval [−0.019, 0.048]. Because the confidence interval includes the value zero, we fail to reject the null hypothesis that δ = 0.

Example 2: Obtaining an Estimate and Confidence Interval for the Population Correlation Coefficient ρ Based on the Proportion of the Total Number of Significant Correlations That Were Positive. Suppose that all the results in data set III are significant; thus we have eighteen significant positive results and two significant negative results. Using equation 11.35, we obtain the square mean root

    n_{SMR} = \left[ \left( \sqrt{10} + \cdots + \sqrt{20} \right) \big/ 20 \right]^2 = 25.878, \text{ or } 26.

Our estimate of λ given by equation 11.34 is λ̂ = 18/(18 + 2) = 0.900. Linear interpolation in table 11.8 yields the estimate ρ̂ = 0.100 and the 95 percent confidence interval [0.058, 0.300]. Because the confidence interval excludes the value zero, we reject the null hypothesis that ρ = 0.
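Equations 11.34 and 11.35 are quick to compute before consulting the tables. A minimal sketch (ours; Python standard library), using data set II as in example 1:

    from math import sqrt

    def square_mean_root(sizes):
        # Equation 11.35.
        return (sum(sqrt(n) for n in sizes) / len(sizes)) ** 2

    nE = [77, 60, 72, 11, 11, 129, 110, 26, 75, 32,
          22, 43, 24, 19, 80, 72, 65, 233, 65]
    nC = [339, 198, 72, 22, 22, 348, 636, 99, 74, 32,
          22, 38, 24, 32, 79, 72, 255, 224, 67]
    n_smr = square_mean_root([(e + c) / 2 for e, c in zip(nE, nC)])  # 83.515
    lam = 11 / (11 + 8)  # equation 11.34: lambda-hat = 0.579

The interpolation itself is then carried out in table 11.7 (or in table 11.8 for correlations).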
Table 11.8 Conditional Probability λ(ρ, n) = Pr{r > 0 | |r| > C_r}

n     0      0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09   0.10   0.20   0.30   0.40   0.50   0.60   0.70   0.80   0.90
3     0.500  0.508  0.516  0.524  0.531  0.539  0.547  0.555  0.563  0.570  0.578  0.654  0.725  0.790  0.848  0.897  0.937  0.968  0.989
4     0.500  0.512  0.525  0.537  0.550  0.562  0.574  0.586  0.598  0.610  0.622  0.732  0.823  0.891  0.938  0.968  0.986  0.995  0.999
5     0.500  0.516  0.533  0.549  0.565  0.581  0.597  0.613  0.628  0.644  0.659  0.790  0.883  0.940  0.972  0.988  0.996  0.999  1.000
6     0.500  0.520  0.540  0.559  0.578  0.598  0.617  0.635  0.654  0.671  0.689  0.832  0.920  0.965  0.986  0.995  0.999  1.000
7     0.500  0.523  0.546  0.568  0.590  0.612  0.634  0.655  0.675  0.695  0.714  0.864  0.943  0.978  0.993  0.998  0.999  1.000
8     0.500  0.526  0.551  0.576  0.601  0.625  0.649  0.672  0.694  0.715  0.736  0.887  0.958  0.986  0.996  0.999  1.000
9     0.500  0.528  0.556  0.583  0.610  0.637  0.662  0.687  0.711  0.733  0.755  0.906  0.969  0.991  0.997  0.999  1.000
10    0.500  0.530  0.560  0.590  0.619  0.647  0.674  0.701  0.725  0.749  0.771  0.920  0.976  0.994  0.998  1.000
11    0.500  0.532  0.565  0.596  0.627  0.657  0.686  0.713  0.739  0.763  0.786  0.932  0.981  0.995  0.999  1.000
12    0.500  0.534  0.569  0.602  0.635  0.666  0.696  0.724  0.751  0.776  0.799  0.942  0.985  0.997  0.999  1.000
13    0.500  0.536  0.572  0.607  0.642  0.674  0.706  0.735  0.762  0.788  0.811  0.950  0.988  0.998  1.000
14    0.500  0.538  0.576  0.613  0.648  0.682  0.714  0.745  0.773  0.799  0.822  0.956  0.991  0.998  1.000
15    0.500  0.540  0.579  0.618  0.655  0.690  0.723  0.754  0.782  0.808  0.832  0.962  0.992  0.999  1.000
16    0.500  0.542  0.582  0.622  0.661  0.697  0.731  0.762  0.791  0.818  0.841  0.966  0.994  0.999  1.000
17    0.500  0.543  0.586  0.627  0.666  0.704  0.738  0.770  0.800  0.826  0.850  0.970  0.995  0.999  1.000
18    0.500  0.545  0.589  0.631  0.672  0.710  0.745  0.778  0.807  0.834  0.857  0.974  0.996  0.999  1.000
19    0.500  0.546  0.591  0.635  0.677  0.716  0.752  0.785  0.815  0.841  0.865  0.977  0.997  1.000
20    0.500  0.548  0.594  0.639  0.682  0.722  0.759  0.792  0.822  0.848  0.871  0.979  0.997  1.000
21    0.500  0.549  0.597  0.643  0.687  0.728  0.765  0.798  0.828  0.854  0.877  0.981  0.998  1.000
22    0.500  0.550  0.600  0.647  0.692  0.733  0.771  0.804  0.834  0.861  0.883  0.983  0.998  1.000
23    0.500  0.552  0.602  0.651  0.696  0.738  0.776  0.810  0.840  0.866  0.889  0.985  0.998  1.000
24    0.500  0.553  0.605  0.654  0.701  0.743  0.782  0.816  0.846  0.872  0.894  0.986  0.999  1.000
25    0.500  0.554  0.607  0.658  0.705  0.748  0.787  0.821  0.851  0.877  0.898  0.988  0.999  1.000
50    0.500  0.579  0.654  0.723  0.782  0.832  0.872  0.904  0.928  0.947  0.961  0.998  1.000
100   0.500  0.613  0.715  0.799  0.863  0.909  0.941  0.962  0.976  0.985  0.990  1.000
200   0.500  0.658  0.788  0.877  0.933  0.964  0.981  0.990  0.995  0.997  0.999  1.000
400   0.500  0.717  0.866  0.943  0.977  0.991  0.996  0.999  0.999  1.000  1.000  1.000

source: Bushman 1994.
notes: The table gives the probability that a correlation coefficient r from a sample of size n is positive given that the absolute value of r exceeds the α = .05 critical value for effect size ρ. Proportions less than .500 correspond to negative values of ρ.
11.5.2 Results All in the Same Direction

The method of maximum likelihood cannot be used if the MLE of π (that is, p = U/k) is 1 or 0, because then there is no unique value of p. If all the results are in the same direction, we can obtain a Bayes estimate of π (Hedges and Olkin 1985, 300). For example, we can assume that π has a truncated uniform prior distribution. That is, we assume that there is a value π0 such that any value of π in the interval [π0, 1] is equally likely. The Bayes estimate of π is

    p^* = \frac{(k+1)\left(1 - \pi_0^{\,k+2}\right)}{(k+2)\left(1 - \pi_0^{\,k+1}\right)}.    (11.36)

Example: Bayes Estimate for Results All in the Same Direction. Suppose that all the studies in data set II found positive results, that it is reasonable to assume that the value of π is in the interval [.5, 1], and that any value in the interval is equally likely. The Bayes estimate given in equation 11.36 is

    p^* = \frac{(19+1)\left(1 - .5^{19+2}\right)}{(19+2)\left(1 - .5^{19+1}\right)} = 0.952.

Unfortunately, one cannot compute a confidence interval for the Bayes estimate because the variance estimator for this Bayes estimate does not exist.
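The Bayes estimate in equation 11.36 is a one-line computation. A minimal sketch (ours):

    def bayes_estimate(k, pi0):
        # Equation 11.36: truncated uniform prior on [pi0, 1].
        return (k + 1) * (1 - pi0 ** (k + 2)) / ((k + 2) * (1 - pi0 ** (k + 1)))

    p_star = bayes_estimate(19, 0.5)  # 0.952, as in the example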
11.5.3 Missing Data

Missing effect size estimates are one of the largest problems facing the practicing meta-analyst. Sometimes research reports do not include enough information (for example, means, standard deviations, statistical tests) to permit the calculation of an effect size estimate. Unfortunately, the proportion of studies with missing effect size estimates in a research synthesis is often quite large, about 25 percent in psychological studies (Bushman and Wang 1995, 1996). While sophisticated missing data methods are available for meta-analysis (see chapter 21, this volume), the most commonly used solutions to the problem of missing effect size estimates are to (1) omit from the synthesis those studies with missing effect size estimates and analyze only complete cases, (2) set the missing effect size estimates equal to zero, (3) set the missing effect size estimates equal to the mean obtained from studies with effect size estimates, or (4) use the available information in a research report to derive a lower limit for the effect size estimate (Rosenthal 1994). Unfortunately, all of these procedures have serious problems that limit their usefulness (Bushman and Wang 1996). Solutions one and three make the unrealistic assumption that missing effect size estimates are missing completely at random; it is well known that nonsignificant effects are less likely to be reported than significant effects (Greenwald 1975). Imputing the value zero (solution two) or the mean (solution three) for missing effect sizes artificially decreases the variance of effect size estimates. Imputing the value zero (solution two) or using the lower limit for an effect size (solution four) underestimates the population effect size.

Our combined procedure (described in section 11.4) has at least five advantages over conventional procedures for dealing with the problem of missing effect size estimates (Bushman and Wang 1996). First, the combined procedure uses all the information available from studies in a meta-analysis; most of the other procedures ignore information about the direction and statistical significance of results. Second, the combined estimator is consistent (that is, if the number of studies k is large, the combined estimator will not underestimate or overestimate the population effect size). Third, the combined estimator is more efficient (that is, has a smaller variance) than either the effect size or vote-counting estimators. Fourth, the variance of the combined estimator is known; all of the conventional approaches for handling missing data either artificially inflate or artificially deflate the variance of the effect size estimates. Fifth, the combined estimator gives weight to all studies proportional to the Fisher information they provide (that is, no studies are overweighted).

11.6 CONCLUSIONS

We hope this chapter will improve the bad reputation that vote-counting procedures have gained. It is true that vote-counting procedures are not the method of choice. If studies include enough information to compute effect size estimates, then effect size procedures should be used. Moreover, the vote-counting procedures that are currently available are based on the fixed effects model, and they do not provide information about the potential heterogeneity of effects across studies. Often, however, not all studies contain enough information to compute an effect size estimate. If the studies with missing effects do provide information about the direction and statistical significance of results (for example, a table of means with no standard deviations or statistical tests) and a fixed effects estimate is desired, vote-counting procedures can and should be used. The meta-analyst can then combine the effect size estimate with the vote-counting estimates to obtain an overall estimate of the population effect size.
11.7 NOTES

1. We use the term positive to refer to effects in the predicted direction.
2. Larry Hedges (1986) found that estimators based on the third type of data are, at best, only 64 percent as efficient as estimators based on the first type of data. The maximum relative efficiency occurs when the population effect size equals zero for all studies. When the population effect size does not equal zero for all studies, relative efficiency decreases as the population effect size increases and as the sample size increases.
3. Generally the term independent variable refers to variables that are directly manipulated by the researcher. We also use the term to refer to variables that are measured by the researcher (for example, participant gender). Although this is not technically correct, it simplifies the discussion considerably.
4. We assume that the synthesist has a priori predictions about the direction of the relationship and is therefore using a one-sided test. For a two-sided test, we let Ti² be an estimator of θi². In general, it is not possible to use a two-sided test statistic Ti² to estimate θ if θ can be either positive or negative (Hedges and Olkin 1985, 52).

11.8 REFERENCES

Brown, Lawrence D., T. Tony Cai, and Anirban DasGupta. 2001. "Interval Estimation for a Binomial Proportion." Statistical Science 16(2): 101-33.
Bushman, Brad J. 1994. "Vote-Counting Procedures in Meta-Analysis." In The Handbook of Research Synthesis, edited by Harris Cooper and Larry V. Hedges. New York: Russell Sage Foundation.
Bushman, Brad J., and Morgan C. Wang. 1995. "A Procedure for Combining Sample Correlations and Vote Counts to Obtain an Estimate and a Confidence Interval for the Population Correlation Coefficient." Psychological Bulletin 117(3): 530-46.
Bushman, Brad J., and Morgan C. Wang. 1996. "A Procedure for Combining Sample Standardized Mean Differences and Vote Counts to Estimate the Population Standardized Mean Difference in Fixed Effects Models." Psychological Methods 1(1): 66-80.
Cohen, Jacob. 1988. Statistical Power Analysis for the Behavioral Sciences, 2nd ed. New York: Academic Press.
Conover, William J. 1980. Practical Nonparametric Statistics, 2nd ed. New York: John Wiley & Sons.
Dochy, Filip, Mien Segers, Piet Van den Bossche, and David Gijbels. 2003. "Effects of Problem-Based Learning: A Meta-Analysis." Learning and Instruction 13(5): 533-68.
Friedman, Lee. 2001. "Why Vote-Count Reviews Don't Count." Biological Psychiatry 49(2): 161-62.
Gibbons, Jean D., Ingram Olkin, and Milton Sobel. 1977. Selecting and Ordering Populations: A New Statistical Methodology. New York: John Wiley & Sons.
Greenwald, Anthony G. 1975. "Consequences of Prejudice Against the Null Hypothesis." Psychological Bulletin 82(1): 1-20.
Hedges, Larry V. 1986. "Estimating Effect Sizes from Vote Counts or Box Score Data." Paper presented at the annual meeting of the American Educational Research Association, San Francisco, Calif. (April 1986).
Hedges, Larry V., and Ingram Olkin. 1980. "Vote-Counting Methods in Research Synthesis." Psychological Bulletin 88(2): 359-69.
Hedges, Larry V., and Ingram Olkin. 1985. Statistical Methods for Meta-Analysis. New York: Academic Press.
Jewell, George, and Mark E. McCourt. 2000. "Pseudoneglect: A Review and Meta-Analysis of Performance Factors in Line Bisection Tasks." Neuropsychologia 38(1): 93-110.
Lee, Valerie E., and Anthony S. Bryk. 1989. "Effects of Single-Sex Schools: Response to Marsh." Journal of Educational Psychology 81(4): 647-50.
Light, Richard J., and Paul V. Smith. 1971. "Accumulating Evidence: Procedures for Resolving Contradictions Among Different Research Studies." Harvard Educational Review 41(4): 429-71.
Mann, Charles C. 1994. "Can Meta-Analysis Make Policy?" Science 266(11): 960-62.
Rafaeli-Mor, Eshkol, and Jennifer Steinberg. 2002. "Self-Complexity and Well-Being: A Review and Research Synthesis." Personality and Social Psychology Review 6(1): 31-58.
Richard, F. Daniel, Charles F. Bond, and Juli J. Stokes-Zoota. 2003. "One Hundred Years of Social Psychology Quantitatively Described." Review of General Psychology 7(4): 331-63.
Rosenthal, Robert. 1994. "Parametric Measures of Effect Size." In The Handbook of Research Synthesis, edited by Harris Cooper and Larry V. Hedges. New York: Russell Sage Foundation.
Saroglou, Vassilis. 2002. "Religion and the Five Factors of Personality: A Meta-Analytic Review." Personality and Individual Differences 32(1): 15-25.
Warner, James. 2001. "Quality of Evidence in Meta-Analysis." British Journal of Psychiatry 179(July): 79.
12
EFFECT SIZES FOR CONTINUOUS DATA

MICHAEL BORENSTEIN
Biostat

CONTENTS
12.1 Introduction 222
12.1.1 Effect Sizes and Treatment Effects 222
12.1.2 Effect Sizes Rather than p-Values 223
12.1.3 Effect-Size Parameters and Sample Estimates of Effect Sizes 223
12.1.4 The Variance of the Effect Size 223
12.1.5 Calculating Effect-Size Estimates from Reported Information 224
12.2 The Raw (Unstandardized) Mean Difference D 224
12.2.1 Computing D, Independent Groups 224
12.2.2 Computing D, Pre-Post Scores or Matched Groups 225
12.2.3 Including Different Study Designs in the Same Analysis 225
12.3 The Standardized Mean Difference d and g 225
12.3.1 Computing d and g, Independent Groups 226
12.3.2 Computing d and g, Pre-Post Scores or Matched Groups 227
12.3.3 Computing d and g, Analysis of Covariance 228
12.3.4 Including Different Study Designs in the Same Analysis 230
12.4 Correlation as an Effect Size 231
12.4.1 Computing r 231
12.5 Converting Effect Sizes from One Metric to Another 231
12.5.1 Converting from the Log Odds Ratio to d 232
12.5.2 Converting from d to the Log Odds Ratio 233
12.5.3 Converting from r to d 233
12.5.4 Converting from d to r 234
12.6 Interpreting Effect Sizes 234
12.7 Resources 234
12.8 Acknowledgments 234
12.9 References 235

12.1 INTRODUCTION

In any meta-analysis, we start with summary data from each study and use it to compute an effect size for the study. An effect size is a number that reflects the magnitude of the relationship between two variables. For example, if a study reports the mean and standard deviation for the treated and control groups, we might compute the standardized mean difference between groups. Or, if a study reports events and nonevents in two groups, we might compute an odds ratio. It is these effect sizes that are then compared and combined in the meta-analysis.

Consider figure 12.1, the forest plot of a fictional meta-analysis to assess the impact of an intervention. In this plot, each study is represented by a square, bounded on either side by a confidence interval. The location of each square on the horizontal axis represents the effect size for that study. The confidence interval represents the precision with which the effect size has been estimated, and the size of each square is proportional to the weight that will be assigned to the study when computing the combined effect. This figure also serves as the outline for this chapter, in which I discuss what these items mean and how they are computed. This chapter addresses effect sizes for continuous outcomes such as means and correlations (for effect sizes for binary outcomes, see chapter 13, this volume).

12.1.1 Effect Sizes and Treatment Effects

Meta-analyses in medicine often refer to the effect size as a treatment effect, and this term is sometimes assumed to refer to odds ratios, risk ratios, or risk differences, which are common in meta-analyses that deal with medical interventions. Similarly, meta-analyses in the social sciences often refer to the effect size simply as an effect size, and this term is sometimes assumed to refer to standardized mean differences or to correlations, which are common in social science meta-analyses.

In fact, though, both the terms effect size and treatment effect can refer to any of these indices, and the distinction between the terms lies not in the index but rather in the nature of the study. The term effect size is appropriate when the index is used to quantify the relationship between any two variables or a difference between any two groups. By contrast, the term treatment effect is appropriate only for an index used to quantify the impact of a deliberate intervention. Thus, the difference between males and females could be called an effect size only, while the difference between treated and control groups could be
Impact of Intervention

Study      Mean Difference   Total N   Variance   Standard Error   p-value
A          0.400             60        0.067      0.258            0.121
B          0.200             600       0.007      0.082            0.014
C          0.300             100       0.040      0.200            0.134
D          0.400             200       0.020      0.141            0.005
E          0.300             400       0.010      0.100            0.003
F          0.200             200       0.020      0.141            0.157
Combined   0.214                       0.003      0.051            0.000

[The figure also plots, for each study and for the combined result, the mean difference and its 95 percent confidence interval on an axis running from -1.0 to 1.0.]

Figure 12.1 Fictional Meta-Analysis Showing Impact of an Intervention
source: Author's compilation.
called either an effect size or a treatment effect. Note, however, that the classification of an index as an effect size or a treatment effect has no bearing on the computations.

Four major considerations should drive the choice of an effect-size index. The first is that the effect sizes from the different studies should be comparable to one another in the sense that they measure, at least approximately, the same thing. That is, the effect size should not depend on aspects of study design that may vary from study to study (such as sample size or whether covariates are used). The second is that the effect size should be substantively interpretable. This means that researchers in the substantive area of the work represented in the synthesis should find the effect size meaningful. The third is that estimates of the effect size should be computable from the information likely to be reported in published research reports. That is, it should not require reanalysis of the raw data. The fourth is that the effect size should have good technical properties. For example, its sampling distribution should be known so that variances and confidence intervals can be computed.

12.1.2 Effect Sizes Rather than p-Values

Reports of primary research typically include the p-value corresponding to a test of significance. This p-value reflects the likelihood that the sample would have yielded the observed effect (or one more extreme) if the null hypothesis were true.

Researchers often use the p-value as a surrogate for the effect size, with a significant p-value taken to imply a large effect and a nonsignificant p-value taken to imply a trivial effect. In fact, however, while the p-value is partly a function of effect size, it is also partly a function of sample size. A p-value of 0.01 could reflect a large effect but could also reflect a trivial effect in a large sample. Conversely, a p-value of 0.20 could reflect a trivial effect but could also reflect a large effect in a small sample. In figure 12.1, for example, study A has a p-value of 0.12 whereas study B has a p-value of 0.01, but it is study A that has the larger effect size (0.40 versus 0.20) (see, for example, Borenstein 1994).

In primary studies we can avoid this kind of confusion by reporting the effect size and the precision separately. The former gives us a pure estimate of the effect in the sample, and the latter gives us a range for the effect in the population. Similarly, in a meta-analysis, we need to work with a pure measure of the effect size from each primary study to determine if the effects are consistent, and to compute a combined estimate of the effect across studies. Here, the precision of each effect is used to assign a weight to that effect in these analyses.

12.1.3 Effect-Size Parameters and Sample Estimates of Effect Sizes

Throughout this chapter we make the distinction between an underlying effect-size parameter (denoted by the Greek letter θ) and the sample estimate of that parameter (denoted by T).

If a study had an infinitely large sample size, it would yield an effect size T that was identical to the population parameter θ. In fact, though, sample sizes are finite, and so the effect-size estimate T always differs from θ by some amount. The value of T will vary from sample to sample, and the distribution of these values is the sampling distribution of T. Statistical theory allows us to compute the variance of the effect-size estimates.

12.1.4 The Variance of the Effect Size

A dominant factor in the variance is the sample size, with larger studies having a smaller variance and yielding a more precise estimate of the effect-size parameter. Using a matched design or including a covariate to reduce the error term will usually lead to a lower variance and a more precise estimate. Additionally, the variance of any given effect size is affected by specific factors that vary from one effect-size index to the next.

The sampling distribution of an effect-size estimate T can be expressed as a standard error or as a variance (which is simply the square of the standard error). When our focus is on the effect size for a single study, we generally work with the standard error of the effect size, which in turn may be used to compute confidence intervals about the effect size. In figure 12.1, for example, study E has four times the sample size of study C (400 versus 100). Its standard error (the square root of the variance) is therefore half as large (0.10 versus 0.20) and its confidence interval half as wide as that of study C. (In this example, all other factors that could affect the precision were held constant.)

By contrast, in a meta-analysis we work primarily with the variance rather than the standard error. In a fixed effect analysis, for example, we weight by the inverse variance, or 1/v. In figure 12.1, study E has four times the sample size of study C (400 versus 100) and its variance is therefore one-fourth as large (0.01 versus 0.04). The square
representing study E has four times the area of the one for study C, reflecting the fact that it will be assigned four times as much weight in the analysis.
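The inverse-variance weighting described here is trivial to compute. A one-line sketch (ours), using the variances of studies C and E from figure 12.1:

    v_C, v_E = 0.040, 0.010       # variances from figure 12.1
    w_C, w_E = 1 / v_C, 1 / v_E   # fixed effect weights, w = 1/v
    print(w_E / w_C)              # 4.0: study E gets four times the weight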
12.1.5 Calculating Effect-Size Estimates from Reported Information

When researchers have access to a full set of summary data, such as means, standard deviations, and sample sizes for each group, the computation of the effect size and its variance is relatively straightforward. In practice, however, researchers will often find themselves working with only partial data. For example, a paper may publish only the p-value and sample sizes from a test of significance, leaving it to the meta-analyst to back-compute the effect size and variance. For this reason, each of the following sections includes a table that shows how to compute the effect size and variance from some of the more common reporting formats. For additional information on computing effect sizes from partial information, see the resources section at the end of this chapter.

12.2 THE RAW (UNSTANDARDIZED) MEAN DIFFERENCE D

When the outcome is reported on a meaningful scale and all studies in the analysis use the same scale, the meta-analysis can be performed directly on the raw mean difference. The primary advantage of the raw mean difference is that it is intuitively meaningful, either inherently (blood pressure, for example) or because of widespread use (for example, scores on a national achievement test for students).

Consider a study that reports means for two groups (treated and control) and suppose we wish to compare the means of these two groups. Let μ1 and μ2 be the true (population) means of the two groups. The population mean difference is defined as

    \Delta = \mu_1 - \mu_2.    (12.1)

In this section I show how to estimate Δ from studies that use two independent groups, and from studies that use matched groups or a pre-post design.

12.2.1 Computing D, Independent Groups

We can estimate the mean difference Δ from a study that uses two independent groups as follows. Let Ȳ1 and Ȳ2 be the sample means of the two independent groups. The sample estimate of Δ is just the sample mean difference, namely

    D = \bar Y_1 - \bar Y_2.    (12.2)

Note that uppercase D is used for the raw mean difference, whereas lowercase d will be used for the standardized mean difference (section 12.3).

Let S1 and S2 be the sample standard deviations of the two groups, and n1 and n2 the sample sizes in the two groups. If we assume that the two population standard deviations are the same (as is assumed in most parametric data analysis techniques), so that σ1 = σ2 = σ, then the variance of D is

    v_D = \frac{n_1 + n_2}{n_1 n_2}\, S^2_{Pooled},    (12.3)

where

    S^2_{Pooled} = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}.    (12.4)

If we do not assume that the two population standard deviations are the same, then the variance of D is

    v_D = \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}.    (12.5)

In either case, the standard error of D is the square root of vD,

    SE_D = \sqrt{v_D}.    (12.6)

For example, suppose that a study has sample means Ȳ1 = 103, Ȳ2 = 100, sample standard deviations S1 = 5.5, S2 = 4.5, and sample sizes n1 = n2 = 50. The raw mean difference D is

    D = 103 - 100 = 3.000.

If we assume that σ1² = σ2², then the pooled standard deviation within groups is

    S_{Within} = \sqrt{\frac{(50-1)(5.5)^2 + (50-1)(4.5)^2}{50 + 50 - 2}} = 5.0249.

The variance and standard error of D are then

    v_D = \frac{50 + 50}{50 \times 50}\,(5.0249)^2 = 1.0100

and

    SE_D = \sqrt{1.0100} = 1.0050.

If we do not assume that σ1² = σ2², then the variance and standard error of D are

    v_D = \frac{(5.5)^2}{50} + \frac{(4.5)^2}{50} = 1.0100

and

    SE_D = \sqrt{1.0100} = 1.0050.

In this example both formulas yield the same result because n1 = n2.
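These computations can be captured in a few lines. A minimal sketch (ours; Python standard library) of equations 12.2 through 12.6, reproducing the worked example:

    from math import sqrt

    def raw_mean_difference(m1, m2, s1, s2, n1, n2, pooled=True):
        d = m1 - m2
        if pooled:  # equal population variances assumed (equations 12.3-12.4)
            s2_pooled = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
            v = (n1 + n2) / (n1 * n2) * s2_pooled
        else:       # unequal variances (equation 12.5)
            v = s1 ** 2 / n1 + s2 ** 2 / n2
        return d, v, sqrt(v)  # D, v_D, SE_D (equation 12.6)

    # Worked example: D = 3.000, v_D = 1.0100, SE_D = 1.0050.
    print(raw_mean_difference(103, 100, 5.5, 4.5, 50, 50))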
12.2.2 Computing D, Pre-Post Scores or Matched Groups

We can estimate the mean difference Δ from a study that uses two matched groups as follows. Let Ȳ1 and Ȳ2 be the sample means of the two groups. The sample estimate of Δ is just the sample mean difference, namely

    D = \bar Y_1 - \bar Y_2.    (12.7)

However, for the matched study, the variance of D is computed as

    v_D = \frac{S^2_{Difference}}{n},    (12.8)

where n is the number of pairs and SDifference is the standard deviation of the paired differences,

    S_{Difference} = \sqrt{S_1^2 + S_2^2 - 2\,r\,S_1 S_2},    (12.9)

where r is the correlation between siblings in matched pairs. As r moves toward 1.0, the standard error of the paired difference will decrease.

The formulas for matched designs apply to pre-post designs as well. The pre and post means correspond to the means in the matched groups, n is the number of subjects, and r is the correlation between pre-scores and post-scores.
to create an index (the standardized mean difference) that
12.2.3 Including Different Study Designs in the would be comparable across studies (1976). The same
Same Analysis approach had been suggested earlier by Jacob Cohen in
connection with describing the magnitude of effects in
To include data from independent groups and matched statistical power analysis (1969, 1977).
groups in the same meta-analysis, one would simply The standardized mean difference can be thought of as
compute the raw mean difference and its variance for being comparable across studies based on either of two
each study using the appropriate formulas. The effect size arguments (Hedges and Olkin 1985). If the outcome mea-
has the same meaning regardless of the study design, and sures in all studies are linear transformations of each
226 STATISTICALLY DESCRIBING STUDY OUTCOMES

other the standardized mean difference can be seen as the 1  2  ), it is unlikely that the sample estimates S1
mean difference that would have been obtained if all data and S2 will be identical. By pooling, we obtain a more
were transformed to a scale where the standard deviation accurate estimate of their common value.
within-groups was equal to 1.0. This index, the sample estimate of the standardized
The other argument for comparability of standardized mean difference, is often called Cohens d in research
mean differences is that the standardized mean difference synthesis. Some confusion about the terminology has re-
is a measure of overlap between distributions. In this tell- sulted from the fact that the index , which Cohen origi-
ing, the standardized mean difference reflects the differ- nally proposed as a population parameter for describing
ence between the two distributions, how each represents the size of effects for statistical power analysis is also
a distinct cluster of scores even if they do not measure sometimes called d. In this handbook, we use the symbol
exactly the same outcome (see Cohen 1977; Grissom and  to denote the effect-size parameter and d for the sample
Kim 2005). estimate of that parameter.
Consider a study that uses independent groups, and The variance of d is given (to a very good approxima-
suppose we wish to compare their means. Let 1 and 1 tion) by
be the true (population) mean and standard deviation of
n n d 2
the first group and let 2 and 2 be the true (population) vd  ______ ________
n1n2  2(n n ) ,
1 2
(12.13)
1 2
mean and standard deviation of the other group. If the
two population standard deviations are the same (as is In this equation the first term on the right expresses the
assumed in most parametric data analysis techniques), so uncertainty in the estimate of the mean difference [the
that 1  2  , then the standardized mean difference numerator in formula 12.11], and the second expresses
parameter or population standardized mean difference is the uncertainty in the estimate of SWithin [the denominator
defined as in formula 12.11].
 
The standard error of d is the square root of vd,
  _______
1 2
 . (12.10)
__
SEd  vd . (12.14)
In this section I show how to estimate  from studies
that use independent groups, studies that use pre-post or It turns out that d has a slight bias, tending to overesti-
matched group designs, and studies that use analysis of mate the absolute value of  in small samples This bias
covariance. can be removed by a simple correction that yields an un-
biased estimate of . This estimate is sometimes called
Hedges g (Hedges 1981). To convert from d to Hedges g
12.3.1 Computing d and g, Independent Groups we use a correction factor, J, defined as
We can estimate the standardized mean difference () 3
from studies that use two independent groups as J(df )  1 _______
4 df 1
. (12.15)
__ __
Y Y In this expression, df is the degrees of freedom used to
d  ______
1
S
2
. (12.11)
Within estimate SWithin , which for two independent groups is
__ __
In the numerator, Y1 and Y2 are the sample means in the n1  n2  2. Then,
two groups. In the denominator SWithin is the within-groups
standard deviation, pooled across groups, g  j(df )d, (12.16)
________________ vg  [J(df )] vd ,
2
(12.17)
(n1 1)S1  (n2  1)S2

2 2
SWithin  __________________
n n 2
, (12.12)
1 2
and
where n1, n2 are the sample size in the two groups, and __
S1, S2 are the standard deviations in the two groups. The SEg  vg . (12.18)
reason that we pool the two sample estimates of the stan-
dard deviation is that even if we assume that the underly- __ For example,
__ suppose a study has sample means
ing population standard deviations are the same (that is Y1  103, Y2  100, sample standard deviations S1  5.5,
EFFECT SIZES FOR CONTINUOUS DATA 227

S2  4.5, and sample sizes n1  n2  50. We would esti- Table 12.1 provides formulas for computing d and its
mate the pooled-within-groups standard deviation as variance for independent groups, working from informa-
______________________ tion that may be reported in a published paper.
(50 1)  5.52  (50 1)  4.52

SWithin  __________________________
50  50  2
 5.0249.
12.3.2 Computing d and g, Pre-Post Scores
Then, or Matched Groups
103 100
d  _________
5.0249
 0.5970, We can estimate the standardized mean difference ()
from studies that use matched groups or pre-post scores
50  50 _________ 2
vd  _______  0.5970  0.0418,
50  50 2(50  50) in one group. The formula for the sample estimate of d is
__ __
Y Y
and d  ______
1
S
2
. (12.19)
Within
______
SEd  0.0418  0.2045. This is the same formula as for independent groups,
12.11. However, there is an important difference in the
The correction factor J, Hedges g, its variance and mechanism used to compute SWithin. When we are working
standard error are given by with independent groups, the natural unit of deviation is
the standard deviation within groups. Therefore, this
( 3
J(50  50  2)  1 _________ )
4  98 1  0.9923, value is typically reported, or easily imputed. By contrast,
when we are working with matched groups, the natural
g  0.9923  0.5970  0.5924, unit of deviation is the standard deviation of the differ-
vg  0.99232  0.0418  0.0412, ence scores, and so this is the value that is likely to be
reported. To compute d from this kind of study, we need
and to impute the standard deviation within groups, which
______ would then serve as the denominator in formula 12.19.
SEg  0.0412  .2030. When working with a matched study, the standard de-
viation within groups can be imputed from the standard
The correction factor J is always less than 1.0, and so g deviation of the difference, using
will always be less than d in absolute value, and the vari- SDifference
ance of g will always be less than the variance of d. How- SWithin  ________
_______ , (12.20)
2(1 r)
ever, J will be very close to 1.0 unless df is very small
(say, less than 10) and so (as in this example) the differ- where r is the correlation between pairs of observations
ence is usually trivial. (for example, the correlation between pre-test and post-
Some slightly different expressions for the variance of test). Then we can apply formula 12.19 to compute d. The
d (and g) have been given by different authors and even variance of d is given by
by the same authors at different times. For example, the
denominator of the second term of the variance of d is ( d2
vd  __1n  ___ )
2n 2(1 r), (12.21)
given here as 2(n1  n2). This expression is obtained by
one method (assuming the ns become large with  fixed). where n is the number of pairs. The standard error of d is
An alternate
__
derivation (assuming ns become large just the square root of v,
with n  fixed) leads to a denominator in the second term __
that is slightly different, namely, 2(n1  n2 2). Unless n1 SEd  vd . (12.22)
and n2 are very small, these expressions will be almost
identical. Because the correlation between pre-test post-test
Similarly, the expression given here for the variance of scores is required to impute the standard deviation within
g is J2 times the variance of d, but many authors ignore the groups from the standard deviation of the difference, we
J2 term because it is so close to unity in most cases. Again, must assume that this correlation is known or can be esti-
though it is preferable to include this correction factor, mated with high precision. Often, the researcher will need
including it is likely to make little practical difference. to use data from other sources to estimate this correlation.
228 STATISTICALLY DESCRIBING STUDY OUTCOMES

Table 12.1 Formulas for Computing d in Designs with


Independent Groups
Reported Computation of Needed Quantities
__ __ Y1  Y2 n n d 2
Y1, Y2 SPooled, n1, n2 d  ______
SPooled
, v  ______
1 2 ________
n1n2  2(n  n ) 1 2
_____
n1  n2 n1  n2 ________
t, n1, n2 ________
d  t ______
n1n2 , v  ______ d2
n1n2  2(n1  n2)

F(n  n ) n n
d  _________
2
d
F, n1, n2 nn
1
1 2
, v  ______
2 ________
n n  2(n  n )
1
1 2
2

1 2
_____
n n n n
d  t (p)______
2
______ d
________
n n , v  n n  2(n  n )
p(one-tailed), n1, n2 1 1 2 1 2
1 2 1 2 1 2
_____
n n n n
d  t ( __p )______
2
______ d
________
n n , v  n n  2(n  n )
p(two-tailed), n1, n2 1 1 2 1 2

2 1 2 1 2 1 2

source: Authors compilation.


note: The function t1(p) is the inverse of the cumulative distribution function of
students t with n1  n2  2 degrees of freedom. Many computer programs and
spreadsheets provide functions that can be used to compute t1. Assume n1  n2 10,
so that df 18. Then, in Excel, for example, if the reported p-value is 0.05 (two-
tailed) TINV(p,df )  TINV(0.05,18) will return the required value (2.1009). If the
reported p-value is 0.05 (one-tailed), TINV(2p,df)  TINV(0.10,18) will return the
required value 1.7341. The F in row 3 of the table is the F-statistic from a one-way
analysis of variance. In rows 35 the sign of d must reflect the direction of the mean
difference.

If the correlation is not known precisely, one could work The correction factor J, Hedges g, its variance and
with a range of plausible correlations and use a sensitivity standard error are given by
analysis to see how these affect the results.
To compute Hedges g and associated statistics, we
3
J(n 1)  1 _________ (
4  49 1  0.9846, )
would use formulas 12.15 through 12.18. The degrees of
freedom for computing J is n  1, where n is the number g  0.9846  0.4225  0.4160,
of pairs. For example, suppose__ that a __
study has pretest vg  0.98462  0.0131 0.0127,
and posttest sample means Y1  103, Y2  100, sample
standard deviation of the difference SDifference  5.5, and and
sample size n  50 and a correlation between pretest and ______
posttest of r  0.7. SEg  0.0127  0.1127.
The standard deviation within groups is imputed from
the standard deviation of the difference by Table 12.2 provides formulas for computing d and its
variance for matched groups, working from information
5.5
SWithin  __________
________  7.1000. that may be reported in a published paper.
2(1 0.7)
Then d, its variance and standard error are computed as 12.3.3 Computing d and g, Analysis of
103 100 Covariance
d  _________
7.1000  0.4225,
We can estimate the standardized mean difference ()
(1
vd  ___
50
0.42252
 _______
2  50 ()2(1 0.7) )  0.0131, from studies that use analysis of covariance. The formula
for the sample estimate of d is
and __ __
Y 1Adjusted  Y 2Adjusted
______ d  ______________ . (12.23)
SEd  0.0131  .1145. SWithin
EFFECT SIZES FOR CONTINUOUS DATA 229

Table 12.2 Formulas for Computing d in Designs with Paired Groups

Reported Computation of Needed Quantities


__ __
__ __ Y1  Y2 _______
Y1, Y2 SDifference, r, n (number of pairs) (
d  _______
SDifference
______
) d2
2(1 r) , v  __1n  ___
2n
2(1 r) ( )
t (from paired t-test), r, n
2(1 r)
d  t _______
n , ( d2
v  __1n  ___
2n
2(1 r))
_______
F (from repeated measures ANOVA), r, n
2F(1 r)
d   ________
n , ( d2
v  __1n  ___
2n
2(1 r) )
______
p (one-tailed), r, n
2(1 r)
d  t1(p) _______
n , ( ) d2
v  __1n  ___
2n
2(1 r)
______

( 2 )
2(1 r)
d  t1 __p _______ v  ( __1n  ___
2n )
2
d
p (two-tailed), r, n n , 2(1 r)

source: Authors compilation.


note:The function t1(p) is the inverse of the cumulative distribution function of Students t with n 1 de-
grees of freedom. Many computer programs and spreadsheets provide functions that can be used to compute
t1. Assume n 19, so that df 18. Then, in Excel, for example, if the reported p-value is 0.05 (2-tailed),
TINV(p,df)  TINV(0.05,18) will return the required value (2.1009). If the reported p-value is 0.05
(1-tailed), TINV(2p,df)  TINV(0.10,18) will return the required value 1.7341. The F in row 3 of the table
is the F-statistic from a one-way repeated measures analysis of variance. In rows 35 the sign of d must
reflect the direction of the mean difference.

This is similar to the formula for independent groups SDifference in matched studies. Equivalently, SWithin may be
(12.11), but with two differences. The first difference is in computed as
the numerator, where we use adjusted means rather than ________


raw means. This has no impact on the expected value of MSWAdjusted
SWithin  _________
1 R2
. (12.25)
the mean difference, but serves to increase precision and
possibly removes bias due to pretest differences.
The second is in the mechanism used to compute SWithin. where MSWAdjusted is the covariate adjusted mean square
When we were working with independent groups, the within treatment groups.
natural unit of deviation was the unadjusted standard de- The variance of d is given by
viation within groups. Therefore, this value is typically (n  n )(1 R2) 2
d
reported, or easily imputed. By contrast, the analysis of v  _____________
1 2
n1n2  ________
2(n  n ) (12.26)
1 2
covariance works with the adjusted standard deviation
(typically smaller than the unadjusted value since vari- where n1 and n2 are the sample size in each group, and R
ance explained by covariates has been removed). There- is the multiple correlation between the covariates and the
fore, to compute d from this kind of study, we need to outcome.
impute the unadjusted standard deviation within groups, To compute Hedges g and associated statistics we
which is given by would use formulas 12.15 through 12.18. The degrees of
freedom for computing J is n1  n2  2  q, where n1 and
SAdjusted
SWithin  _______
_____ , (12.24) n2 are the number of subjects in each group, 2 is the num-
1 R 2
ber of groups, and q is the number of covariates.
__ For example,
__ suppose that a study has sample means
where SAdjusted is the covariate adjusted standard deviation Y1  103, Y2  100, sample standard deviations SAdjusted 
and R is the covariate outcome correlation (or multiple 5.5, and sample sizes n1  n2  50. Suppose that we know
correlation if there is more than one covariate). Note the that the covariate-outcome correlation for the single co-
similarity to (12.20) which we used to impute SWithin from variate (so that q  1) is R  0.7.
230 STATISTICALLY DESCRIBING STUDY OUTCOMES

Table 12.3 Formulas for Computing d in Designs with Independent Groups Using Analysis of
Covariance (ANCOVA)

Reported Computation of Needed Quantities


__ __
__ __ Y  Y2Adjusted
Adjusted
(n  n )(1 R2) d 2
Y1, Y2, SPooled, n1, n2, R, q d  _______________
1
SPooled , v  _____________
1 2
n1n2  ________
2(n  n ) 1 2
_____
n1  n2 _____2 (n1  n2)(1 R2) ________
________
2
t (from ANCOVA), n1, n2, R, q d  t ______
n1n2 1 R , v  _____________
n1n2  2(n d n )
1 2

F(n  n ) _____ (n  n )(1 R ) ________


d  _________
2 2
F (from ANCOVA), n1, n2, R, q nn
1
1 R , v  _____________
1 2
2
nn
2
 2(n d n )
1 2
1 2 1 2
_____
n  n _____ (n  n )(1 R ) ________
d  t (p)______
2 2
p (one-tailed, from ANCOVA), n1, n2, R, q 1
n n 1 R , v 
1
1 2
_____________
2
nn
2
 2(n d n )
1 2
1 2 1 2
_____
n  n _____ (n  n )(1 R ) ________
d  t ( __p )______
2 2
p (two-tailed, from ANCOVA), n1, n2, R, q 1
n n 1 R , v 
1 _____________
2
nn
2
 2(n d n )
1 2

2 1 2 1 2 1 2

source: Authors compilation.


note: The function t1(p) is the inverse of the cumulative distribution function of students t with n1  n2 2 q degrees of
freedom, q is the number of covariates, and R is the covariate outcome correlation or multiple correlation. Many computer
programs and spreadsheets provide functions that can be used to compute t1. Assume n1  n2  11, and q  2, so that
df 18. Then, in Excel, for example, if the reported p-value is 0.05 (2-tailed), TINV(p, df)  TINV(0.05,18) will return
the required value (2.1009). If the reported p-value is 0.05 (1-tailed), TINV(2p, df)  TINV(0.10, 18) will return the required
value 1.7341. The F in row 3 of the table is the F-statistic from a one-way analysis of covariance. In rows 35 the sign of d
must reflect the direction of the mean difference.

The unadjusted standard deviation within groups is im- Table 12.3 provides formulas for computing d and its
puted from SAdjusted by variance for analysis of covariance, working from infor-
SAdjusted mation that may be reported in a published paper.
SWithin  ________
_______  7.7015.
1 0.72
12.3.4 Including Different Study Designs in the
Then d, its variance and standard error are computed as Same Analysis
103 100 To include data from two or three study designsinde-
d  _________
7.7015
 0.3895,
pendent groups, matched groups, analysis of covariance
(50  50)(1 0.72) 0.3895 2
in the same meta-analysis, one simply computes the effect
vd  ________________
50  50
 _______________
2(50  50  2 1)
 0.0222,
size and variance for each study using the appropriate for-
and mulas. The effect sizes have the same meaning regardless
______
of the study design, and the variances have been computed
SEd  0.0222  .1490. to take account of the study design. We can therefore use
these effect sizes and variances to perform a meta-analysis
The correction factor J, Hedges g, its variance and without concern for the different designs in the original
standard error are given by studies. Note that the phrase different study designs is used
here to refer to the fact that one study used independent
( 3
J(50  50  21)  1 _________
4  97 1
 0.9922, ) groups and another used a pre-post design or matched
groups, and yet another included covariates. It does not
g  0.9922  0.3895  0.3834, refer to the case where one study was a randomized trial
vg  0.99222  0.0222  0.0219, or quasi-experimental design and another was observa-
tional. The issues involved in combining these kinds of
and studies are beyond the scope of this chapter. __ __
______ For __ study designs the direction of the effect (Y1  Y2
__ all
Sg  0.0219  .1480. or Y2  Y1) is arbitrary, except that the researcher must
EFFECT SIZES FOR CONTINUOUS DATA 231

decide on a convention and then apply this consistently. with standard error
For example, if a positive difference will indicate that the ___
treated group did better than the control group, then this SEz  Vz . (12.30)
convention must apply for studies that used independent
designs and for studies that used pre-post designs. It must The effect size z and its variance would be used in the
also apply for all outcome measures. In some cases (for analysis, which would yield a combined effect, confi-
example, if some studies defined outcome as the number dence limits, and so on, in the Fishers z metric. We could
of correct answers while others defined outcome as the then convert each of these values back to correlation units
number of mistakes) it might be necessary to reverse the using
computed sign of the effect size to ensure that the con- e 1 ,2z
vention is applied consistently. r  ______
e2z 1
(12.31)

where ex  exp(x) is the exponential (anti-log) function.


12.4 CORRELATION AS AN EFFECT SIZE
For example, if a study reports a correlation of 0.50 with
The correlation coefficient itself can serve as the effect- a sample size of 100, we would compute
size index. The correlation is an intuitive measure that,
like , has been standardized to take account of different (
1 0.5
z  0.5  ln ______
1 0.5
 0.5493, )
metrics in the original scales. The population parameter
1
is denoted by . vz  ______
1003  0.0103.

12.4.1 Computing r and


______
The estimate of the correlation parameter  is simply the SEz  0.0103  0.1015.
sample correlation coefficient, r. The variance of r is ap-
proximately To convert the Fishers z value back to a correlation,
(1 r2)2
we would use
vr  _______
n 1
, (12.27)
e2(0.5493)
1
r  _________
e2(0.5493) 1
 0.5000.
where n is the sample size.
Most meta-analysts do not perform syntheses on the
correlation coefficient itself because the variance depends 12.5 CONVERTING EFFECT SIZES FROM
so strongly on the correlation (see, however, chapter 17, ONE METRIC TO ANOTHER
this volume; Hunter and Schmidt 2004). Rather, the cor-
relation is converted to the Fishers z scale (not to be con- Earlier I discussed the case where different study designs
fused with the z-score used with significance tests), and were used to compute the same effect size. For example,
all analyses are performed using the transformed values. studies that used independent groups and studies that
The results, such as the combined effect and its confi- used matched groups were both used to yield estimates of
dence interval, would then be converted back to correla- the mean difference D, or studies that used independent
tions for presentation. This is analogous to the procedure groups, studies that used matched groups, and studies
used with odds ratios or risk ratios where all analyses are that used analysis of covariance were all used to yield
performed using log transformed values, and then con- estimates of the standardized mean difference, d or g.
verted back to the original metric. There is no problem in combining these estimates in a
The transformation from sample correlation r to Fish- meta-analysis because the effect size has the same mean-
ers z is given by ing in all studies.
Consider, however, the case where some studies report
( )
1 r .
z  0.5  ln ____
1 r
(12.28) a difference in means, which is used to compute a stan-
dardized mean difference. Others report a difference in
The variance of z is (to an excellent approximation) proportions which is used to compute an odds ratio (see
1 , chapter 13, this volume). And others report a correlation.
vz  ____
n3 (12.29) All the studies address the same question, and we want to
232 STATISTICALLY DESCRIBING STUDY OUTCOMES

Binary data Continuous data Correlational data

Standardized
Log odds ratio Mean Difference Correlation
(Cohens d )

Bias-corrected
Standardized
Mean Difference
(Hedges g)

Figure 12.2 Converting Among Effect Sizes


source: Authors compilation.
note: This schematic outlines the mechanism for incorporating multiple kinds of data in the same
meta-analysis. First, each study is used to compute an effect size and variance in its native
index log odds ratio for binary data, d for continuous data, and r for correlational data. Then, we
convert all of these indices to a common index, which would be either the log odds ratio, d, or r.
If the final index is d, we can move from there to Hedges g. This common index and its variance
are then used in the analysis.

include them in one meta-analysis. Unlike the earlier mation. It may also be better than the alternative of sepa-
case, we are now dealing with different indices, and we rately summarizing studies with effect sizes in different
need to convert them to a common index before we can metrics, which leads to not one, but several summaries.
proceed. As in section 12.3.4, we assume that all the stud- The decision to use these transformations or to exclude
ies are randomized trials, all are quasi-experimental, or some studies will need to be made on a case-by-case basis.
all are observational. For example, if all the studies originally used continuous
Here I present formulas for converting between an scales but some had been dichotomized (into pass-fail) for
odds ratio and d, or between d and r. By combining for- reporting, a good case could be made for converting these
mulas, it is also possible to convert from an odds ratio, effects back to d for the meta-analysis. By contrast, if
via d, to r (see figure 12.2). In every case the formula for some studies originally used a dichotomous outcome and
converting the effect size is accompanied by a formula to others used a continuous outcome, the case for combining
convert the variance. the two kinds of data would be more tenuous. In either
When we convert from an odds ratio to a standardized case, a sensitivity analysis to compare the effect size from
mean difference, we then apply the sampling distribution the two (or more) kinds of studies would be important.
for d to data that were originally binary, despite the fact
that the new sampling distribution does not really apply. 12.5.1 Converting from the Log Odds Ratio to d
The same problem holds for any of these conversions.
If this approach is therefore less than perfect, it is often We can convert from a log odds ratio [ln(o)] to the stan-
better than the alternative of omitting the studies that hap- dardized mean difference d using
__
pened to use an alternate metric. This would involve loss ln(o)3
of information, and possibly the systematic loss of infor- d  _______
 , (12.32)
EFFECT SIZES FOR CONTINUOUS DATA 233

Table 12.4 Formulas for Computing r in Designs with Independent Groups

Reported Computation of Needed Quantities


(1  r2)2 r ,
r, n vr  _______
n1
, z  0.5 ln 1_____
1r ( ) 1
vz  _____
n3
_______
(1  r2)2
t, n t  n  2
2
t2
r   ________ , vr  _______
n1
, ( r ,
z  0.5 ln 1_____
1r ) 1
vz  _____
n3
(1  r ) 2 2
 r , v  _____
( , z  0.5 ln( 1_____
1r )
r )
t, r n  t2 1 r2  2,
_____ v  _______ 1
2 n1
r n3 z

r  2, v  (1 r ) 2 2
 r , v  _____
n  [t (p)] ( 1 , z  0.5 ln( 1_____
1r )
r )
p(one-tailed), r 1 2 _____ 2 _______ 1
2 n1
r n3 z

(1  r ) 2 2
1  r , v  _____
n  [ t ( __p ) ] ( _____
r )
, z  0.5 ln( _____
1r )
2 2
1 r  2, v  _______ 1
1
p(two-tailed), r 2 n1 r n3 z
2

source: Authors compilation.


note: The function t1(p) is the inverse of the cumulative distribution function of Students t with n  2
degrees of freedom. Many computer programs and spreadsheets provide functions that can be used to
compute t1. Assume n  20, so that df  18. Then, in Excel, for example, if the reported p-value is
0.05 (2-tailed), TINV(p, df)  TINV(0.05, 18) will return the required value (2.1009). If the reported
p-value is 0.05 (1-tailed), TINV(2p, df)  TINV(.10, 18) will return the required value 1.7341.

d
where  is the mathematical constant (approximately ln(o)  ___
__ , (12.34)
3
3.14159). The variance of d would then be
3vln(o) where  is the mathematical constant (approximately
vd  _____
2
, (12.33) 3.14159). The variance of ln(o) would then be

where vln(o) is the variance of the log odds ratio. This 2vd
Vln(o)  ____
3
. (12.35)
method was originally proposed by Victor Hasselblad
and Larry Hedges but other variations have also been
For example, if d  0.5000 and vd  0.0205 then
proposed (Hasselblad and Hedges 1995; see also San-
chez-Meca, Marin-Martinez, and Chacon-Moscoso 2003; 3.1415*0.5000
Whitehead 2002). ln(o)  _____________
__  0.9069, (12.36)
3
For example, if the log odds ratio were ln(o)  0.9070
with a variance of vln(o)  0.0676, then and
__
0.90703
d  ________  0.5000, 3.14152 * 0.0205
3.1416 vln(o)  ______________
3  0.0674. (12.37)
with variance
3  0.0676
vd  _________
12.5.3 Converting from r to d
3.14162  0.0205.
We convert from a correlation (r) to a standardized mean
Table 12.4 provides formulas for computing r and its vari- difference (d) using
ance from information that may be reported in a published
paper. 2r ,
d  ______
_____ (12.38)
1 r
2

12.5.2 Converting from d to the and the variance of d computed in this way (converted
Log Odds Ratio from r) is
We can convert from the standardized mean difference d 4v
vd  _______
r
. (12.39)
to the log odds ratio [ln(o)] using (1r2)3
234 STATISTICALLY DESCRIBING STUDY OUTCOMES

For example, if r  0.50 and vr  0.0058, then large enough to be important, it may be helpful to com-
pare it with the effects of related interventions that are
2  0.500
d  __________
_________ 1.1547, appropriate for the same population and have similar
1 0.500
2
characteristics (such as cost or complexity). Compendia
and the variance of d is of effect sizes from various interventions may be helpful
in making such comparisons (for example, Lipsey and
4  0.0058
vd  __________
( 1 0.502 )3  0.0550.
Wilson 1993).
Alternatively, it may be helpful to compare an effect
size to other effect sizes that are well understood con-
12.5.4 Converting from d to r ceptually. For example, one might compare an educa-
tional effect size to the effect size corresponding to one
We can convert from a standardized mean difference (d) year of growth. Even here, however, context is critical.
to a correlation (r) using For example, one years growth in achievement from
d kindergarten to grade one corresponds approximately to
r  _______
_____ , (12.40) a d of 1.0, whereas one years growth in achievement
d  a
2

from grade eleven to grade twelve corresponds to a d of


where a is a correction factor for cases where n1 n2, about 0.1.
(n  n )2 Technical methods can help with the interpretation of
a  ________
1 2
n1n2 . (12.41) effect sizes in one limited way. They can permit effect
sizes computed in one metric (such as a correlation coef-
The correction factor a is based on the ratio of n1 to n2, ficient) to be expressed in another metric (such as a
rather than the absolute values of these numbers. There- standardized mean difference or an odds ratio). Such re-
fore, if n1 and n2 are not known precisely, use n1  n2, expression of effect sizes in other metrics can make it
which will yield a  4. The variance of r computed in this easier to compare the effect with other effect sizes that
way (converted from d) is may be known to the analyst and be judged to be relevant
a 2v (see Rosenthal and Rubin 1979, 1982).
vr  ________
d
3. (12.42)
2(d  a)
12.7 RESOURCES
For example, if n1  n2, d  1.1547 and vd  0.0550,
then For a more extensive discussion of the topics covered in
this chapter, see Gene Glass, Barry McGaw, and Mary
1.1547
r  ___________
__________  0.5000, Smith (1981), Mark Lipsey and David Wilson (2001),
1.1547  4
2
Robert Rosenthal, Ralph Rosnow, and Donald Rubin
and the variance of r converted from d will be (2000), and Michael Borenstein, Larry Hedges, Julian
Higgins, and Hannah Rothstein (2009a, 2009b).
4  0.0550
2
vr  ____________)3  0.0058.
(1.1547  4
2
12.8 ACKNOWLEDGMENTS

12.6 INTERPRETING EFFECT SIZES This work was funded in part by the following grants
from the National Institutes of Health: Combining data
I have addressed the calculation of effect-size estimates types in meta-analysis (AG021360), Publication bias
and their sampling distributions, a technical exercise in meta-analysis (AG20052), Software for meta-re-
which depends on statistical theory. Equally important, gression (AG024771), from the National Institute on
however, is the interpretation of effect sizes, a process Aging, under the direction of Dr. Sidney Stahl; and For-
that requires human judgment and which is not amenable est plots for meta-analysis (DA019280) from the Na-
to technical solution (see Cooper 2008). tional Institute on Drug Abuse, under the direction of Dr.
To understand the substantive implications of an effect Thomas Hilton.
size, we need to look at the effect size in context. For My thanks to Julian Higgins for his assistance in clari-
example, to judge whether the effect of an intervention is fying some of the issues discussed in this chapter, to
EFFECT SIZES FOR CONTINUOUS DATA 235

Steven Tarlow for checking the accuracy of the formulas, Hedges, Larry V. 1981. Distribution Theory for Glasss
and to Hannah Rothstein for helping me to put it all in Estimator of Effect Size and Related Estimators. Journal
perspective. of Educational Statistics 6(2): 10728.
Hedges, Larry V., and Ingram Olkin. 1985. Statistical Mo-
dels for Meta-Analysis. New York: Academic Press.
12.9 REFERENCES Hunter, John E. and Frank L. Schmidt. 2004. Methods
of Meta-Analysis: Correcting Error and Bias in Research
Borenstein Michael. 1994. The Case for Confidence Inter- Findings, 2d ed. Newbury Park, Calif.: Sage Publications.
vals in Clinical Trials. Controlled Clinical Trials 15(5): Lipsey, Mark W., and David B. Wilson. 1993. The Efficacy
41128. of Psychological, Educational, and Behavioral Treatment:
Borenstein, Michael, Larry V. Hedges, Julian Higgins, and Confirmation of Meta-Analysis. American Psychologist
Hannah R. Rothstein. 2009a. Introduction to Meta-Analysis. 48(12): 11811209.
Chichester: John Wiley & Sons. . 2001. Practical Meta-Analysis. Thousand Oaks,
. 2009b. Computing Effect Sizes for Meta-Analysis. Calif.: Sage Publications.
Chichester: John Wiley & Sons. Rosenthal, Robert, and Donald B. Rubin. 1979. A Note
Cohen, Jacob. 1969. Statistical Power Analysis for the Beha- on Percentage of Variance Explained as a Measure of
vioral Sciences. New York: Academic Press. Importance of Effects. Journal of Applied Social Psycho-
. 1988. Statistical Power Analysis for the Behavioral logy 9(5): 39596.
Sciences, 2nd ed. New York: Academic Press. . 1982. A Simple, General Purpose Display of Ma-
Cooper, Harris. 2008. The Search for Meaningful Ways to gnitude of Experimental Effect. Journal of Educational
Express the Effects of Interventions. Child Development Psychology 74(2): 16669.
Perspectives 2: 18186. Rosenthal, Robert, Ralph L. Rosnow, and Donald B. Rubin.
Glass, Gene V. 1976. Primary, Secondary, and Meta-Analy- 2000. Contrasts and Effect Sizes in Behavioral Research:
sis of Research. Educational Researcher 5(10): 38. A Correlational Approach. Cambridge: Cambridge Univer-
Glass, Gene V., Barry McGaw, and Mary L. Smith. 1981. sity Press.
Meta-Analysis in Social Research. Thousand Oaks, Calif.: Sanchez-Meca, Julio., Fulgencio Marin-Martinez, and Sal-
Sage Publications. vador Chacon-Moscoso. 2003. Effect-Size Indices for Di-
Grissom, R. J., and J. J. Kim. 2005. Effect Sizes for Research. chotomized Outcomes in Meta-Analysis. Psychological
Mahwah, N.J.: Lawrence Erlbaum. Methods 8(4): 44867.
Hasselblad, Victor and Hedges, Larry V. 1995. Meta-Ana- Whitehead, Anne. 2002. Meta-Analysis of Controlled Clini-
lysis of Screening and Diagnostic Tests. Psychological cal Trials (Statistics in Practice). Chichester: John Wiley &
Bulletin 117(1): 16778. Sons.
13
EFFECT SIZES FOR DICHOTOMOUS DATA

JOSEPH L. FLEISS JESSE A. BERLIN


Columbia University Johnson & Johnson Pharmaceutical
Research and Development

CONTENTS
13.1 Introduction 238
13.2 Difference Between Two Probabilities 239
13.2.1 A Critique of the Difference 239
13.2.2 Inference About the Difference 239
13.2.3 An Example 239
13.3 Ratio of Two Probabilities 239
13.3.1 Rate Ratio 239
13.2 Inferences About the Rate Ratio 240
13.3 Problems with the Rate Ratio 240
13.4 Phi Coefficient 241
13.4.1 Inference About  241
13.4.2 Variance Estimation for  242
13.4.3 Problems with the Phi Coefficient 242
13.5 Odds Ratio 243
13.5.1 A Single Fourfold Table 244
13.5.2 An Alternative Analysis: (O  E ) / V 244
13.5.3 Inference in the Presence of Covariates 245
13.5.3.1 Regression Analysis 245
13.5.3.2 Mantel-Haenszel Estimator 246
13.5.3.3 Combining Log Odds Ratios 246
13.5.3.4 Exact Stratified Method 247
13.5.3.5 Comparison of Methods 247
13.5.3.6 Control by Matching 247
13.5.4 Reasons for Analyzing the Odds Ratio 248
13.5.5 Conversion to Other Measures 250

237
238 STATISTICALLY DESCRIBING STUDY OUTCOMES

13.6 Acknowledgments 251


13.7 References 251

13.1 INTRODUCTION The fourth measure,  (the phi coefficient), is defined


later.
In many studies measurements are made on binary (di- The parameters  and  are such that the value 0 indi-
chotomous) rather than numerical scales. Examples in- cates no association or no difference. For the other two
clude studies of attitudes or opinions (the two categories parameters, RR and , the logarithm of the measure is
for the response variable being agree or disagree with typically analyzed to overcome the awkward features that
some statement), case-control studies in epidemiology the value 1 indicates no association or no difference, and
(the two categories being exposed or not exposed to some that the finite interval from 0 to 1 is available for indexing
hypothesized risk factor), and intervention studies (the negative association, but the infinite interval from 1 on up
two categories being improved or unimproved or, in stud- is available for indexing positive association. With the
ies of medical interventions, experiencing a negative event parameters transformed so that the value 0 separates neg-
or not). In this chapter we present and analyze four popu- ative from positive association, we present for each the
lar measures of association or effect appropriate for cate- formula for its maximum likelihood estimator, say L, and
gorical data: the difference between two probabilities, the for its non-null standard error, say SE. (A non-null stan-
ratio of two probabilities, the phi coefficient, and the odds dard error is one in which no restrictions are imposed on
ratio. Considerations in choosing among the various mea- the parameters, in particular, no restrictions that are as-
sures are presented, with an emphasis on the context of sociated with the null hypothesis.)
research synthesis. The odds ratio is shown to be the mea- We assume throughout that the sample sizes within
sure of choice according to several statistical criteria, and each individual study are large enough for classical large-
the major portion of the discussion is devoted to methods sample methods assuming normality to be valid. The
for making inferences about this measure under a variety quantities L, SE and
of study designs. The chapter wraps up by discussing con-
verting the odds ratio to other measures that have more
w 1/SE 2 (13.4)
straightforward substantive interpretations.
Consider a study in which two groups are compared
with respect to the frequency of a binary characteristic. are then adequate for making the following important in-
Let 1 and 2 denote the probabilities in the two underly- ferences about the parameter being analyzed within each
ing populations of a subject being classified into one of individual study, and for pooling a given studys results
the two categories, and let P1 and P2 denote the two sam- with the results of others. Let C denote the value cutting
ple proportions based on samples of sizes n1 and n2. Three off the fraction  in the upper tail of the standard normal
of the four parameters to be studied in this chapter are the distribution. The interval L  C SE is, approximately, a
simple difference between the two probabilities, 100 (1  2) percent confidence interval for the parame-
ter that L is estimating. As detailed in chapter 14 of this
   1   2; (13.1) volume (see, in particular, formula 14.14), the given
studys contribution to the numerator of the pooled (meta-
the ratio of the two probabilities, or the rate ratio (referred analytic) estimate of the parameter of interest is, under
to as the risk ratio or relative risk in the health sciences), the model of fixed effects, wL, and its contribution to the
denominator is w. (In a so-called fixed-effects model for
RR  1 / 2; (13.2) k studies, the assumption is made that there is interest in
only these studies. In a random effects model, the as-
and the odds ratio, sumption is made that these k studies are a sample from a
larger population of studies. We restrict our attention to
  (1/(1 1))/(2/(1 2)). (13.3) fixed-effects analyses.)
EFFECT SIZES FOR DICHOTOMOUS DATA 239

13.2 THE DIFFERENCE BETWEEN Table 13.1 Mortality Rates from Lung Cancer
TWO PROBABILITIES (Per 100,000 Person-Years)

Exposed to Asbestos
13.2.1 A Critique of the Difference
Smoker Yes No
The straightforward difference,   1  2, is the
simplest parameter for estimating the effect on a binary Yes 601.6 122.6
variable of whatever characteristic or intervention distin- No 58.4 11.3
guishes group 1 from group 2. Its simplicity and its inter-
pretability with respect to numbers of events avoided (or source: Authors compilation.
caused) are perhaps its principal virtues, especially in the
context of research syntheses. A technical difficulty with
 is that its range of variation is limited by the magni- investigator might nevertheless choose to use  as the
tudes of 1 and 2: the possible values of  when 1 and measure of effect. The estimate of the difference is simply
2 are close to 0.5 are greater than when 1 and 2 are
close to 0 or to 1. If the values of 1 and 2 vary across D  P1  P2
the k studies whose results are being synthesized, the as-
The factor w by which a studys estimated difference is
sociated values of  may also vary. There might then be
to be weighted in a fixed-effects meta-analysis is equal to
the appearance of heterogeneity (that is, nonconstancy in
the reciprocal of its estimated squared standard error,
the measure of effect) in this scale of measurement due
1
to the mathematical constraints imposed on probabilities
rather than to substantive reasons. Empirically, two sepa- (
P1(1 P1) _________
w 1/SE 2  _________
n1
P (1 P )
 2 n2 2 ) . (13.5)
rate investigations have found that across a wide range of
meta-analyses, heterogeneity among studies is most pro- 13.2.3 An Example
nounced when using the difference as the measure of ef-
fect (Deeks 2002; Engels et al. 2000). Consider as an example a comparative study in which one
The example in table 13.1 illustrates this phenomenon. treatment was applied to a randomly constituted sample of
The values there are mortality rates from lung cancer in n1  80 subjects and the other was applied to a randomly
four groups of workers: exposed or not to asbestos in the constituted sample of n2  70 subjects. If the proportions
work place cross-classified by cigarette smoker or not unimproved are P1  0.60 and P2  0.80, the value of D is
(Hammond, Selikoff, and Seidman 1979). The asbestos 0.20 and the standard error of D is
no asbestos difference in lung cancer mortality rates for
0.60  0.40 __________
 0.20
cigarette smokers is 601.6 to 122.6, or approximately 500 (
SE  __________
80  0.80 70 )1/2
 0.0727.
deaths per 100,000 person-years. Given the mortality rate
of 58.4 per 100,000 person-years for nonsmokers exposed The resulting weighting factor is w  1/SE 2  189.2.
to asbestos, it is obviously impossible for the asbestos-no An approximate 95 percent confidence interval for  is
asbestos difference between lung cancer mortality rates D  1.96.SE. In this example, the interval is 0.20 
for nonsmokers to come anywhere near 500 per 100,000 1.96  0.0727, or the interval from 0.34 to 0.06.
person-years. Heterogeneitya difference between smok-
ers and nonsmokers in the effect of exposure to asbestos, 13.3 RATIO OF TWO PROBABILITIES
with effect measured as the difference between two
ratesexists, but it is difficult to assign biological mean- 13.3.1 Rate Ratio
ing to it.
The ratio of two probabilities, RR  1 / 2, is another
13.2.2 Inference about the Difference popular measure of the effect of an intervention. Its use
requires that one be able to distinguish between the two
In spite of the possible inappropriateness for a meta- outcomes so that one is in some sense the undesirable
analysis of the difference between two probabilities, the one and the other is in some sense the preferred one.
240 STATISTICALLY DESCRIBING STUDY OUTCOMES

The names given to this measure in the health sciences, An approximate two-sided 100(1  ) percent confi-
the risk ratio or relative risk, reflect the fact that the mea- dence interval for RR is obtained by taking the antiloga-
sure is not symmetric (that is, 1 / 2 (1  1) / (1  2)), rithms of the limits
but is instead the ratio of the first groups probability of
an undesirable outcome to the seconds probability. (In ln(rr)  C/2 SE (13.8)
fact, in the health sciences, the term rate is often reserved
for quantities involving the use of person-time, that is, the and
product of the number of individuals and their length of
follow-up. For the remainder of this chapter, the term will ln(rr)  C/2 SE. (13.9)
be used in its more colloquial sense, or simply a number
of events divided by a number of individuals.) RR is a Consider again the numerical example in section 13.2.3:
natural measure of effect for those investigators accus- n1  80, P1  0.60; n2  70 and P2  0.80. The estimated
tomed to comparing two probabilities in terms of their rate ratio is rr  0.60/0.80  0.75, so group 1 is estimated
proportionate (relative) difference, that is, their ratio. If to be at a risk that is 25 percent less than group 2s risk.
1 is the probability of an untoward outcome in the ex- Suppose that a confidence interval for RR is desired. One
perimental group and if 2 is the same in the control or first obtains the value ln(rr)  0.2877, and then obtains
comparison group, the proportionate reduction in the the value of its estimated standard error,
likelihood of such an outcome is
2  1
(
0.40
SE  ________ 0.20
 ________
80  0.60 70  0.80 )1/2
 0.01191/2  0.1091
_______
2 1 RR (13.6)
(the associated weighting factor is w  1/SE2  84.0). An
Inferences about a proportionate reduction may there- approximate 95 percent confidence interval for ln(RR)
fore be based on inferences about the associated relative has as its lower limit
risk.
ln(RRL)  0.2877 1.96  0.1091 0.5015
13.3.2 Inferences About the Rate Ratio and as its upper limit
RR is estimated simply as
ln(RRU)  0.2877  1.96  0.1091  0.0739.
rr  P1/P2 .
The resulting interval for RR extends from RRL  exp
The range of variation of rr is an inconvenient one for (0.5015)  0.61 to RRU  exp(0.0739)  0.93. The
drawing inferences. Only the finite interval from 0 to 1 is corresponding interval for the proportionate reduction in
available for indexing a lower risk in population 1, but the risk, finally, extends from 1  RRU  0.07 to 1  RRL 
infinite interval from 1 up is, theoretically, available for 0.39, or from 7 percent to 39 percent. Notice that the con-
indexing a higher risk in population 1. A corollary is that fidence interval is not symmetric about the point estimate
the value 1 (rather than the more familiar and convenient of 25 percent.
value 0) indexes no differential risk. It is standard prac-
tice to undo these inconveniences by carrying out ones 13.3.3 Problems with the Rate Ratio
statistical analysis on the logarithms of the rate ratios,
Thanks to its connection with the intuitively appealing
and by transforming point and interval estimates back to
proportionate reduction in the probability of an undesir-
the original units by taking antilogarithms of the results.
able response, the rate ratio continues to be a popular
An important by-product is that the sampling distribution
measure of the effect of an intervention in controlled
of the logarithm of rr is more nearly normal than the sam-
studies. It is also a popular and understandable measure
pling distribution of rr.
of association in nonexperimental studies. There are at
The large sample standard error of ln(rr), the natural
least two technical problems with this measure, however.
logarithm of rr, may be estimated as
If the chances are high in the two groups being compared
(
1 P1 _____
SE  _____
1 P2
nP  n P
1 1 2 2
)
1/2
(13.7)
(experimental treatment versus control or exposed versus
not exposed) of a subjects experiencing the outcome
EFFECT SIZES FOR DICHOTOMOUS DATA 241

under study, many values of the rate ratio are mathemati- Table 13.2 Underlying Probabilities Associated with
cally impossible. For example, if the probability 2 in the Two Binary Characteristics
comparison group is equal to 0.40, only values for RR in
the interval 0
RR
2.5 are possible. The possibility Y
therefore exists of the appearance emerging of study-to- X Positive Negative Total
study heterogeneity in the value of RR only because the
studies differ in their values of 2. As we will see in sec- Positive 11 12 1
tion 13.5, this kind of constraint does not characterize the Negative 21 22 2
odds ratio, a measure related to the rate ratio.
A second difficulty with RR, one that is especially im- Total 1 2 1
portant in epidemiological studies, is that it is not esti- source: Authors compilation.
mable__ from data collected in retrospective studies. Let E
and E denote exposure __ or not to the risk factor under
study, and let D and D denote the development or not of Table 13.3 Observed Frequencies in a Study
the disease under study. The rate ratio may then be ex- Cross-Classifying Subjects
pressed as the ratio of the two conditional probabilities of
Y
developing disease,
X Positive Negative Total
__ .
P(D E)
RR  _______
P(D E )
Positive n11 n12 n1
Studies that rely on random, cross-sectional sampling Negative n21 n22 n2
from the entire population and those that rely on separate Total n1 n2 n
random samples from the exposed __ and unexposed popu-
lations permit P(D|E) and P(D|E), and thus RR, to be esti- source: Authors compilation.
mated. In a retrospective study, subjects who have devel-
oped the disease under investigation, as well as subjects
who have not, are evaluated with respect to whether or frequencies in the 2  2 table cross-classifying subjects
not they had been exposed to the putative risk factor. Such categorizations on the two variables. Let the two levels of
__ X be coded numerically as 0 or 1, and let the same be
a study permits P(E |D) and P(E D R) to be estimated, but
these two probabilities are not sufficient to estimate RR. done for Y. The product-moment correlation coefficient
We will see in Section 5 that, unlike RR, the odds ratio is in the population between the two numerically coded
estimable from all three kinds of studies. Jonathan Deeks variables is equal to
presented an in-depth discussion of the various measures    
 _____________
11 22 12 21
__________
    (13.10)
of effect described here, including consideration of rate 1 2 1 2

ratio for a harmful, versus a beneficial, outcome from and its maximum likelihood estimator (assuming ran-
the same study (2002). He noted, for example that the dom, cross-sectional sampling) is equal to
apparent heterogeneity of the rate ratio can be very sensi-
tive to the choice of benefit versus harm as the outcome n n n n
 ____________
11 22 12 21 .
_________
n n n n (13.11)
measure. 1 2 1 2

Note that is closely related to the classical chi-square


13.4 PHI COEFFICIENT statistic for testing for association in a fourfold table:
2  n . . 2. If numerical values other than 0 or 1 are as-
13.4.1 Inference About signed to the categories of X or Y, the sign of may
change but not its absolute value.
Consider a series of cross-sectional studies in each of The value of for the frequencies in table 13.4 is
which the correlation coefficient between a pair of binary
135 10 15  40
random variables, X and Y, is the measure of interest.  __________________
_________________  0.13,
150  50  175  25
Table 13.2 presents notation for the underlying parame-
ters and table 13.3 presents notation for the observed a modest association.
242 STATISTICALLY DESCRIBING STUDY OUTCOMES

Table 13.4 Hypothetical Frequencies in a pling from the population giving rise to the observed
Fourfold Table sample and taking an empirical variance estimate, or di-
rectly estimating a 95 percent confidence interval from
Y the distribution of resampled estimates. Technically, the
X Positive Negative Total
steps are as follows: first, sample an observation (with
replacement) from the observed data (that is, put the sam-
Positive 135 15 150 pled observation into the new dataset, record its value,
Negative 40 10 50 then put it back so it can be sampled again); second, re-
peat this n times for a sample of n observations; third,
Total 175 25 200 when n observations have been sampled, calculate the
source: Authors compilation. statistic of interest (in this case, );
and, fourth, repeat
this entire process a large number of times, say 1,000,
generating 1,000 estimates of the value of interest. The
Yvonne Bishop, Stephen Fienberg, and Paul Holland bootstrap estimate of  is just the sample mean of the
ascribe the following formula for the large-sample stan- 1,000 estimates. A variance estimate can then be obtained
dard error of to Yule (1975, 38182): empirically from the usual sample variance of the 1,000
estimates, or a 95 percent confidence interval can be
1 1
____
___
n ( 2 ________________
2  1  __
2
( p1  p2 )( p 1  p 2)
( _________ )
p1 p 1 p2 p 2
obtained directly from the distribution of the 1,000 esti-
mates.
( p1  p2 ) ( p 1  p 2)
[ ])
2 2 1/2
Although the bootstrap is technically described in
__34 2 _________
p p  _________
p p . (13.12)
1 2 1 2 terms of the notion of sampling with replacement, it may
be easier to think of the process as generating an infi-
where p1 and p2 are the overall (marginal) row probabil-
nitely large population of individual subjects, that exactly
ities and p1 and p2 are the overall (marginal) column
reflects the distribution of observations in the observed
probabilities.
sample. Each bootstrap iteration can then be considered a
For the data under analysis,
sample from this population.
1
SE ____
____ (1.245388)1/2  0.079 (13.13)
200
13.4.3 Problems with the Phi Coefficient

13.4.2 Variance Estimation for A number of theoretical problems with the phi coefficient
limit its usefulness as a measure of association between
A straightforward large sample procedure known as the two random variables. If the binary random variables X
jackknife (Quenouille 1956; Tukey 1958) was used ex- and Y result from dichotomizing one or both of a pair of
tensively in the past and provided an excellent approxi- continuous random variables, the value of  depends
mation to the standard error of (as well as the standard strongly on where the cut points are set (Carroll 1961;
errors of other functions of frequencies (Fleiss and Davies Hunter and Schmidt 1990). A second problem is that two
1982) without the necessity for memorizing or program- studies with populations having different marginal distri-
ming a formula as complicated as the one in expression butions (for example, 1 for one study different from 1
13.12. The jackknife involves reestimating the  coeffi- for another) but otherwise identical conditional probabil-
cient (or other function) with one unit deleted from the ity structures (for example, equal values of 11 / 1 in the
sample and repeating this process for each of the units in two study populations) may have strongly unequal phi
the sample, then calculating a mean and variance using coefficients. An untoward consequence of this phenome-
these estimates. non is that there will be the appearance of study to-study
With the existence of faster and more powerful com- heterogeneity (that is, unequal correlations across stud-
puters, methods for variance estimation that are compu- ies), even though the associations defined in terms of the
tationally intensive have become possible. In particular, conditional probabilities are identical.
the so-called bootstrap approach has achieved some pop- Finally, the phi coefficient shares with other correla-
ularity (Efron and Tibshirani 1993; Davison and Hinkley tion coefficients the characteristic that it is an invalid
1997). Essentially, the bootstrap method involves resam- measure of association when a method of sampling other
EFFECT SIZES FOR DICHOTOMOUS DATA 243

Table 13.5 Hypothetical Fourfold Tables, Problems collected using any of the three major study designs:
with Phi Coefficient cross-sectional, prospective or retrospective. Consider a
bivariate population with underlying multinomial proba-
Y bilities given in table 13.2. The odds ratio, sometimes
X Positive Negative Total
referred to as the cross-product ratio, associating X and Y
is equal to
Second Study  
Positive 45 5 50   ______
11 22 .
1221 (13.17)
Negative 120 30 150
Total 165 35 200 If the observed multinomial frequencies are as dis-
played in table 13.3, the maximum likelihood estimator
Third Study of  is
Positive 90 10 100
Negative 80 20 100 n11n22
o  _____
n12n21
. (13.18)
Total 170 30 200

source: Authors compilation. Suppose that the study design calls for n1 units to be
note: Data for the original study are in table 13.4. sampled from the population positive on X, and for n2
units to be sampled from the population negative on X.
P(Y | X), the conditional probability that Y is positive
than cross-sectional (multinomial and simple random given that X is positive, is equal to 11 / 1 , and the odds
sampling are synonyms) is used. The two fourfold tables for Y being positive, conditional on X being positive, are
in table 13.5, plus the original one in table 13.4, illustrate equal to
this as well as the previous problem. The associations are
identical in the three tables in the sense that the estimated Odds (Y  X)  P(Y  X) / P(Y X)
conditional probabilities that Y is positive, given that X is  (11 /1)/(12 /1)  11 /12.
positive, are all equal to 0.90, and the estimated condi-
tional probabilities that Y is positive, given that X is nega- Similarly,
tive, are all equal to 0.80. If these tables summarized the
results of three studies in which the numbers of subjects Odds (Y  X)  P(Y  X) / P(Y X)
positive on X and negative on X were sampled in the ratio
150:50 in the first study, 50:150 in the second and 100:100  (21 /2)/(22 /2)  21 /22.
in the third, the problem becomes apparent when the
three phi coefficients are calculated. Unlike the other The odds ratio is simply the ratio of these two odds
three measures of association studied in this chapter, values (see equation 13.3),
the values of  vary across the three studies, from 0.13
Odds(Y X)  /  
to 0.11 to 0.14. The important conclusion is not that the   _____________  ______
11 12
 ______
11 22 .
1221 (13.19)
values of  are nearly equal one to another, but that the

Odds(Y X)  / 21 22

values are not identical (for further discussion of the phi


The odds ratio is therefore estimable from a study in
coefficient, see Bishop, Fienberg, and Holland 1975,
which prespecified numbers of units positive on X and
38083).
negative on X are selected for a determination of their
status on Y.
13.5 ODDS RATIO A similar analysis shows that the same is true for the
odds ratio associated with a study in which prespecified
We have seen that the phi coefficient is estimable only numbers of units positive on Y and negative on Y are se-
from data collected in a cross-sectional study and that the lected for a determination of their status on X. Because
simple difference and the rate ratio are estimable only the latter two study designs correspond to prospective
from data collected using a cross-sectional or prospective and retrospective sampling, it is clear that the odds ratio
study. The odds ratio, however, is estimable from data is estimable using data from either of these two designs,
244 STATISTICALLY DESCRIBING STUDY OUTCOMES

as well as from a cross-sectional study. Other properties


of the odds ratio are discussed in the final subsection of 13.5.1 A Single Fourfold Table
this section. Thus, for data arrayed as in table 13.3, equations 13.18
Unlike the meanings of the measures  and RR, the and 13.20 provide the point estimate of the odds ratio and
meaning of the odds ratio is not intuitively clear, perhaps the estimated standard error of its logarithm. To reduce
because the odds value itself is not. Consider, for exam- the bias caused by one or more small cell frequencies
ple, the quantity (note that neither ln(o) nor SE is defined when a cell fre-
quency is equal to zero), authors in the past have recom-
Odds (Y X)  11 / 12 . mended that one add 0.5 to each cell frequency before
proceeding with the analysis (Gart and Zweifel 1967).
The odds value is thus the ratio, in the population of For the frequencies in table 13.4, the values of the several
individuals who are positive on X, of the proportion who statistics calculated with (versus without) the adjustment
are positive on Y to the proportion who are negative. factor are o  2.25 (versus 2.27), ln (o)  0.81 (versus
Given the identities 0.82), and SE  0.4462 (versus 0.4380). The effect of
adjustment is obviously minor in this example. More re-
P(Y X)  Odds (Y X) / (1 Odds (Y X)) cently, other authors have called into question the prac-
tice of addition of 0.5 to cell counts, suggesting instead
and the complementary that other values may be preferred based on improved sta-
tistical properties (Sweeting, Sutton, and Lambert 2004).
Odds (Y X)  P(Y X) / (1 P(Y X)), For example, Michael Sweeting and his colleagues sug-
gest using a function of the reciprocal of the sample size
it is clear that the information present in Odds (Y | X) of the opposite treatment arm. That approach reduces the
is identical to the information present in P(Y | X). bias that may be associated with the use of a constant cor-
Nevertheless, the impact of for every 100 subjects nega- rection factor of 0.5. These authors also suggest Bayesian
tive on Y (all of whom are positive on X), 100  Odds analyses as yet another alternative. In the context of a
(Y | X) is the number expected to be positive on Y meta-analysis, the need to add any constant to cells with a
may be very different from that of out of every 100 sub- zero cell count has also been called into question, as there
jects (all of whom are positive on X), 100  P (Y | X) are alternative methods that yield more accurate estimates
is the number expected to be positive on Y. Once one with improved confidence intervals (Bradburn et al. 2007;
develops an intuitive understanding of an odds value, it Sweeting, Sutton, and Lambert 2004).
quickly becomes obvious that two odds values are natu-
rally contrasted by means of their ratiothe odds ratio.
The functional form of the maximum likelihood esti- 13.5.2 An Alternative Analysis: (O E ) / V
mator of  is the same for each study design (see equation In their meta-analysis of randomized clinical trials of a
13.18). Remarkably, the functional form of the estimated class of drugs known as beta blockers, Salim Yusuf and
standard error is also the same for each study design. For his colleagues used an approach that has become popular
reasons similar to those given in section 13.3.2, it is cus- in the health sciences. It may be applied to a single four-
tomary to perform statistical analyses on ln(o), the natu- fold table as well as to several tables, such as in a study
ral logarithm of the sample odds ratio, rather than on o incorporating stratification (1985). Consider the generic
directly. Tosiya Sato, however, obtained a valid confi- table displayed in table 13.3, and select for analysis any
dence interval for the population odds ratio without trans- of its four cells (the upper left hand cell is traditionally
forming to logarithms (1990). The large sample standard the one analyzed). The observed frequency in the selected
error of ln(o), due to Barnet Woolf (1955), is given by the cell is
equation
O  n11, (13.21)
(
1
SE  ___ 1
___ 1
___ 1
___
n11  n12  n21  n22 )
1/2
(13.20)
the expected frequency in that cell, under the null hypoth-
for each of the study designs considered here. esis that X and Y are independent and conditional on the
EFFECT SIZES FOR DICHOTOMOUS DATA 245

marginal frequencies being held fixed, is a measure of association identical to (O  E ) / V, the stan-
n n
dardized difference.
E  _____
1 1
n
, (13.22) Under many circumstances, given the simplicity and
theoretical superiority of the more traditional methods de-
and the exact hypergeometric variance of n11, under the scribed in the preceding and following subsections of sec-
same restrictions as above, is tion 5, there might be no compelling reason for the
n n n n
(O  E ) / V method to be used. However, more recent sim-
V  _________
1 2 1 2 .
n 2(n  1)
(13.23) ulation studies have demonstrated that, when the margins

of the table are balanced, and the odds ratio is not large, the
When  is close to unity, a good estimator of ln () is (O E)/V method performs well, especially relative to use
of a constant as a correction factor (Bradburn et al. 2006).
OE
L  ______
V
, (13.24)
13.5.3 Inference in the Presence of Covariates
with an estimated standard error of
__ Covariates (that is, variables predictive of the outcome
SE  1/V . (13.25) measure) are frequently incorporated into the analysis of
categorical data. In randomized intervention studies, they
For the frequencies in table 13.4, are sometimes adjusted for to improve the precision with
which key parameters are estimated and to increase the
135 131.25
L  ___________
4.1222  0.9097 power of significance tests. In nonrandomized studies,
they are adjusted for to eliminate the bias caused by con-
[which corresponds to an estimated odds ratio of founding, that is, the presence of characteristics associ-
exp(0.9097)  2.48], with an estimated standard error of ated with both the exposure variable and the outcome
______ variable. There are three major classes of methods for
SE 1/4.1222  0.49. controlling for the effects of covariates: regression adjust-
ment, stratification and matching. Each is considered.
In this example, what has become known in the health 13.5.3.1 Regression Analysis Let Y represent the
sciences as the Peto method, or sometimes as the Yusuf- binary response variable under study (Y1 when the
Peto method, or as the (O  E ) / V method, produces over- response is positive, Y  0 when it is negative), let X rep-
estimates of the point estimate and its standard error. resent the binary exposure variable (X  1 when the study
The method is capable of producing underestimates as unit was exposed, X  0 when it was not), and let Z1, . . . ,
well as overestimates of the underlying parameter. For Zp represent the values of p covariates. A popular statisti-
the frequencies from the two studies summarized in table cal model for representing the conditional probability of
13.5, the respective values of L are 0.6892 (yielding 1.99 a positive response as a function of X and the Zs is the
as the estimated odds ratio) and 0.7804 (yielding 2.18 as linear logistic regression model,
the estimated odds ratio). The latter estimate, coming
from a study in which one of the two sets of marginal   P(Y  1 X, Z1, . . . , Zp)
frequencies was uniform, comes closest of all (for the fre- 1
 ___________________
1 e(  X   Z    Z )
. (13.26)
quencies in tables 13.4 and 13.6) to the correct value 1 1 p p

of 2.25. Sander Greenland and Alberto Salvan showed This model is such that the logit or log odds of ,
that the (O  E ) / V method may yield biased results when
there is serious imbalance on both margins (1990). Fur-  ,
logit()  ln_____
1 
(13.27)
thermore, they showed that the method may yield biased
results when the underlying odds ratio is large, even when is linear in the independent variables:
there is balance on both margins. These findings have
been confirmed in subsequent work (Sweeting et al. 2004; logit()    X  1Z1  . . .  pZp (13.28)
Bradburn et al. 2006). Earlier, Nathan Mantel, Charles
Brown, and David Byar (1977) and Joseph Fleiss, Bruce (for a thorough examination of logistic regression analy-
Levin, and Myunghee C. Paik (2003) pointed out flaws in sis, see Hosmer and Lemeshow 2000).
246 STATISTICALLY DESCRIBING STUDY OUTCOMES

The coefficient  is the difference between the log Table 13.6 Stratified Comparison of Two Treatments
odds for the population coded X  1 and the log odds for
the population coded X  0, adjusting for the effects of Outcome
the p covariates (that is,  is the logarithm of the adjusted Treatment Success Failure Total
odds ratio). The antilogarithm, exp( ), is therefore the
odds ratio associating X with Y. Every major package of Stratum 1 Experimental 4 0 4
computer programs has a program for carrying out a lin- Control 0 1 1
ear logistic regression analysis using maximum likeli-
Total 4 1 5
hood methods. The program produces , the maximum
likelihood estimate of the adjusted log odds ratio, and SE, Stratum 2 Experimental 7 4 11
the standard error of . Inferences about the odds ratio Control 3 8 11
itself may be drawn in the usual ways. Total 10 12 22
13.5.3.2 Mantel-Haenszel Estimator When the p Stratum 3 Experimental 1 0 1
covariates are categorical (for example, sex, social class, Control 4 9 13
and ethnicity), or are numerical but may be partitioned
into a set of convenient and familiar categories (for ex- Total 5 9 14
ample, age twenty to thirty-four, thirty-five to forty-nine, source: Authors compilation.
and fifty to sixty-four), their effects may be controlled for
by stratification. Each subject is assigned to the stratum
defined by his or her combination of categories on the p Table 13.7 Quantities to Calculate the Pooled Log
prespecified covariates, and, within each stratum, a four- Odds Ratio
fold table is constructed cross-classifying subjects by
their values on X and Y. Within a study, the strata defined Stratum Ls SEs Ls/SE2s 1/SE2s
for an analysis will typically be based on the kinds of
covariates just described (for example, sex, social class, 1 3.2958 2.2111 0.6741 0.2045
2 1.3981 0.8712 1.8421 1.3175
and ethnicity). Of note, though, in performing a meta- 3 1.8458 1.7304 0.6164 0.3340
analysis, the covariate of interest is the study, that is, the
stratification is made by study and a fourfold table is con- Total 3.1326 1.8560
structed within each study. Because the detailed formulae source: Authors compilation.
for this stratified approach are presented in chapter 14 of
this volume (formulas 14.14 and 14.15), only the princi-
ples are presented here. is also presented in chapter 14 as formula 14.16 (1986;
The frequencies in table 13.6 are adapted from a com- see also Robins, Greenland, and Breslow 1986). For the
parative study reported by Peter Armitage (1971, 266). data under analysis (in table 13.6), oMH  7.3115, as above,
The associated relative frequencies are tabulated in table and SE  0.8365. An approximate 95 percent confidence
13.7. Within each of the three strata, the success rate for interval for ln() is: ln(7.3115)  1.96  0.8365, or the
the subjects in the experimental group is greater than the interval from 0.35 to 3.63. The corresponding interval for
success rate for those in the control group. The formula  extends from 1.42 to 37.7.
for the summary odds ratio is attributable to Nathan Man- 13.5.3.3 Combining Log Odds Ratios The pooling
tel and William Haenszel (1959). In effect, the stratified of log odds ratios across the S strata is a widely used
estimate is a weighted average of the within-stratum odds alternative to the Mantel-Haenszel method of pooling.
ratios, with weights proportional to the inverse of the The pooled log odds ratio is explicitly a weighted average
variance of the stratum-specific odds ratio. (For a meta- of the stratum-specific log odds ratios, with weights equal
analysis, again, the strata are defined by the studies, but to the inverse of the variance of the stratum-specific esti-
the mathematical procedures are identical.) By the for- mate of the log odds ratio.
mula presented in chapter 14 (see 14.14 or 14.15), the The reader will note, for the frequencies in table 13.6,
value of the Mantel-Haenszel odds ratio is oMH  7.31. that the additive adjustment factor of 0.5, or one of the
James Robins, Norman Breslow, and Sander Green- other options defined by Michael Sweeting, Alex Sutton,
land derived the non-null standard error of ln (oMH), which and Paul Lambert (2004), must be applied for the log
EFFECT SIZES FOR DICHOTOMOUS DATA 247

odds ratios and their standard errors to be calculable Table 13.8 Notation for Observed Frequencies Within
in strata 1 and 3. Summary values appear in table 13.7. Typical Matched Set
They
__ yield a point estimate for the log odds ratio of
L  1.6878, and a point estimate for the odds ratio of Outcome Characteristic
oL  exp(1.6878)  5.41. This estimate is appreciably less Group Positive Negative Total
than the Mantel-Haenszel estimate of oMH  7.31. As all
sample sizes increase, the two estimates converge__ to the 1 am bm tm1
same value. The estimated standard error of L is equal to 2 cm dm tm2
SE  0.7340. The (grossly) approximate 95 percent con-
Total am  cm bm  dm tm
fidence interval for  derived from these values extends
from 1.28 to 22.8. source: Authors compilation.
13.5.3.4 Exact Stratified Method The exact method
estimates the odds ratio by using permutation methods
under the assumption of a conditional hypergeometric re- equal numbers of subjects in both treatment arms, and the
sponse within each stratum (Mehta, Patel, and Gray intervention (or exposure) effects are not large, the strati-
1985). It is the stratified extension of Fishers exact test in fied version of the one-step (Peto or Yusuf-Peto) method
which the success probabilities are allowed to vary among also performs surprisingly well, actually outperforming
strata. The theoretical basis is provided by John Gart other methods, including exact methods, in situations in
(1970), but the usual implementation is provided by net- which the data are very sparse (Sweeting, Sutton, and
work algorithms such as those developed by Cyrus Mehta Lambert 2004; Bradburn et al. 2006).
for the computer program StatXact, as described in the 13.5.3.6 Control by Matching Matching, the third
manual for that program (Cytel Software 2003). and final technique considered for controlling for the ef-
13.5.3.5 Comparison of Methods When the cell fre- fects of covariates, calls for the creation of sets of sub-
quencies within the S strata are all large, the Mantel- jects similar one to another on the covariates. Unlike
Haenszel estimate will be close in value to the estimate stratification, in which the total number of strata, S, is
obtained by pooling the log odds ratios, and either method specifiable in advance of collecting ones sample, with
may be used (see Fleiss, Levin, and Paik 2003). As seen matching one generally cannot anticipate the number of
in the preceding subsection, close agreement may not be matched sets one will end up with. Consider a study with
the case when some of the cell frequencies are small. In p  2 covariates, sex and age, and suppose that the inves-
such a circumstance, the Mantel-Haenszel estimator is tigators desire close matching on age - say ages no more
superior to the log odds ratio estimator (Hauck 1989). than one year apart. If the first subject enrolled in the
Note that there is no need to add 0.5 (or any other con- study is, for example, a twelve-year-old male, then the
stant) to a cell frequency of zero in order for a stratums next male between eleven and thirteen, whether from
frequencies to contribute to the Mantel-Haenszel estima- group 1 or group 2, will be matched to the first subject. If
tor. In fact, it would be incorrect to add any constant to one or more later subjects are also males within this age
the cell frequencies when applying the Mantel-Haenszel span, they should be added to the matched set that was
method. The method must be applied to the cell count already constituted, and should not form the bases for
values directly. new sets. If M represents the total number of matched
Others have compared competing estimators of the sets, and if the frequencies within set m are represented as
odds ratio in the case of a stratified study (Agresti 1990, in table 13.8, all that is required for the inclusion of this
23537; Gart 1970; Kleinbaum, Kupper, and Morgen- set in the analysis is that tm1  1 and tm2  1. The values
stern 1982, 35051; McKinlay 1975; Sweeting et al. tm1  tm2  1 for all m correspond to pairwise matching,
2004; Bradburn et al. 2006). There is some consensus and the values tm1  1 and tm2  r for all m correspond to
that the Mantel-Haenszel procedure is the method of r:1 matching. Here, such convenient balance is not neces-
choice, though logistic regression may also be applied. In sarily assumed. The reader is referred to David Klein-
situations where data are sparse, simulation studies have baum, Lawrence Kupper, and Hal Morgenstern (1982,
demonstrated reasonable performance by the logistic re- chap. 18) or to Kenneth Rothman and Sander Greenland
gression and Mantel-Haenszel methods (Sweeting et al. (1998, chap. 10) for a discussion of matching as a strat-
2004; Bradburn et al. 2006). When there are approximately egy in study design.
248 STATISTICALLY DESCRIBING STUDY OUTCOMES

Table 13.9 Results of Study of Association Between Let D represent the number of matched pairs that are dis-
Use of Estrogens and Endometrial Cancer cordant in the direction of the case having been exposed
and the control not, and let E represent the number of
Matched Sets matched pairs that are discordant in the other direction.
am bm cm dm tm with Given Pattern The Mantel-Haenszel estimator of the underlying odds
ratio is simply oMH, matched  D/E, and the estimated stan-
1 0 0 3 4 1 dard error of ln (oMH, matched) simplifies to
1 0 1 2 4 3
1 0 0 4 5 4
DE
1 0 1 3 5 17 DE(
SE  ______ ) 1/2
. (13.30)
1 0 2 2 5 11
1 0 3 1 5 9
1 0 4 0 5 2 13.5.4 Reasons for Analyzing the Odds Ratio
0 1 0 4 5 1
0 1 1 3 5 6 The odds ratio is the least intuitively understandable mea-
0 1 2 2 5 3 sure of association of those considered in this chapter, but
0 1 3 1 5 1 it has a great many practical and theoretical advantages
0 1 4 0 5 1 over its competitors. One important practical advantage,
source: Authors compilation.
that it is estimable from data collected according to a
number of different study designs, is demonstrated in
section 13.5. A second, illustrated in section 13.5.3.1, is
The Mantel-Haenszel estimator of the odds ratio is that the logarithm of the odds ratio is the key parameter in
given by the adaptation of equation 14.15 to the notation a linear logistic regression model, a widely used model
in table 13.8: for describing the effects of predictor variables on a bi-
nary response variable. Other important features of the
amdm / tm
oMH, matched  _________ . (13.29) odds ratio will now be pointed out.

b c /t
m m m
If X represents group membership (coded 0 or 1), and
Several studies have derived estimators of the stan- if Y is a binary random variable obtained by dichotomiz-
dard error of ln(oMH, matched) in the case of matched sets ing a continuous random variable Y* at the point y, then
(Breslow 1981; Connett et al. 1982; Fleiss 1984; Robins, the odds ratio will be independent of y if Y* has the cu-
Breslow, and Greenland 1986; Robins, Greenland, and mulative logistic distribution,
Breslow 1986). Remarkably, formula 14.16, with the ap-
propriate changes in notation, applies to matched sets as P(Y*
y* X)  _____________
1
1 e(  X  y*)
, (13.31)
well as to stratified samples. The estimated standard error
of ln (oMH, matched) is identical to formula 14.16 for meta- for all y*. If, instead, a model specified by two cumulative
analyses (Robins, Breslow, and Greenland 1986). normal distributions with the same variance is assumed,
The frequencies in table 13.9 have been analyzed by the odds ratio will be nearly independent of y (Edwards
several scholars (Mack et al. 1976; Breslow and Day 1966; Fleiss 1970).
1980, 17682; Fleiss 1984). There were, for example, Many investigators automatically compare two pro-
nine matched sets consisting of one exposed case, no un- portions by taking their ratio, rr. When the proportions
exposed cases, three exposed controls and one unexposed are small, the odds ratio provides an excellent approxi-
control. For the frequencies in the table, oMH, matched  5.75, mation to the rate ratio. Assume, for example, that
ln(oMH, matched)  1.75, and the estimated standard error of P1  0.09 and P2  0.07. The estimated rate ratio is
ln (oMH, matched) is SE  0.3780. The resulting approximate
95 percent confidence interval for the value in the popula- rr  P1/P2  0.09/0.07 1.286,
tion of the adjusted odds ratio extends from exp (1.75 
1.96  0.3780)  exp (1.0091)  2.74 to exp (1.75  and the estimated odds ratio is nearly equal to it,
1.96  0.3780)  exp (2.4909)  12.07.
An important special (and simple) case is that of P (1 P ) 0.09  0.93
matched pairs. This analysis simplifies to the following. o  _________
1 2 __________
P (1 P )  0.07  0.91 1.314.
2 1
EFFECT SIZES FOR DICHOTOMOUS DATA 249

A practical consequence is that the odds ratio estimated A final property of the odds ratio with great practical
from a retrospective studythe kind of study from which importance is that it can assume any value between 0 and
the rate ratio is not directly estimablewill, in the case , no matter what the values of the marginal frequencies
of relatively rare events and in the absence of bias, pro- and no matter what the value of one of the two probabili-
vide an excellent approximation to the rate ratio. ties being compared. As we learned earlier, neither the
Two theoretically important properties of the odds ratio simple difference nor the rate ratio nor the phi coefficient
pertain to underlying probability models. Consider, first, is generally capable of assuming values throughout its
the sampling distribution of the frequencies in table 13.3, possible range. We saw for the rate ratio, for example,
conditional on all marginal frequencies being held fixed that the parameter value was restricted to the interval
at their observed values. When X and Y are independent, 0
RR
2.5 when the probability 2 was equal to 0.40.
the sampling distribution of the obtained frequencies is The odds ratio can assume any nonnegative value, how-
given by the familiar hypergeometric distribution. When ever. Given that
X and Y are not independent, however, the sampling dis-
 (1  ) 
tribution, referred to as the noncentral hypergeometric   _________
1 2 _______
 (1  ) 1.5 (1  )
1

2 1 1
distribution, is more complicated. This exact conditional
distribution depends on the underlying probabilities only when 2  0.40,   0 when 1  0 and   when
through the odds ratio: 1 1.
Consider, as another example, the frequencies in table
Pr(n11, n12, n21, n22 n1, n2n1n2, ) 13.4. With the marginal frequencies held fixed, the simple
n1!
______ n2!
______ difference may assume values only in the interval 0.17

n !n ! n !n ! 
n11

 _______________________________
11 12 21 22 , (13.32) D
0.50 (and not 1
D
1); the rate ratio with Posi-
n ! n !
_________
1
_____________________
x!(n  x)! (n x)!(n n n x)!
2 x tive as the untoward outcome may assume values only in
x 1 1 1 1
the interval 0.83
RR
2.0 (and not 0
RR
); and the
where the summation is over all permissible integers in phi coefficient may assume values only in the interval
the upper left hand cell (Agresti 1990, 6667). 0.22

0.65 (and not 1

1); but the odds ratio
The odds ratio also plays a key role in loglinear mod- may assume values throughout the interval 0

. (To
els. The simplest such model is for the fourfold table. be sure, the rate ratio may assume values throughout the
When the row and column classifications are indepen- interval 0
RR
 when Negative represents the untow-
dent, the linear model for the natural logarithm of the ex- ard outcome.)
pected frequency in row i and column j is The final example brings us back to table 13.1. We saw
earlier that there was heterogeneity in the association
ln (E(nij) )    i  j , (13.33) between exposure to asbestos and mortality from lung
cancer, when association was measured by the simple dif-
with the parameters subject to the constraints 1  2  ference between two rates:
1  2  0. In the general case, dependence is modeled
by adding additional terms to the model: Dsmokers  601.6 122.6
 479.0 deaths per 100,000 person years
ln (E(nij) )    i  j  ij , (13.34)
and
with  1j   2j   i1   i2  0. Let  11. Thanks to
these constraints,   12   21  22, and the associa- Dnonsmokers  58.4 11.3
tion parameter is related to the odds ratio by the simple
 47.1 deaths per 100,000 person years.
equation
This appearance of heterogeneous association essen-
  __14 ln(). (13.35)
tially vanishes when association is measured by the odds
Odds ratios and their logarithms represent descriptors ratio:
of association in more complicated loglinear models as 601.6  (100,000 122.6)
well (Agresti 1990, chap. 5). osmokers  ______________________
122.6  (100,000  601.6)
 4.93
250 STATISTICALLY DESCRIBING STUDY OUTCOMES

and similar pattern for referral of women (by 85 percent of


physicians) versus men (by 91 percent of physicians). On
58.4  (100,000 11.3)
ononsmokers  ____________________
11.3  (100,000  58.4)
 5.17. closer examination of the data, black women were the
least likely to be referred, with all others being about as
Whether one is considering smokers or nonsmokers, likely as each other to be referred.
the odds that a person exposed to asbestos will die of lung It is possible, however, and potentially useful, to per-
cancer are approximately five times those that a person form analyses on the odds ratio scale and then convert the
not so exposed will. In fairness, the rate ratio in this ex- results back into terms of either risk ratios or risk differ-
ample also has a relatively constant value for smokers ences (Localio, Margolis, and Berlin 2007). Such an ap-
(rr  601.6 / 122.6  4.91) and for nonsmokers (rr  58.4 / proach would be particularly useful when adjustment for
11.3  5.17), also illustrating the similarity of the values multiple covariates in a logistic regression model is called
of the odds ratio and the rate ratio for this rare outcome. for in an analysis. In the simplest case, the relationship
As a result, we see that the odds ratio is not prone to the between the two measures is captured by:
artifactual appearance of interaction across studies due to

the influence on other measures of association or effect of rr  ___________
(1  )( )
. (13.36)
1 1
varying marginal frequencies or to constraints on one or
the other sample proportion. On the basis of this and all One recommended approach, using this relationship,
of its other positive features, the odds ratio is recom- takes the estimate of the proportion in the unexposed or
mended as the measure of choice for measuring effect or untreated group (1) from the data, as the raw risk in the
association when the studies contributing to the research reference group, and takes the estimated odds ratio from
synthesis are summarized by fourfold tables. a logistic regression model. The upper and lower confi-
dence limits for the rate ratio are calculated using the
13.5.5 Conversion to Other Measures same relationship, substituting the respective values from
the confidence interval for the odds ratio. That approach,
The discussion of the choice of an effect measure is com- known as the method of substitution, has subsequently
plicated by the mentioned difficulty in the interpretation been criticized for its failure to take into account the vari-
of the odds ratio. Despite its favorable mathematical ability in the baseline risk of 1 (McNutt, Hafner, and
properties, the odds ratio remains less than intuitive. In Xue 1999; McNutt et al. 2003).
addition, though the odds ratio nicely reproduces the rate It is also possible to estimate adjusted relative risks
ratio (a more intuitive quantity) when the event of interest using alternative models, for example, log-binomial mod-
is uncommon (for example, for the data in table 13.1), it els or Poisson regression models. Variations on these
is well known to overestimate the rate ratio when the approaches, however, have been demonstrated to suffer
event of interest is more common (Deeks 1998; Altman, from theoretical, as well as practical, limitations (Localio,
Deeks, and Sackett 1998). The report on the effect of race Margolis, and Berlin 2007). As an alternative, Russell
and sex on physician referrals by Kevin Schulman and his Localio and his colleagues proposed the use of logistic
colleagues, illustrates this distortion, which generated regression models, with adjustment for all relevant co-
some controversy in the medical literature (Schulman variates, followed by conversion to the RR using the fol-
et al. 1999; see also, for example, Davidoff 1999). Pre- lowing relationship, which follows directly from equation
sented with a series of clinical scenarios portrayed on 13.26 (2007):
video recordings involving trained actors, physicians 
1 e
were asked to make a decision as to whether to refer the rr  ________
1 e
. (13.37)
given patient for further cardiac testing. Whites were re-
ferred by 91 percent of physicians, compared with Afri- Two challenges present themselves in such conver-
can Americans by 85 percent. The odds ratio of 0.6, sions. One is that equation 13.37 appears to ignore the
meaning the odds of referral for blacks are 60 percent of values of the covariates. In fact, this is a deliberate sim-
the odds of referral for whites, is a mathematically valid plification. From 13.26, it is clear that the value of the
measure of the effect of race on referrals, but the tempta- risk ratio will depend on both the value of the odds ratio
tion to think in terms of blacks being 60 percent as likely and on the value of the baseline risk. In particular, it is
to be referred clearly needs to be resisted. There was a clear that in the logistic model, the risk is specified as
EFFECT SIZES FOR DICHOTOMOUS DATA 251

depending on the values of the covariates. Thus, even grant MH45763 from the National Institute of Mental
though the value of the odds ratio is constant in the lo- Health. Arthur Fleiss kindly provided permission for
gistic model, the value of the relative rate should depend Dr. Berlin to modify this chapter for the current edition.
on the values of the covariates. In 13.37, it is assumed
that the covariates have been centered, that is, that the 13.7 REFERENCES
mean values have been subtracted. The value of the rela-
tive rate calculated in 13.37 is then the value calculated Agresti, Alan. 1990. Categorical Data Analysis. New York:
for the average member of the sample under study. One John Wiley & Sons.
can extend equation 13.37 to include specified values of Altman, Douglas G., Jonathan J. Deeks, and David L. Sac-
the covariates. This would allow, for example, calculating kett. 1998. Odds Ratios Should Be Avoided when Events
relative risks separately for males and females from a are Common. Letter. British Medical Journal 317(7168):
model in which sex is specified as being associated with 1318.
risk of the outcome. Armitage, Peter. 1971. Statistical Methods in Medical Re-
The second challenge in the conversion from the odds search. New York: John Wiley & Sons.
ratio to the relative rate (or risk difference) is variance Bishop, Yvonne M.M., Stephen E. Fienberg, and Paul W.
estimation. As noted, the bootstrap approach is a conve- Holland. 1975. Discrete Multivariate Analysis: Theory and
nient one in situations where variance estimation is other- Practice. Cambridge, Mass.: MIT Press.
wise complicated (Carpenter and Bithell 2000; Efron and Bradburn, Michael J., Jonathan J. Deeks, A. Russell Localio,
Tibshirani 1993; Davison and Hinkley 1997) and this Jesse A. Berlin. 2007. Much Ado About Nothing: A Com-
method is suggested by Localio and his colleagues (2007). parison of the Performance of Meta-Analytical Methods
They also provide a method that doesnt depend on the with Rare Events. Statistics in Medicine 26(1): 5377.
iterative bootstrap approach and can be used when fitting Breslow, Norman E. 1981. Odds Ratio Estimators When
a single logistic regression model. the Data are Sparse. Biometrika 68(1): 7384.
The availability of the conversion approach described Breslow, Norman E., and Nicholas E. Day. 1980. Statistical
here makes it all the more compelling to fit models using Methods in Cancer Research, vol. 1, The Analysis of Case-
odds ratios and then to convert the results onto the ap- Control Studies. Lyons: International Agency for Research
propriate scale to facilitate interpretation by substantive on Cancer.
or policy experts. Carpenter, James, and John Bithell. 2000. Bootstrap Confi-
In some meta-analyses, some of the component studies dence Intervals: When, Which, What? A Practice Guide
may have reported continuous versions, and others di- for Medical Statisticians. Statistics in Medicine 19(9):
chotomized versions of the outcome variables. When the 114164.
majority of meta-analytic studies have maintained the Chinn, Susan. 2000. A Simple Method for Converting an
continuous nature of the outcome variable, and only a few Odds Ratio to Effect Size for Use in Meta-Analysis. Sta-
others have dichotomized it, a practical strategy would be tistics in Medicine 19(22): 312731.
to transform the odds ratio to the d index (described in Carroll, John B. 1961. The Nature of the Data, or How to
detail in chapter 12, this volume) with one of the conver- Choose a Correlation Coefficient. Psychometrika 26(4):
sion formulae proposed and discussed in a number of pa- 34772.
pers (for example, Chinn 2000; Haddock, Rindskopf, and Connett, John, Ejigou, Ayenew, McHugh, Richard, and Bres-
Shadish 1998; Hasselblad and Hedges 1995; Snchez- low, Norman. 1982. The Precision of the Mantel-Haenszel
Meca, Marn-Martnez, and Salvador Chacn-Moscoso Estimator in Case-Control Studies with Multiple Matching.
2003; Ulrich and Wirtz 2004). The reverse strategy could American Journal of Epidemiology 116(6): 87577.
be applied when the majority of the studies present odds Cytel Software Corporation. 2003. StatXact: Cambridge,
ratios and only a few include d indices. Mass.
Davidoff, Frank. 1999. Race. Sex, and Physicians Referral
13.6 ACKNOWLEDGMENTS for Cardiac Catheterization. Letter. New England Journal
of Medicine 341(4): 28586.
Drs. Melissa Begg and Bruce Levin provided valuable Davison, Anthony C., and David V. Hinkley. 1997. Boots-
criticisms of a version of this chapter written for the pre- trap Methods and Their Application. Cambridge: Cam-
vious edition. The original work was supported in part by bridge University Press.
252 STATISTICALLY DESCRIBING STUDY OUTCOMES

Deeks, Jonathan J. 1998. When Can Odds Ratio Mislead? Hauck, Walter W. 1989. Odds Ratio Inference from Strati-
Letter. British Medical Journal 317(7166): 1155. fied Samples. Communications in Statistics 18A(2): 767
. 2002. Issues in the Selection of a Summary Statis- 800.
tic for Meta-Analysis of Clinical Trials with Binary Outco- Hosmer, David W., and Stanley Lemeshow. 2004. Applied
mes. Statistics in Medicine 21(11): 15751600. Logistic Regression, 2nd ed. New York: John Wiley & Sons.
Edwards, John H. 1966. Some Taxonomic Implications of a Hunter, John E., and Frank L. Schmidt. 1990. Dichotomi-
Curious Feature of the Bivariate Normal Surface. British zation of Continuous Variables: The Implications for Meta-
Journal of Preventive and Social Medicine 20(1): 4243. Analysis. Journal of Applied Psychology 75: 33449.
Efron, Bradley, and Rob J. Tibshirani. 1993. An Introduction Kleinbaum, David G., Lawrence L. Kupper, and Hal Mor-
to the Bootstrap. New York: Chapman and Hall. genstern. 1982. Epidemiologic Research: Principles and
Engels, Eric A., Christopher H. Schmid, Norma Terrin, In- Quantitative Methods. Belmont, Calif.: Lifetime Learning
gram Olkin, and Joseph Lau. 2000. Heterogeneity and Publications.
Statistical Significance in Meta-Analysis: An Empirical Localio, A. Russell, D. J. Margolis, Jesse A. Berlin. 2007.
Study of 125 Meta-Analyses. Statistics in Medicine 19(13): Relative Risks and Confidence Intervals Were Easily
170728. Compared Indirectly from Multivariable Logistic Regres-
Fleiss, Joseph L. 1970. On the Asserted Invariance of the sion. Journal of Clinical Epidemiology 60(9): 7482.
Odds Ratio. British Journal of Preventive and Social Me- Mack, Thomas M., Malcolm C. Pike, Brian E. Henderson,
dicine 24(1): 4546. R. I. Pfeffer, V. R. Gerkins, B. S. Arthur, and S. E. Brown.
. 1984. The Mantel-Haenszel Estimator in Case- 1976. Estrogens and Endometrial Cancer in a Retirement
Control Studies with Varying Numbers of Controls Mat- Community. New England Journal of Medicine 294(23):
ched to Each Case. American Journal of Epidemiology 126267.
120(1): 13. Mantel, Nathan, Charles Brown, and David P. Byar. 1977.
Fleiss, Joseph L., and Mark Davies. 1982. Jackknifing Tests for Homogeneity of Effect in an Epidemiologic In-
Functions of Multinomial Frequencies, with an Application vestigation. American Journal of Epidemiology 106(2):
to a Measure of Concordance. American Journal of Epide- 12529.
miology 115(6): 84145. Mantel, Nathan, and William Haenszel. 1959. Statistical
Fleiss, Joseph L., Bruce Levin, and Myunghee C. Paik. 2003. Aspects of the Analysis of Data from Retrospective Studies
Statistical Methods for Rates and Proportions, 3rd ed. New of Disease. Journal of the National Cancer Institute 22(4):
York: John Wiley & Sons. 71948.
Gart, John J., and James R. Zweifel. 1967. On the Bias of McKinlay, Sonja M. 1975. The Effect of Bias on Estima-
Various Estimators of the Logit and its Variance, with Ap- tors of Relative Risk for Pair-Matched and Stratified Sam-
plication to Quantal Bioassay. Biometrika 54(1): 18187. ples. Journal of the American Statistical Association
Gart, John J. 1970. Point and Interval Estimation of the 70(352): 85964.
Common Odds Ratio in the Combination of 2  2 Tables McNutt Louise A., Jean P. Hafner, and Xue Xiaonan. 1999.
with Fixed Marginals. Biometrika 57(3): 47175. Correcting the Odds Ratio in Cohort Studies of Common
Greenland, Sander, and Alberto Salvan. 1990. Bias in the Outcomes. Letter. Journal of the American Medical Asso-
One-Step Method for Pooling Study Results. Statistics in ciation 282(6): 529.
Medicine 9(3): 24752. McNutt, Louise A., Chuntao Wu, Xiaonan Xue, and Jean P.
Haddock, C. Keith, David Rindskopf, and William R. Sha- Hafner. 2003. Estimating the Relative Risk in Cohort Stu-
dish. 1998. Using Odds Ratios as Effect Sizes for Meta- dies and Clinical Trials of Common Outcomes. American
Analysis of Dichotomous Data: A Primer on Methods and Journal of Epidemiology 157(10): 94043.
Issues. Psychological Methods 3(3): 33953. Mehta, Cyrus, Nitin R. Patel, and Robert Gray. 1985. On
Hammond, E. Cuyler, Irving J. Selikoff, and Herbert Seid- Computing an Exact Confidence Interval for the Common
man. 1979. Asbestos Exposure, Cigarette Smoking and Odds Ratio in Several 22 Contingency Tables. Journal of
Death Rates. Annals of the New York Academy of Sciences the American Statistical Association 80(392): 96973.
330: 47390. Quenouille, M. H. 1956. Notes on Bias in Estimation. Bio-
Hasselblad, Victor, and Larry V. Hedges. 1995. Meta-Ana- metrika 43(34): 35360.
lysis of Screening and Diagnostic Tests. Psychological Robins, James, Norman Breslow, and Sander Greenland. 1986.
Bulletin 117(10: 16778. Estimators of the Mantel-Haenszel Variance Consistent in
EFFECT SIZES FOR DICHOTOMOUS DATA 253

Both Sparse Data and Large-Strata Limiting Models. Bio- for Cardiac Catheterization. New England Journal of Me-
metrics 42(2): 1123. dicine 340(8): 61826.
Robins, James., Sander Greenland, and Norman E. Breslow. Sweeting, Michael J., Alex J. Sutton, and Paul C. Lambert.
1986. A General Estimator for the Variance of the Mantel- 2004. What to Add to Nothing? Use and Avoidance of
Haenszel Odds Ratio. American Journal of Epidemiology Continuity Corrections in Meta-Analysis of Sparse Data.
124(5): 71923. Statistics in Medicine 23(9): 135175.
Rothman, Kenneth J., and Sander Greenland. 1998. Modern Tukey, John W. 1958. Bias and Confidence in Not-Quite-
Epidemiology, 2nd ed. Philadelphia, Pa.: Lippincott Williams Large Samples. Annals of Mathematical Statistics 29: 614.
and Wilkins. Ulrich, Rolf, and Markus Wirtz. 2004. On the Correlation
Snchez-Meca, Julio, Fungencio Marn-Martnez, and Sal- of a Naturally and an Artificially Dichotomized Variable.
vador Chacn-Moscoso. 2003. Effect-Size Indices for Di- British Journal of Mathematical and Statistical Psychology
chotomized Outcomes in Meta-Analysis. Psychological 57(2): 23551.
Methods 8(4): 44867. Woolf, Barnet. 1955. On Estimating the Relation Between
Sato, Tosiya. 1990. Confidence Limits for the Common Blood Group and Disease. Annals of Human Genetics
Odds Ratio Based on the Asymptotic Distribution of the 19(4): 25153.
Mantel-Haenszel Estimator. Biometrics 46: 7180. Yusuf, Salim, Richard Peto, John Lewis, Rory Collins, and
Schulman Kevin A., Jesse A. Berlin, William Harless, Jon F. Peter Sleight. 1985. Beta Blockade During and After Myo-
Kerner, Shyrl Sistrunk, Benard J. Gersh et al. 1999. The cardial Infarction: An Overview of the Randomized Trials.
Effect of Race and Sex on Physicians Recommendations Progress in Cardiovascular Diseases 27(5): 33571.
14
COMBINING ESTIMATES OF EFFECT SIZE

WILLIAM R. SHADISH C. KEITH HADDOCK


University of California, Merced University of Missouri, Kansas City

CONTENTS
14.1 Introduction 258
14.1.1 Dealing with Differences Among Studies 258
14.1.2 Weighting Studies 259
14.1.2.1 Variance and Within-Study Sample Size Schemes 259
14.1.2.2 Reliability and Validity of Measurement 260
14.1.2.3 Other Weighting Schemes 260
14.1.3 Examining Effects of Study Quality After Combining Results 261
14.2 Fixed-Effects Models 261
14.2.1 Fixed EffectsStandardized Mean Differences 264
14.2.2 Fixed EffectsCorrelations 264
14.2.2.1 Fixed Effectsz-Transformed Correlations 264
14.2.2.2 Fixed EffectsCorrelation Coefficients Directly 265
14.2.3 Fixed EffectsFourfold Tables 265
14.2.3.1 Fixed EffectsDifferences Between Proportions 266
14.2.3.2 Fixed EffectsLog Odds Ratios 267
14.2.3.3 Fixed EffectsOdds Ratios with Mantel-Haenszel 267
14.2.4 Combining Data in Their Original Metric 269
14.3 Random-Effects Models 270
14.3.1 Random EffectsStandardized Mean Differences 272
14.3.2 Random EffectsCorrelations 273
14.3.2.1 Random Effectsz-Transformed Correlations 273
14.3.2.2 Random EffectsCorrelation Coefficients Directly 273
14.3.3 Random EffectsFourfold Tables 274
14.3.3.1 Random EffectsDifferences Between Proportions 274
14.3.3.2 Random EffectsOdds and Log Odds Ratios 274
14.3.4 Combining Differences Between Raw Means 274

257
258 STATISTICALLY COMBINING EFFECT SIZES

14.4 Computer Programs 275


14.5 Conclusions 275
14.6 References 276

14.1 INTRODUCTION so the conceptual and statistical issues involved in com-


puting these combined estimates need careful attention.
In 1896, Sir Almroth Wrighta colleague and mentor
of Sir Alexander Fleming, who discovered penicillin 14.1.1 Dealing with Differences Among Studies
developed a vaccine to protect against typhoid (Susser
1977; see Roberts 1989). The typhoid vaccine was tested Combining estimates of effect size from different studies
in several settings, and on the basis of these tests the vac- would be easy if studies were perfect replicates of each
cine was recommended for routine use in the British army otherif they made the same methodological choices
for soldiers at risk for the disease. In that same year, Karl about such matters as within-study sample size, measures,
Pearson, the famous biometrician, was asked to examine or design, and if they all investigated exactly the same
the empirical evidence bearing on the decision. To do so, conceptual issues and constructs. Under the simplest
he synthesized evidence from five studies reporting data fixed-effects model of a single population effect size ,
about the relationship between inoculation status and ty- all these replicates would differ from each other only by
phoid immunity, and six studies reporting data on inocu- virtue of having used just a sample of observations from
lation status and fatality among those who contracted the the total population, so that observed studies would yield
disease. These eleven studies used data from seven inde- effect sizes that differed from the true population effect
pendent samples, with four of those being used twice, size only by sampling error (henceforth we refer to the
once in a synthesis of evidence about incidence of typhoid variance due to sampling error as conditional variance,
among those inoculated, and once in a synthesis of evi- for reasons explained in section 14.3). An unbiased esti-
dence about death among those contracting typhoid. He mate of the population effect would then be the simple
computed tetrachoric correlations for each of these eleven average of observed study effects; and its standard error
cases, and then averaged these correlations (separately for would allow computation of confidence intervals around
incidence and fatality) to describe average inoculation ef- that average. Under random-effects models, one would
fectiveness (see table 14.1). In his subsequent report of not assume that there is just one population effect size;
this research, Karl Pearson concluded that the average instead, a distribution of population effect sizes would
correlations were too low to warrant adopting the vaccine, exist generated by a distribution of possible study realiza-
since other accepted vaccines at that time routinely pro- tions (see chapter 3, this volume). Hence, observed out-
duced correlations at or above .20 to .30: I think the right comes in studies would differ from each other not just
conclusion to draw would be not that it was desirable to because of sampling error, but also because they reflect
inoculate the whole army, but that improvement of the these true, underlying population differences.
serum and method of dosing, with a view to a far higher It is rarely (if ever) the case that studies are perfect
correlation, should be attempted (1904b, 1245). replicates of each other, however, because they almost
We tell this story for three reasons. First, we discovered always differ from each other in many methodological
this example when first researching this chapter in 1991, and substantive ways. Hence, most research syntheses
and at the time it was the earliest example of what we must grapple with certain decisions about how to com-
would now call a meta-analysis. So it is historically inter- bine studies that differ from each other in multiple ways.
esting, although a few older examples have now been lo- Consider, for example, the eleven samples in table 14.1.
cated. Second, we will use the data in table 14.1 to illus- The eleven samples differ dramatically in within-study
trate how to combine results from fourfold tables. Finally, sample size. They also differ in other ways that were ex-
the study illustrates some conceptual and methodological tensively and acrimoniously debated by Karl Pearson and
issues involved in combining estimates of effect size Almroth Wright (Susser 1977). For example, both Wright
across studies, points we return to in the conclusion of this and Pearson recognized that unreliability in inoculation
chapter. In many ways, the capacity to combine results history and typhoid diagnosis attenuated sample correla-
across studies is the defining feature of meta-analysis, tions. Further, clinical diagnoses of typhoid were less reli-
COMBINING ESTIMATES OF EFFECT SIZE 259

Table 14.1 Effects of Typhoid Inoculations on Incidence and Fatality

Inoculated Not Inoculated

N Cases N Cases r Odds Ratio

Incidence
Hospital staffs 297 32 279 75 .373 3.04
Ladysmiths garrison 1,705 35 10,529 1,489 .445 7.86
Methuens column 2,535 26 10,981 257 .191 2.31
Single regiments 1,207 72 1,285 82 .021 1.07
Army in India 15,384 128 136,360 2,132 .100 1.89
Average .226

Lived Died Lived Died

Fatality Among Those


Contracting Typhoid
Hospital staffs 30 2 63 12 .307 2.86
Ladysmith garrison 27 8 1,160 329 .010 .96
Single regiments 63 9 61 21 .300 2.41
Special hospitals 1,088 86 4,453 538 .119 1.53
Various military hospitals 701 63 2,864 510 .194 1.98
Army in India 73 11 1,052 423 .248 2.67
Average .193

source: Adapted with permission from Susser (1977), adapted from Pearson (1904b).

able than autopsy diagnoses, and studies differed in which tics course will recognize the principle underlying these
diagnostic measure they used. One might reasonably con- schemes. Studies with larger within-study sample sizes
sider, therefore, whether and how to take these differences will give more accurate estimates of population parame-
into account. ters than studies with smaller sample sizes. They will be
more accurate in the sense that the variance of the estimate
14.1.2 Weighting Studies around the parameter will be smaller, allowing greater
confidence in narrowing the region in which the popula-
In general, one could deal with such study differences tion value falls. Pearson noted this dependence of confi-
either before or while combining results or subsequent to dence on within-study sample size in the typhoid inocula-
combining results. Weighting schemes that are applied at tion data, stating that many of the groups . . . are far too
the earlier points rest on three assumptions: theory or small to allow of any definite opinion being formed at all,
evidence suggests that studies with some characteristics having regard to the size of the probable error involved
are more accurate or less biased with respect to the de- (1904b, 1243; for explanation of probable error, see Cowles
sired inference than studies with other characteristics, the 1989). In table 14.1, for example, estimates of typhoid
nature and direction of that bias can be estimated prior to incidence from the army-in-India sample were based on
combining, and appropriate weights to compensate for more than 150,000 subjects, more than 200 times as many
the bias can be constructed and justified. Examples of subjects as the hospital staff sample of less than 600. The
such weighting schemes are described in the following large sample used in the army-in-India study greatly re-
three sections. duced the role of sampling error in determining the out-
14.1.2.1 Variance and Within-Study Sample Size come. Hence, all things being equal, such large-sample
Schemes Everyone who has had an undergraduate statis- studies should be weighted more heavily; doing so would
260 STATISTICALLY COMBINING EFFECT SIZES

improve Pearsons unweighted estimates. To implement articles in which they debated the issues. Sometimes rel-
this scheme, one must have access to the within-study evant coefficients can be borrowed from test manuals or
sample sizes used in each of the studies, which is almost reports of similar research. Even when no reasonable reli-
always available. This is by far the most widely accepted ability or validity estimates can be found, however, pro-
weighting scheme, and the only one to be discussed in cedures exist for integrating effect size estimates over
detail in this chapter. studies when some studies are corrected and others are
This weighting scheme fulfills all three assumptions not (Hunter and Schmidt 2004). These and related mat-
outlined previously. First, our preference for studies with ters are addressed by Frank Schmidt, Huy Le, and In-Sue
large within-study sample sizes is based on strong, widely Oh elsewhere in this book (see chapter 17, this volume),
accepted statistical theory. To wit, large-sample studies so we do not discuss them further. However, those proce-
tend to produce sample means that are closer to the popu- dures can be applied concurrently with the variance-based
lation mean, and estimating that population mean is (often) weighting schemes illustrated in this chapter (Hedges and
the desired inference. Second, the nature and direction of Olkin 1985).
uncertainty in studies that have small within-study sam- 14.1.2.3 Other Weighting Schemes No schemes other
ple sizes is cleargreater variability of sample estimates than those discussed above seem clearly to meet the three
symmetrically around the population parameter. Third, conditions previously outlined. Hence, other schemes
statistical theory suggests very specific weights (to be de- should be applied with caution and viewed as primarily
scribed shortly) that are known to maximize the chances exploratory in nature (Boissel et al. 1989, see also chap-
of correctly making the desired inference to the popula- ter 7, this volume). Perhaps the most frequent contender
tion parameter. We know of only one other weighting for an alternative scheme is weighting by study quality
scheme that so unambiguously meets these three condi- (Chalmers et al. 1981). The latter term, of course, means
tions, which we now describe. different things to different people. For example, when
14.1.2.2 Reliability and Validity of Measurement If treatment effectiveness is an issue, one might weight
studies were perfect replicates, they would measure ex- treatment outcome studies according to aspects of design
actly the same variables. This is rarely the case. As a re- that, presumably, facilitate causal inference. Of all such
sult, each different measure, both within and between design features, random assignment of units to conditions
studies, can have different reliability and validity. For ex- has the strongest justification, so it is worth examining it
ample, some studies in table 14.1 used incidence, some against our three assumptions. Strong theory buttresses
used mortality, and others used both measures; and inci- the preference we give to random assignment in treatment
dence and mortality were measured different ways in dif- outcome studies, so it meets the first criterion. But it does
ferent studies. The differential reliability and validity of not clearly meet the other two. We know only that esti-
these measures was a source of controversy between mates from nonrandomized studies may be biased; we
Pearson and Wright. Fortunately, the effects of reliability rarely know the direction or nature of the bias. In fact, the
and validity on assessments of study outcomes are well latter can vary considerably from study to study (Heins-
specified in classical measurement theory. In the simplest man and Shadish 1996; Lipsey and Wilson 1993; Shadish
case, unreliability of measurement attenuates relation- and Ragsdale 1996). Further, theory does not suggest a
ships between two variables, so that univariate correla- very specific weighting scheme that makes it more likely
tion coefficients and standardized mean difference statis- we will obtain a better estimate of the population param-
tics are smaller than they would be with more reliable eter. We might assign all randomized studies somewhat
measures. In multivariate cases, the effects of unreliabil- greater weight than all quasi-experiments. But even this
ity and invalidity can be more complex (Bollen 1989), is arguable because it may be that we should assign all-
but even then they are well known, and specific correc- or-none weights to this dichotomy: If quasi-experiments
tions are available to maximize the chances of making the are biased, allowing them any weight at all may bias the
correct population inference (Hedges and Olkin 1985; combined result in unknown ways. But this dichotomous
Hunter and Schmidt 2004). (0, 1) weighting scheme is not really a weighting scheme
However, unlike within-study sample sizes, information at all because applying it is equivalent to excluding all
about reliability and validity coefficients often will not be quasi-experiments in the first place. Given our present
available in reports of primary studies. Neither Pearson lack of knowledge about random versus nonrandom as-
nor Wright, for example, reported these coefficients in the signment related to our second and third conditions for
COMBINING ESTIMATES OF EFFECT SIZE 261

using weighting schemes, we think it is best to examine where wi is a weight assigned to the ith study that is de-
differences between randomized and quasi-experiments fined in equation 14.2. Equation 14.1 can be adapted, if
after combining effect sizes rather than before. Other qual- desired, to include a quality index for each study, where
ity measures, such as whether the experimenter is blind to qi is the ith studys score on the index, perhaps scaled
conditions, can be even more equivocal than random as- from 0 to 1:
signment and combining such diverse quality measures k

into quality scales can compound such problems if the __ q w T i i i

T  ______
i 1
scales are not carefully constructed. For example, a re- k
.
cent study applied several different quality scales to the q w
i 1
i i
same meta-analytic data set, and found that the overall
results varied considerably depending upon which scale When all observed effect size indicators estimate a single
was used to do the weighting (Jni et al. 1999). Because population parameter,
__ as is hypothesized under a fixed-
all scales claimed to be measuring the same construct effects model, then T is an unbiased estimate of the popu-
(namely quality), and none is generally acknowledged to lation parameter . In equation 14.1, Ti may be estimated
be superior, we cannot know which result to believe. by any specific effect size statistic, including the stan-
dardized mean difference statistic d (for which the rele-
14.1.3 Examining Effects of Study Quality vant population parameter is ), the correlation coefficient
After Combining Results r (population parameter ) or its Fisher z transformation
(population parameter  ), the odds ratio o (population pa-
Weighting schemes that are a function of either within-
rameter ), the log odds ratio l (population parameter ),
study sample size or reliability and validity can routinely
the difference between proportions D (population param-
be used before combining studies. However, we cannot
eter ), or even differences between outcomes measured
recommend that other weighting schemes be used rou-
in their original metric. __
tinely before combining results: We do not yet know if
The weights that minimize the variance of T are in-
they generate more accurate inferences, and they can
versely proportional to the conditional variance (the square
generate less accurate inferences. However, the relation-
of the standard error) in each study:
ship of effect size to such scales should be freely explored
after combining results, and results with and without such 1.
wi  __ (14.2)
vi
weighting can be compared. In fact, such explorations are
one of the few ways to generate badly needed informa- Formulas for the conditional variances, vi vary for dif-
tion about the nature and direction of biases introduced ferent effect size indices, so we present them separately
by the many ways in which studies differ from each other. for each index in later sections. However, they have in
Methods for such explorations are the topics of other common the fact that conditional variance is inversely
chapters in this book. proportional to within-study sample sizethe larger the
sample, the smaller the variance, so the more precise the
14.2 FIXED-EFFECTS MODELS estimate of effect size should be. Hence, larger weights
Suppose that the data to be combined arise from a series are assigned to effect sizes from studies that have larger
of k independent studies, in which the ith study reports within-study sample sizes. Of course, equation 14.2 de-
one observed effect size Ti, with population effect size i, fines the optimal weights assuming we know the condi-
and variance vi. Thus, the data to be combined consist of tional variances, vi, for each study. In practice we must
k effect size estimates T1, . . . , Tk of parameters 1, . . . , k, estimate those variances (vi), so we can only estimate
and variances v1, . . . , vk. Under the fixed-effects model, these optimal weights (w i).
we assume 1  . . .  k  , a common effect size. Then Given the use__ of weights as defined in 14.2, the aver-
a general formula for the weighted average effect size age effect size T has conditional variance v, which itself
over those studies is is a function of the conditional variances of each effect
k size being combined:
__ w T i i
1 .
T  i_____
1
k , (14.1) v  ______
k (14.3)
wi
i 1
(1/vi)
i 1
262 STATISTICALLY COMBINING EFFECT SIZES

If the quality-weighted version of 14.1 is used to com- ification assumes that study effect sizes may vary from
pute the weighted average effect size, where the weights each other not only due to sampling error and fixed co-
wi are defined in 14.2, then the conditional variance of the variates, but also due to potentially random effects. The
quality-weighted average effect size is latter assumption seems counter to the assumptions inher-
k ent to fixed-effects models.
q i
2 wi A test of the assumption of equation 14.1 that studies
v  _______
i 1
. do, in fact, share a common population effect size uses
( )
k 2
qi wi the following homogeneity test statistic:
i 1
k __ k __
This simplifies to 14.3 if qi  1 for all iin effect, not Q  [(Ti  T )2/vi ]  wi (Ti  T )2. (14.6)
weighting for quality. i 1 i 1

The square root of v is the standard error of estimate of A computationally convenient form of 14.6 is
the average effect size. Multiplying the standard error by
( ) .
k 2

an appropriate critical value C__, and adding and subtract- k wiTi


ing the resulting product to T, yields the (95 percent) Q  w T  i i
2 i 1
_______
k
confidence interval (L ,U) for : i 1
w i 1
i

__ __
L  T  C(v)1/2, U  T  C(v)1/2. (14.4) If Q exceeds the upper-tail critical value of chi-square
at k 1 degrees of freedom, the observed variance in
C is often the unit normal C  1.96 for a two-tailed study effect sizes is significantly greater than we would
test at   .05; but better Type I error control and better expect by chance if all studies shared a common
__ popula-
coverage probability of the confidence interval may occur tion effect size. If homogeneity is rejected, T should not be
if C is Students t-statistic at k 1 degrees of freedom. In interpreted as an estimate of a single effect parameter 
either case, if the confidence interval does not contain that gave rise to the sample observations, but rather sim-
zero, we reject the null hypothesis that the population ply as describing a mean the of observed effect sizes, or
effect size  is zero. Equivalently, we may test the null as estimating a mean which may, of course, be of
hypothesis that   0 with the statistic practical interest in a particular research synthesis. If Q is
__ rejected, the researcher may wish to disaggregate study
T 
Z  _____
,
(14.5) effect sizes by grouping studies into appropriate catego-
(v )1/2

__ ries until Q is not rejected within those categories, or may
where | T| is the absolute value of the weighted average use regression techniques to account for variance among
effect__size over studies (given by equation 14.1). Under the i. These latter techniques are discussed in subsequent
14.5, T differs from zero if Z exceeds 1.96, the 95 percent chapters, as are methods for conducting sensitivity analy-
two-tailed critical value of the standard normal distribu- ses that help examine the influence of particular studies
tion. on combined effect size estimates and on heterogeneity.
The conditional variances used to compute 14.2 and Alternatively, we will see later that one can incorporate
estimates in subsequent equations must be estimated from this heterogeneity into ones model using random effects
the available data. Given that those conditional variances rather than fixed-effects models. For example, when the
are estimated with uncertainty, some have suggested mod- heterogeneity cannot be explained as a function of known
ifying estimates of significance tests and confidence in- study-level covariates (characteristics of studies such as
tervals to take into account the uncertainty of estimation type of treatment or publication status, or some aggre-
(see, for example, Hartung and Knapp 2001). The latter gate characteristic of patients within studies such as aver-
authors have suggested this in the context of random- age age), or when fixed-effects covariate models are not
effects models, but others have suggested its use more feasible because sufficient information is not reported in
generally. We present this procedure later in the chapter primary studies, random-effects models help take hetero-
under random-effects models, in part because the partic- geneity into account when estimating the average size of
ular modification involves a weighted estimate of the effect and its confidence interval (Hedges 1983; DerSi-
random-effects variance first introduced in that section as monian and Laird 1986). Essentially, in the presence of
equation 14.20, butmore importantbecause the mod- unexplained heterogeneity, the random-effects test is more
COMBINING ESTIMATES OF EFFECT SIZE 263

conservative than the fixed-effects test, for reasons ex- ance is large. In addition the magnitude of Q also depends
plained later (Berlin et al. 1989). Finally, sometimes a on which effect size metric is used, presenting a problem
move to random-effects models is suggested for purely for comparing heterogeneity across meta-analyses that
theoretical reasons, as when levels of the variable are use different effect size measures.
thought to be sampled from a larger population to which In recognition of some of these issues, Q is usefully
one desires to generalize (Hunter and Schmidt 2004). supplemented by the descriptive statistic:
In both the fixed- and random-effects cases, however,
Q  (k 1)
Q is a diagnostic tool to help researchers know whether
they have, to put it in the vernacular, accounted for all the
(
I 2 100%* _________
Q
. )
variance in the effect sizes they are studying. Experience Negative values of I 2 are set to zero. I 2 is the proportion
has shown that Q is usually rejected in most simple, of total variation in the estimates of treatment effects that
fixed-effects univariate analysesfor example, when the is due to heterogeneity rather than to chance (Higgins and
researcher simply lumps all studies into one category or Thompson 2002; Higgins et al. 2003). Values of I 2 are not
contrasts one category of studies, such as randomized ex- affected by the numbers of studies or the effect size met-
periments, with another category of studies, such as quasi- ric. Julian Higgins and Simon Thompson suggested the
experiments. In such simple category systems, rejection following approximate guidelines for interpreting this
of homogeneity makes eminent theoretical sense. Each statistic: I 2  25% (small heterogeneity), I 2  50% (me-
simple categorical analysis can be thought of as the re- dium heterogeneity), and I 2  75% (large heterogeneity)
searchers theoretical model about what variables account (2002).
for the variance in effect sizes. We would rarely expect In addition, Higgins and Thompson presented equa-
that just one variable, or even just two or three variables, tions to compute confidence intervals around I 2, whereby
would be enough to account for all observed variance. homogeneity is rejected if the confidence interval does
The phenomena we study are usually far more complex not include zero, as an alternative to the Q-test for homo-
than that. Often, extensive multivariate analyses are re- geneity (2002). However, the sampling distribution of I 2
quired to model these phenomena successfully. In essence, is derived from the sampling distribution of Q, so it must
then, the variables that a researcher uses to categorize or have exactly the same power characteristics as Q does, as
predict effect sizes can be considered to be the model that has been shown empirically (Huedo-Medina et al. 2006).
the researcher has specified about what variables gener- Consequently, we recommend using I 2 as a descriptive
ated study outcome. The Q statistic tells whether that statistic without the confidence intervals to supplement Q
model specification is statistically adequate. Thus, though rather than to replace Q.
homogeneity tests are not without problems, they can In the following subsections, we use the equations just
serve a valuable diagnostic function (Bohning et al. 2002; developed to show how to combine standardized mean
Boissel et al. 1989; Fleiss and Gross 1991; Spector and differences, correlations, various results from fourfold ta-
Levine 1987). bles, and raw mean differences. These equations hinge on
The power of the Q test for homogeneity has been dis- knowing the conditional variance of the effect-size esti-
cussed often (Hardy and Thompson 1998; Harwell 1997; mates, for use in equations 14.2 and 14.3. Sometimes
Jackson 2006; Takkouche et al. 1999). Q tests a very gen- alternative estimators of conditional variance or of homo-
eral omnibus hypothesis that all possible contrasts are geneity are available that might be preferable in some
zero. For that omnibus hypothesis, Q is the most powerful situations to those we use (DerSimonian and Laird 1986;
unbiased test possible. Larry Hedges and Therese Pigott Guo et al. 2004; Hedges and Olkin 1985; Knapp, Big-
provided formulas for computing the power of Q (2001). gerstaff, and Hartung 2006; Kulinskaya et al. 2004; Paul
These formulas show that the power of the test based on and Thedchanamoorthy 1997; Reis, Kirji, and Afifi 1999).
Q depends on the noncentrality parameter and that the The estimates we present should prove satisfactory in most
test will have low power whenever the noncentrality pa- situations, however. In addition, equations 14.1 to 14.6
rameter is small. The noncentrality parameter is basically can usually be used in the meta-analysis of other effect-
k times the ratio of between study variance in effect pa- size indicators as long as a conditional variance for the
rameters to sampling error variance. Thus you tend to get estimator is known. For example, Stephen Raudenbush
low power when the number of studies (k) is small, be- and Anthony Bryk presented conditional variances for the
tween study variance is small, and the sampling error vari- meta-analysis of standard deviations and logits (2001), and
264 STATISTICALLY COMBINING EFFECT SIZES

Jonathan Blitstein and his colleagues (2005) did the same Table 14.2 Computational Details of Standardized
for the meta-analysis of intraclass correlations. Mean Difference Example

Study di vi wi wi di wi di2
14.2.1 Fixed EffectsStandardized
Mean Differences 1 0.03 0.01563 64.000 1.9200 0.0576
2 0.12 0.02161 46.277 5.5532 0.6664
Suppose that the effect-size statistic to be combined is the 3 0.14 0.02789 35.856 5.0199 0.7028
standardized mean difference statistic d: 4 1.18 0.13913 7.188 8.4813 10.0080
__ __
Y iT  Y iC 5 0.26 0.13616 7.344 1.9095 0.4965
di  _______
si , 6 0.06 0.01061 94.260 5.6556 0.3393
7 0.02 0.01061 94.260 1.8852 0.0377
where __YiT is the mean of the treatment group in the ith 8 0.32 0.04840 20.661 6.6116 2.1157
study, YiC is the mean of the control group in the ith study, 9 0.27 0.02690 37.180 10.0387 2.7104
and si is the pooled standard deviation of the two groups. 10 0.80 0.06300 15.873 12.6982 10.1586
A correction for small sample bias can also be applied 11 0.54 0.09120 10.964 5.9208 3.1972
(Hedges and Olkin 1985, 81). If the underlying data are 12 0.18 0.04973 20.109 3.6196 0.6515
13 0.02 0.08352 11.973 0.2395 0.0048
normal, the conditional variance of di is estimated as 14 0.23 0.08410 11.891 2.7348 0.6290
nT  nC d2 15 0.18 0.02528 39.555 7.1200 1.2816
vi  _______
i
n Tn C
i
 _________
i
2(n T  n C )
, (14.7) 16 0.06 0.02789 35.856 2.1514 0.1291
i i i i
17 0.30 0.01932 51.757 15.5271 4.6581
where niT is the within-study sample size in the treatment 18 0.07 0.00884 113.173 7.9221 0.5545
group of the ith study, niC is the within-study sample size 19 0.07 0.03028 33.029 2.3121 0.1618
in the comparison group of the ith study, and di estimates Sum 751.206 45.3300 38.5606
the population parameter .
Consider how this applies to the nineteen studies of the source: Authors compilation.
effects of teacher expectancy on pupil IQ in data set II.
Table 14.2 presents details of this computation. (Through-
out this chapter, results will vary slightly depending on variation in effect sizes is due to heterogeneity. We might
the number of decimals carried through computations; therefore assume that other variables could be necessary to
this can particularly affect odds ratio statistics, especially explain fully the variance in these effect sizes; subsequent
when some__proportions are small). The weighted average chapters in this book will explore such possibilities.
effect size d  45.3303/751.2075  0.06 with a variance
of v 1/751.205  0.00133, and standard error  (1/751. 14.2.2 Fixed EffectsCorrelations
2075)1/2  0.03649. We test the significance of this effect
14.2.2.1 Fixed Effectsz-Transformed Correlations
size in either of two equivalent ways: by computing the
For the correlation coefficient r, some authors recommend
95 percent confidence interval, which in this case ranges
first converting each correlation by Sir Ronald Fishers
from L  [(.06)1.96(.036)]  .01 to U  [(.06)1.96
variance stabilizing z transform (1925):
(.036)].13, which includes __ zero in the confidence inter-
val, and the statistic Z   d/(v)1/2  0.06/0.036 1.65, zi  .5{ln[(1 ri)/(1 ri)]},
which does not exceed the 1.96 critical value at 0.05.
Hence we conclude that teacher expectancy has no sig- where ln(x) is the natural (base e) logarithm of x. The cor-
nificant effect on student IQ. However, homogeneity of relation parameter corresponding to zi is
effect size is rejected in this data, with the computational
version of equation 14.6, yielding Q  [38.56  (45.33)2/ i  .5{ln[(1 i)/(1 i)]}.
751.21]  35.83, which exceeds 34.81, the 99 percent
critical value of the chi-square distribution for k 1  If the underlying data are bivariate normal, the condi-
19 1 18 degrees of freedom, so we reject homogeneity tional variance of zi is
of effect sizes at p  .01. Similarly, I 2  100%  [35.83  1 ,
(19 1)]/35.83  49.76%, suggesting that nearly half the vi  _______
(n  3) (14.8)
i
COMBINING ESTIMATES OF EFFECT SIZE 265

where ni is the within-study sample size of the ith study. where ri is the single study correlation and ni is the within-
Note that this variance depends only on within-study study sample size in that study. Note that this variance
sample size and not on the correlation parameter itself, a depends not only on within-study sample size, but also on
desirable property; and when correlations are small the size of the original correlation. Use of 14.1 to 14.6 is
(r  .5), zi is close to ri so the variance of ri is close to the straightforward. Using data set_ III, we obtain an estimate
variance of zi. Using this variance _
estimate in equations of the common correlation
_
of r  337.00/847.1800.40,
14.1 through 14.6, an estimate z and a confidence interval with a variance of v 1/847.18  0.0012, yielding a stan-
bounded by L and U are _obtained directly. Interpretation dard error of .0344. This average_ correlation is signifi-
of results is facilitated if z, L, and U__ are converted back cantly different from zero [Z| r|/(v)1/2  0.40/0.035 
again to the metric of a correlation, , via the inverse of 11.43], with the limits of the 95 percent confidence interval
the r to z transform given by being L  0.40  (1.96)  0.035  0.33 and U  0.3978 
(1.96)  0.035  0.47. Homogeneity of effect size is not
(z)  (e 2z 1)/(e 2z 1), (14.9) rejected [Q  159.687  (337.00)2/847.18  25.63; df 
where e x  exp(x) is the exponential (antilog) function, k 1  19; p  .14]; a value of Q this large would be ex-
yielding both an estimate of  and also upper and lower pected about 15 percent of the time if all the population
bounds (L, U) around that estimate. correlations were identical. Here, I 2  100%  [25.63
As an example, data set III reported correlations from (20 1)]/25.63  25.87%, suggesting a quarter of the
twenty studies between overall student ratings of the in- variation in effect sizes is due to heterogeneity.
structor and student achievement. We obtain an average z Two points should be noted in comparing these two
_ methods of combining correlations. First, both methods
transform of z  201.13/530.00  0.38, and a variance v 
(1/530.00)  0.0019, the square root of which (0.044) is yield roughly the same point estimate of the average
the standard error. This average effect size is significantly correlation, a result noted elsewhere (Field 2005; Law
_ 1995). However, few statisticians would advocate the use
different from zero since Zz/(v)1/20.38/0.0448.64,
which exceeds the critical value of 1.96 for   0.05 in the of untransformed correlations unless sample sizes are
standard normal distribution. The limits of the 95 percent very large because standard errors, confidence intervals,
confidence interval are L  [0.38 1.96(0.044)]  0.294 and homogeneity tests can be quite different. The present
and U  [0.38 1.96(0.044)]  0.466. For interpretability, data illustrate this, where the probability associated with
we can use equation (14.9) to transform the zs back into the Q test for the untransformed correlations was only
_ about 40 percent as large as the probability associated
estimates
__
of correlations, because 2 z  2(0.38)  0.76,
  (e0.76 1)/(e0.76 1)  0.36, where e.76  2.14. Trans- with the z-transformed correlations. The reason is the
forming the upper and lower bounds using the same equa- poor quality of the normal approximation to the distribu-
tion yields L  0.287 and U  0.434. Hence we conclude tion of r, particularly when the population correlation is
that the better the ratings of the instructor, the better is relatively large. Second, the approximate variances in
student achievement. Homogeneity of correlations is not 14.8 and 14.10 assume bivariate normality to the raw
rejected [Q  97.27  (201.13)2/530.00  20.94, df  19, measurements; however, there should be a good deal of
p  0.338]. A Q value this large would be expected be- robustness of the distribution theory, just as there is for
tween 30 and 35 percent of the time if all the population the r distribution. A similar argument applies to the ro-
correlations were identical. In this case, I 2  100%  bustness of variance of the standardized mean difference
[20.94  (20 1)]/20.94  9.26%, suggesting very little statistic under nonnormality.
of the variation in effect sizes is due to heterogeneity.
14.2.2.2 Fixed EffectsCorrelation Coefficients Di- 14.2.3 Fixed EffectsFourfold Tables
rectly Other authors argue that the average z transform is
positively biased, so they prefer combining correlations Fourfold tables report data concerning the relationship be-
without z transform (see, for example, Hunter and Schmidt tween two dichotomous factorssay, a risk factor and a
2004). In that case, if the underlying data are bivariate disease or treatment-control status and successful outcome
normal, the variance of the correlation coefficient, r, is at posttest. The dependent variable in such tables is always
approximately either the number of subjects or the percentage of subjects
in the cell. Table 14.3 presents such a fourfold table con-
vi  (1 ri2 )2 / (ni 1), (14.10) structed from the data in the first row of table 14.1the
266 STATISTICALLY COMBINING EFFECT SIZES

Table 14.3 Fourfold Table from Pearsons Hospital odds ratio using this formula for the data in table 14.3
Staff Incidence Data yields oi  (0.892/1.08)/(0.731/0.269)  3.04, the same as
before.
Condition of Interest From either odds ratio, we may also define the log odds
Group Immune Diseased All
ratio li  ln(oi), where ln is the natural logarithm and is
equal to li  ln(3.04)  1.11 for the data in table 14.3. An
Inoculated A  265 B  32 M1  297 advantage of the log odds ratio is that it takes on a value
Not Inoculated C  204 D  75 M0  279 of zero when no relationship exists between the two fac-
tors, yielding a similar interpretation of a zero effect size
Total N1  469 N0  107 T  576 as r and d.
source: Authors compilation. We may also compute the difference between propor-
tions Di  pi1  pi2, which in table 14.3 yields Di  0.892 
0.731  0.161, interpretable as the actual gains expected
incidence data for the hospital staff sample. In table 14.3, in terms of the percentage of patients treated. Rebecca
the risk factor is immunization status and the disease is DerSimonian and Nan Laird (1986) and Jesse Berlin and
typhoid incidence. We could construct a similar fourfold colleagues (1989) discussed the relative merits of these
table for each row in table 14.1. From such primary data, various indicators. Use of these statistics in meta-analysis
various measures of effect size can be computed. has been largely limited to medicine and epidemiology,
One measure is the odds ratio, which may be defined in but should apply more widely. For example, psycho-
several equivalent ways. One formula depends only on therapy researchers often report primary data about per-
the four cell sizes A, B, C, and D (see table 14.3); centage success or failure of clients in treatment or control
groups. Analyzing the latter data as standardized mean
AD .
oi  ___
BC difference statistics or correlations can lead to important
errors that the methods presented in this section avoid
If any cell size is zero, then 0.5 should be added to all (Haddock, Rindskopf, and Shadish 1998).
cell sizes for that sample; however, if the total sample size 14.2.3.1 Fixed EffectsDifferences Between Pro-
is small, this correction may not work well. This version portions Some studies report the proportion of the
of the odds ratio is sometimes called the cross-product treatment (exposed) group with a condition (a disease,
ratio because it depends on the cross-products of the table treatment success, status as case or control) and the pro-
cell sizes. Computing this odds ratio based on the data in portion of the comparison (unexposed) group with that
table 14.3 yields oi  3.04, suggesting that the odds that condition. The difference between these proportions,
inoculated subjects were immune to typhoid were three sometimes called a rate difference, is Di  pi1 pi2. The
times the odds for subjects not inoculated. The same odds rate difference has a variance of
ratio can also be computed from knowledge of the pro- p (1 p ) p (1 p )
i2 ,
portion of subjects in each condition of a study who have vi  _________
i1
ni1
i1
 _________
i2
ni2 (14.12)
the disease:
p (1 p )
where all terms are defined as in equation 14.11 and nij is
oi  _________
i1 i2 ,
p (1 p ) (14.11) the number of subjects in condition j of study i. Use of
i2 i1
14.1 to 14.6 is again straightforward.
where pij is the proportion of individuals in condition j of Using these methods to combine rate differences for
study i who have the disease. In the incidence data in the incidence data
__ in table 14.1, we find that the average
table 14.1, for example, if one wants to assess the odds of rate difference D 18811.91/1759249  .011, with __ a stan-
being immune to typhoid depending on whether the sub- dard error of .00075. The significance test Z   D/(v)1/2 
ject was inoculated, p11 is the proportion of subjects in the 14.67, which exceeds the 1.96 critical value at   0.05.
inoculation condition of the hospital staff study who were The 95 percent confidence interval is bounded by L 
immune to typhoid ( p11 265/297  0.8923); and p32 is 0.0111.96  0.00075  0.0095 and U  0.0111.96 
the proportion of subjects in the noninoculated condition 0.00075  0.012; that is, about 1 percent more of the
of the Methuens column study who remained immune inoculated soldiers would be expected to be immune
to typhoid ( p32 10724/10981 0.9766). Computing the to typhoid compared with the noninoculated soldiers.
COMBINING ESTIMATES OF EFFECT SIZE 267

However, homogeneity of effect size is rejected, with where all terms are defined in equation 14.11, and nij
Q  763.00  (18812)2/1759249  561.84, which exceeds is the number of subjects in condition j of study i. In
14.86, the 99.5 percent critical value of the chi-square table 14.1, for example, n11  297 is the number of sub-
distribution at 5  1  4 df, so we reject homogeneity of jects in the inoculated condition of the hospital staff
rate differences over these five studies at significance study; and n23  10,981 is the number of subjects in the
level .005. The I 2  100%  [561.84  (5  1)]/561.84  noninoculated condition of the Methuens column study.
99.29%, suggesting nearly all the variation in effect sizes Using this variance estimate in equations 14.1 through
is due to heterogeneity. 14.6 is again straightforward. Subsequently, if the re-
This example points to an interesting corollary prob- searcher wants to report final results in terms of the
lem with the weighting scheme used in this chapter (this mean odds ratio over studies,_ this is done by transforming
problem and its solutions apply equally to combining the average
__
log odds
_ ratio, l, back into an average odds
other effect size indicators such as d or r). The incidence ratio o  antilog(l), and similarly by reporting upper
data in table 14.1 come from studies with vastly discrep- and lower bounds as odds ratios, L  antilog(L) and
ant within-study sample sizes. In an analysis in which U  antilog(U).
weights depend on within-study sample size, the influ- For example, using the incidence_ data in table 14.1, we
ence of the one study with over 150,000 subjects can find the average log odds _
ratio l 188.005/230.937 
dwarf the influence of the other studies, even though some 0.814, with variance v 1/230.937  0.0043, yielding _
of their sample sizesat more than 10,000 subjectsare a standard error of .066. The significance test Z   l/
large enough that these studies may also provide reliable (v)1/2  12.33, which exceeds the 1.96 critical value at  
estimates of effect. One might bound weights to some a 0.05; and the 95 percent confidence interval is bounded by
priori maximum to avoid this problem, but bounding L .8141.96  0.0660.685 and U 0.8141.96
complicates the variance of the weighted mean, which is 0.066  0.943. Returning the log odds ratios to their orig-
no longer equation 14.3, and bounding produces a less inal metrics via the__
antilog transformation yields a com-
efficient (more variable) estimator. An alternative is to mon odds ratio of o  antilog(0.814)  2.26 with bounds
compute estimates separately for big and other stud- of L  antilog(0.685)  1.98 and U  antilog(0.943) 
ies, an option we do not pursue here only because the 2.57. The common odds ratio of 2.26 suggests that the
number of studies in table 14.1 is small. One might also odds that inoculated soldiers were immune to typhoid
redo analyses excluding one or more studiessay, the were twice the odds for soldiers not inoculated; and this
study with the largest sample size, or the smallest, or improvement in the odds over groups is statistically sig-
bothto assess stability of results. So, for example, if we nificant. However, homogeneity of effect size is rejected,
eliminate the samples with the largest (army in India) and with Q  230.317  (188.005)2/230.937  77.26, substan-
smallest (hospital staffs) number __ of subjects, the rate dif- tially larger than 14.86, the 99.5 percent critical value
ference increases somewhat to D  0.034, still signifi- of the chi-square distribution with 5 1 4 df; that is,
cantly different from zero and heterogeneous. the odds radios do not seem to be consistent over these
14.2.3.2 Fixed EffectsLog Odds Ratios Joseph five studies. The I 2  100%  [77.26  (5 1)]/ 77.26 
Fleiss distinguished two different cases in which we 94.82%, again suggesting nearly all the variation in effect
might want to combine odds ratios from multiple four- sizes is due to heterogeneity.
fold tables (1981). In the first, the number of studies to be 14.2.3.3 Fixed EffectsOdds Ratios with Mantel-
combined is small, but the within-study sample sizes per Haenszel The second case that Fleiss distinguishes is
study are large, as with the incidence data in table 14.1. when one has many studies to combine, but the within-
The proper measure of association to be combined is thus study sample size in each study is small. In this case, the
the log odds ratio li, the variance of which is general logic of equations 14.1 to 14.6 does not dictate
analysis. Instead, we compare the Mantel-Haenszel sum-
1 __
1 __
1 __
1,
vi  __
A B C D
mary estimate of the odds ratio:
i i i i
k

or, alternatively, __ w [ p i (1 pi2)]


i1
oMH  i____________
1
k
, (14.14)
i1
1
vi  ___________
i1 i1
1
___________
n p (1 p )  n p (1 p )
i2
,
i2 i2
(14.13) wi[ pi2(1 pi1)]
i 1
268 STATISTICALLY COMBINING EFFECT SIZES

where the proportions are as defined for 14.11 and where where Oi  Ai, Ei  [(Ai  Ci)/Ti](Ai  Bi), and Vi  Ei
n n [(Ti  (Ai  Ci)/Ti][(Ti  (Ai  Bi)/(Ti  1)]. The difference
wi  ______
i1 i2
n n
. (Oi  Ei) in the numerator is simply the difference be-
i1 i2
tween the observed number of treatment patients who ex-
A formula that is equivalent to 14.14, but computed perienced the outcome event, O, and the number expected
from cell sizes as indexed in table 14.3, is to have done so under the hypothesis that the treatment is
k no different from the control, E (Fleiss and Gross 1991,
__ A D /T i i i 131). In the denominator, Vi is the variance of the differ-
oMH  _______
i 1
k
, (14.15) ence (Oi  Ei), which can also be expressed as the product
BiCi / Ti
i 1
of the four marginal frequencies divided by Ti2(Ti 1), or
[(Ai  Ci) (Ai  Bi )(Bi  Di)(Ci  Di)]/[Ti2(Ti 1)]. The re-
where Ti is the total number of subjects in the sample. sult of this chi-square is compared with the critical value
Note that in the case of k  1 (when there is just one of chi-square with 1 df. Homogeneity may be tested using
study), the Tis cancel from the numerator and denomina- the same test developed for the log odds ratio. However,
tor of 14.15, which then reduces to the odds ratio defined
__ better homogeneity tests, although computationally com-
in 14.11. An estimate of the variance of the log of oMH plex, are available (Jones et al. 1989; Kuritz, Landis, and
(Robins, Breslow, and Greenland 1986; Robins, Green- Koch 1988; Paul and Donner 1989).
land, and Breslow 1986) is We illustrate the Mantel-Haenszel technique by apply-
k k k ing it to the fatality data in table 14.1. The Mantel-Haenszel
P R i i (P S  Q R )
i i i i Q S i i common odds ratio is 218.934/123.976  1.77; that is,
v  _______
i1
2 
i___________
1
 ______
i 1
, (14.16) the odds that inoculated subjects who contracted typhoid
( ) ( )( ) ( )
k k k k 2
2 R
i 1
i 2 R S
i 1
i
i 1
i 2 S
i 1
i
would survive were almost two times the odds that nonin-
oculated subjects who contracted typhoid would survive.
where Pi  (Ai  Di)/Ti, Qi  (Bi  Ci)/Ti, Ri  AiDi/Ti, and The variance of the log of this ratio is v  63.996/
Si  BiCi/Ti. When k  1, (14.16) reduces to the standard [2(218.934)2]  190.273/[2(218.934)(123.976)]  88.64/
equation for the variance of a log odds ratio in a study as [2(123.976)2]  0.00706, and the 95 percent confidence
defined in 14.13. Then, the 95 percent confidence inter- interval is bounded by L  antilog[log(1.77) 1.96 
__
val is bounded by__L  antilog[log(oMH)  1.96(v)1/2] and (0.007)1/2]  1.50 and U  antilog[log(1.77) 1.96 
U  antilog[log(oMH)  1.96(v)1/2]. (0.007)1/2]2.08. The Mantel-Haenszel 2(94.96.5)2/
To test the significance of the overall Mantel-Haenszel 191.31  46.64, p  0.00001, suggesting that inoculation
odds ratio, we can compute the Mantel-Haenszel chi- significantly increases the odds of survival among those
square statistic: who contracted typhoid. Finally, homogeneity of odds
ratios is not rejected over these six studies, Q  50.88 

{  } [(78.97)2/141.46]  6.80, df  5, p  0.24); a Q value this


k 2
[(ni1ni2)/(ni1  ni2)](pi1  pi2)  .5
i 1
large would be expected between 20 and 25 percent of the
2MH  ___________________________
k
, (14.17) time if all the population effect sizes were identical. The
{[(ni1ni2)/(ni1  ni2 1)]pi(1 pi)}
i 1
I 2  100%  [6.80  (6 1)]/6.80  26.47%, suggesting
that nearly half the variation in effect sizes is due to het-
where erogeneity.
The results of a Mantel-Haenszel analysis are valid
n p n p
pi  ___________
i1 i1
n n
i2 i2
. whether one is analyzing many studies with small sample
i1 i2
sizes or few studies with large sample sizes; but the re-
Alternatively, 14.17 can be written in terms of cell fre- sults of the log odds ratio analysis in section 14.2.3.2
quencies and marginals as are valid only in the latter case. Therefore, we can reana-

[ 
lyze the incidence data in table 14.1 using the Mantel-
]
k 2
(OiEi)  .5 Haenszel techniques and compare the results to the log
2MH  ______________
i1
, odds ratio analysis. The Mantel-Haenszel common odds
k
ratio for the incidence data is 537.007/205.823  2.61;
i V
i1 that is, the odds that inoculated subjects who contracted
COMBINING ESTIMATES OF EFFECT SIZE 269

typhoid would survive were two and a half times the Table 14.4 Effects of Psychological Intervention on
odds that noninoculated subjects who contracted typhoid Reducing Hospital Length of Stay (in days)
would survive. The variance of the log of this ratio is
Psychotherapy Control
v  125.355/ [2(537.007)2]  459.181/[2(537.007) Groups Group
 (205.823)] 158.294/ [2(205.823)2]  0.00416, Mean Mean Mean
and the 95 percent confidence interval is bounded by Study N Days SD N Days SD Difference
L  antilog[log(2.61) 1.96  (.004)1/2]  2.30 and U 
1 13 5.00 4.70 13 6.50 3.80 1.50
antilog[log(2.61) 1.96  (v.004)1/2]  2.96. The Mantel-
2 30 4.90 1.71 50 6.10 2.30 1.20
Haenszel 2  (331.185  .5)2/462.879  236.244, p  3 35 22.50 3.44 35 24.90 10.65 2.40
0.00001, suggesting that inoculation significantly increases 4 20 12.50 1.47 20 12.30 1.66 .20
the odds of survival among those who contracted typhoid. 5 10 3.37 .92 10 3.19 .79 .18
Finally, homogeneity of odds ratios is rejected over these 6 13 4.90 1.10 14 5.50 .90 .60
five studies, Q 230.373  [(188.188)2/231.173]  77.18, 7 9 10.56 1.13 9 12.78 2.05 2.22
which greatly exceeds 14.86, the 99.5 percent critical 8 8 6.50 .76 8 7.38 1.41 .88
value of the chi-square distribution at 5  1  4 df. Under
the Mantel-Haenszel techniques, therefore, the odds ratio source: Adapted from Mumford et al. 1984, table 1.
Includes only randomized trials that reported means and standard
was somewhat larger than under the log odds approach; deviations.
but the variance and Q test results were very similar, and
the confidence interval was about as wide under both
approaches. The I 2  100%  [77.18  (5  1)]/77.18  mogeneous, but different studies need not have the same
94.82%, so that nearly all the variation in effect sizes is variances). The latter can be estimated by the pooled
due to heterogeneity. within-group variance if the within-study sample sizes
and standard deviations of each group are reported:
14.2.4 Combining Data in Their Original Metric spi2  [ ( niT1 )( siT )2  ( niC 1 )( siC )2 ] / [ niT  niC  2 ],
On some occasions, the researcher may wish to combine
where (siT)2 is the variance of the treatment group in the
outcomes over studies in which the metric of the outcome
ith study, and (siC)2 is the variance of the control group or
variable is exactly the same across studies. Under such
the comparison treatment in the ith study. The pooled
circumstances, converting outcomes to the various effect-
variance can sometimes be estimated other ways. Squar-
size indicators might be not only unnecessary but also
ing the pooled standard deviation yields an estimate of
distracting since the meaning of the original variable may
the pooled variance. Lacking any estimate of the pooled
be obscured. An example might be a synthesis of the ef-
variance at all, one cannot use 14.1 to 14.6; but then one
fects of treatments for obesity compared with control
cannot compute a standardized mean difference, either.
conditions, where the dependent variable is always weight
As an example, Emily Mumford and her colleagues
in pounds or kilograms; in this case computing a stan-
synthesized studies of the effects of mental health treat-
dardized mean difference score for each study might be
ment on reducing costs of medical utilization, the latter
less clear than simply reporting outcomes as weight in
measured by the number of days of hospitalization (1984).
pounds. Even so, the researcher should still weight each
From their table 1, we extract eight randomized studies
study by a function of sample variance. If Ti is a mean
comparing treatment groups that received psychological
difference
__ __ between treatment and control group means, intervention with control groups, which report means,
Ti Y Ti  Y Ci , then an estimate of the variance of Ti is
within-study sample sizes, and standard deviations (see
vi  i2(1/niT 1/niC ), (14.18) table 14.4). The weighted average of the difference be-
tween the number of days the treatment groups were hos-
where niT is the within-study sample size in the treatment pitalized versus the number__ of days the control groups
group, niC is the within-study sample size for the control were hospitalized was T  14.70/27.24  0.54 days;
group, and i2 is the assumed common variance (note that that is, the treatment group left the hospital about half a
this assumes the two variances within each study are ho- day sooner than the control group. The variance of this
270 STATISTICALLY COMBINING EFFECT SIZES

effect is v  1/27.24  0.037, yielding a standard error this question exists. Statistically, rejection of homoge-
of 0.19. The 95 percent confidence interval is bounded by neity implies the presence of a variance component esti-
TL  0.54  1.96  (0.19)1/2  0.91
__ and TU  0.54  mate that is significantly different from zero, which
1.96(0.19)1/20.17, and ZT/v1/2 0.54/0.19  might, in turn, indicate the presence of random effects to
2.84 ( p  0.0048), both of which indicate that the effect be accounted for. However, this finding is not definitive
is significantly different from zero. However, the homo- because the homogeneity test has low power under some
geneity test closely approaches significance, with Q  circumstances, and because the addition of fixed-effects
[21.85  (14.70)2/27.24] 13.92, df  7, p  0.053, sug- covariates are sometimes enough to produce a nonsignifi-
gesting that mean differences may not be similar over cant variance component. Some authors argue that ran-
studies. This result is supported by the I 2  100%  dom-effects models are preferred on conceptual grounds
[13.92  (8  1)]/13.92  49.71%, suggesting that half the because they better reflect the inherent uncertainty in
variation in effect sizes is due to heterogeneity. meta-analytic inference, and because they reduce to the
fixed-effects model when the variance component is zero.
Larry Hedges and Jack Vevea noted that the choice of
14.3 RANDOM-EFFECTS MODELS model depends on the inference the researcher wishes to
make (1998). The fixed-effects model allows inferences
Under a fixed-effects model, the effect-size statistics, about what the results of the studies in the meta-analysis
Ti, from k studies estimate a population parameter would have been had they been conducted identically ex-
1  . . .  k  . The estimate Ti in any given study dif- cept for using other research participants (that is, except
fers from the  due to sampling error, or what we have for sampling error). The random-effects model allows
referred to as conditional variability; that is, because a inferences about what the results of the studies in the
given study used a sample of subjects from the popula- meta-analysis might have been had they been conducted
tion, the estimate of Ti computed for that sample will dif- with other participants but also with changes in other
fer somewhat from  for the population. study features such as in the treatment, setting, or outcome
Under a random-effects model, i is not fixed, but is measures. So the choice between fixed- and random-
itself random and has its own distribution. Hence, total effects models must be made after careful consideration
variability of an observed study effect-size estimate vi* of the applicability of all these matters to the particular
reflects both conditional variation vi of that effect size meta-analysis being done.
around each population i and random variation, 2, of Once the researcher decides to use a random-effects
the individual i around the mean population effect size: analysis, a first task is to determine whether the variance
component differs significantly from zero and, if it does,
vi*  2  vi . (14.19) then to estimate its magnitude. The significance of the
variance component is known if the researcher has al-
In this equation, we will refer to 2 as either the be- ready conducted a fixed-effects analysisif the Q test for
tween-studies variance or the variance component (others homogeneity of effect size (equation 14.6) was rejected,
sometimes refer to this as random effects variance); to vi the variance component differs significantly from zero. In
as either within-study variance or the conditional vari- that case, the remaining task is to estimate the magnitude
ance of Ti (the variance of an observed effect size condi- of the variance component so that 14.19 can be estimated.
tional on  being fixed at the value of i; others sometimes If the investigator has not conducted a fixed-effects anal-
call this estimation variance); and to vi* as the uncondi- ysis already, Q must be calculated in addition to estimat-
tional variance of an observed effect size Ti (others some- ing the variance component.
times call this the variance of estimated effects). If the Estimating the variance component can be done in
between-studies variance is zero, then the equations of many different ways (Viechtbauer 2005). A common
the random-effects model reduce to those of the fixed- method begins with the ordinary (unweighted) sample
effects model, with unconditional variability of an ob- estimate of the variance of the effect sizes, T1, . . . , Tk,
served effect size [vi*] hypothesized to be due entirely to computed as
conditional variability [vi] (to sampling error). k __
The reader may wonder how to choose between a fixed s2(T)  [(Ti  T )2/(k 1)], (14.20)
versus random effects model. No single correct answer to i 1
COMBINING ESTIMATES OF EFFECT SIZE 271
__
where T is the unweighted mean of T1 through Tk. A com- Lynn Friedman offered the following suggestions for
putationally convenient form of (14.20) is choosing between these two estimators (2000):
If Q is not rejected and 14.23 is negative, assume
[ ( ) / ]/
k k 2
s2(T)  Ti2  Ti k (k 1). 2  0. Equation 14.23 is positive, use the estimate
i 1 i 1 produced by equation 14.23, especially if the studies
in the meta-analysis each have small total sample
The expected value of s2(T), that is, the unconditional sizes  40.
variance we would expect to be associated with any par-
ticular effect size, is If Q is rejected and equation 14.23 is positive but
equation 14.22 is negative, use 14.23 because it is
k
more efficient when 2 is small. Equations 14.22 and
E[s2(T)]  2  (1/k) 2(Ti  i). (14.21) 14.23 are both positive, use 14.23 if Q  3(k 1),
i 1
and use 14.22 otherwise.
The observed variance s2(T) from equation (14.20) is
an unbiased estimate of E[s2(T)] by definition. To estimate So equation 14.22 should only be used if Q is rejected
2(Tii), we use the vi from equations 14.7, 14.8, 14.10, and is very large relative to the number of studies in the
14.12, 14.13, 14.16, or 14.18, depending on which effect- meta-analysis. Otherwise equation 14.23 should be used.
size indicator is being aggregated. With these estimates, Neither of the estimators 14.22, 14.23 described above
equation 14.21 can be solved to obtain an estimate of the are always optimal (Viechtbauer 2005). In particular,
variance component: 14.22 tends to be less efficient than 14.23, but 14.23
k works best when the conditional variances of each effect
2  s2(T)(1/k)vi . (14.22) size are homogenous, something unlikely to be the case
i1 in practice with sets of studies that have notably varying
This unbiased sample estimate of the variance compo- sample sizes. A restricted maximum likelihood estimator
nent will sometimes be negative, even though the popula- described by Wolfgang Viechtbauer (2005) tended to per-
tion variance component must be a positive number. In form better than either 14.22 or 14.23, but must be esti-
these cases, it is customary to fix the component to zero. mated with an iterative procedure not widely implemented
A problem with equation 14.22 is that it may be nega- in available software.
tive or zero even though the homogeneity statistic Q re- When the variance
__ component is significant, one may
jects the null hypothesis that 2  0. The second method also compute T*, the random effects weighted mean of
for estimating the variance component avoids that prob- T1, . . . , Tk, as an estimate of
, the average of the__ ran-
lem. It begins with Q as defined in equation 14.6, which dom effects in the population; v*, the variance
__ of T* (the
we now take as an estimate if the weighted sample esti- square root of v* is the standard error of T*); and random-
mate of the unconditional variance 2(Ti). The expected effects confidence limits L and U for
 by multiplying
value of Q is the standard error by an appropriate critical value (often,
1.96 at   .05), __ and adding and subtracting the resulting
E[Q]  c( 2)  (k 1), product from T*. In general, the computations follow
where equations 14.1 to 14.5, except that the following uncondi-
tional variance estimate is used in equations 14.2 through

[ / ]
k k k
c  wi  w w 2
i i . 14.5 in place of the conditional variances outlined in the
i 1 i 1 i 1 fixed-effects models:
Solving for 2 and substituting Q for its expectation
vi*  2 vi, (14.24)
gives another unbiased estimator of the variance compo-
nent: where 2 is the variance component estimate yielded
 [Q  (k 1)]/c.
2
(14.23) by equations 14.22 or 14.23 and vi is defined by 14.7,

14.8, 14.10, 14.12, 14.13, 14.16, or 14.18, as appropriate.
If homogeneity is rejected, this estimate of the vari- The square root of the variance component describes the
ance component will be positive, since Q must then be standard deviation of the distribution of effect parame-
bigger than k 1. ters. Multiplying that standard deviation by 1.96 (or an
272 STATISTICALLY COMBINING EFFECT SIZES

appropriate critical value of t), and adding and subtract-__ the random-effects model than under the fixed-effects.
ing the result from the average random-effect size T*, Depending on how large this change is, it is theoretically
yields the limits of an approximate 95 percent confidence possible to obtain a significant random-effects mean effect
interval. All these random-effects analyses assume that size when the fixed effects counterpart is not significant.
the random effects are normally distributed with constant Finally, because the weights in equations 14.1 and 14.2
variance, an assumption that is particularly difficult to as- under the random-effects model are computed using the
sess when the number of studies is small. unconditional variance in equation 14.24, these weights
In section 14.2, after introducing equation 14.5, we will tend toward equality over studiesno matter how
noted that Joachim Hartung and Guido Knapp proposed a unequal their within-study sample sizeas study hetero-
modification to random-effects estimates of significance geneity, and thus the variance component, increases.
tests and confidence intervals to take into account the un-
certainty of sample estimates of the conditional variance 14.3.1 Random EffectsStandardized
and the variance component (2001). Essentially, the mod- Mean Differences
ification presents a weighted version of equation 14.20:
k __
Estimating the conditional variance of di with equation
w
k 1 w (Ti  T ) ,
1
s2(T)  _____ __i 2 14.7 allows us to solve equations 14.20 through 14.24 to
i 1 estimate the variance component 2. If the variance com-
where w is the summation of all the individual wi. This ponent is larger than zero, then the weighted average of
estimate is then used in equation 14.22 and subsequent the random effects is computed using 14.1 with weights
computations of significance tests and confidence inter- as computed by equation 14.2, but where the variance for
vals, both using critical values of t at df  k 1 rather the denominator of 14.2 is estimated as equation 14.24:
than z. Doing so will decrease the power of these tests to
vi*  2  vi ,
reject the null hypothesis, but may result in more accurate
coverage of type I and II error rates. Chapter 15 in this where vi is defined as 14.7. Similarly, the variance of the
volume presents more details of this approach in the con- estimate of
 is given by equation 14.3 but using the esti-
text of fixed-effects categorical and regression analyses. mate of the unconditional variance, vi*; using the resulting
In summary, we can now see why the random effects variance estimate yields the 95 percent confidence inter-
model usually provides more conservative estimates of vals and Z test as defined in equations 14.4 and 14.5. If
the significance of average effect sizes over studies in the the variance component exceeds zero, the random-effects
presence of unexplained heterogeneity. Under both fixed- unconditional variance vi* will be larger than the fixed-
and random-effects models, significant unexplained het- effects variance vi, so the variance of the estimate and the
erogeneity results in rejection of Q, because Q is exactly confidence intervals will both be larger under random-
the same under both models. The variance component is effects models. Thus, Z will usually be smaller, except for
an index of how much variance remained unexplained. In cases like those described at the end of the previous sec-
the fixed-effects model, this unexplained variance is ig- tion, where it is theoretically possible for an increase in
nored in the computation of standard errors and confi- the random effects mean to offset an increase in the vari-
dence intervals. By contrast, in the random-effects model ance enough to increase Z in the random-effects model.
the unexplained heterogeneity represented by the vari- We apply these procedures to data set II, using the data
ance component is added to the conditional variance, and from nineteen studies on the effects of teacher expectancy
the sum of these is then used in computing standard errors on student IQ. Equation 14.20 yields a sample estimate of
and confidence intervals. With more variance, the latter the variance of s2(d)[2.8273(3.11)2/19]/18  0.1288.
will be usually larger, which often means the random- We then estimate
effects test is usually less likely to reject the null hypoth- k
esis than the fixed effects test. (1/k) 2(di  i)
However, the latter is not always true because the esti- i 1
mate of the mean effect under the random-effects model by averaging the vi as defined by equation 14.7 to obtain
also changes. In some circumstances, for example when k
the smallest effect sizes are associated with the largest (1/k)vi  0.04843,
sample sizes, the average effect size can be larger under i 1
COMBINING ESTIMATES OF EFFECT SIZE 273

and solve (14.22) for a variance component estimate of somewhat smaller than the variance component estimated
using equation 14.22. Recomputing the aggregated statis-
2  0.1288  0.04843  0.082, tics yields an average population effect size of
__
which is significantly different from zero because Q  *  28.68/321.08  0.0893,
35.83 exceeds 34.81, the 99 percent critical value of the
chi-square distribution for k  1  19 1  18 df (note with a standard error  (1/321.08)1/2  0.0558. The limits
that this is the same Q for homogeneity under a fixed- of the 95 percent confidence interval are L  0.089 
effects model). Recomputing the aggregate statistics under 1.96  0.0558  0.020 and U  0.089 1.96  0.0558 
the random effects__model then yields an average popula- 0.199, and Z  0.0893/0.0558 1.60.
tion effect size of d* 18.22/159.36  0.114, with a stan-
dard error  (1/159.36)1/2  0.079.
The limits of the 95 percent confidence interval is 14.3.2 Random EffectsCorrelations
14.3.2.1 Random Effectsz-Transformed Correla-
L  0.114 1.96  0.079  0.04
tions If we follow the z-transform procedure outlined in
and section 14.2.2.1, then equation 14.8 defines the condi-
tional variance of the individual z statistics, allowing us
U  0.114 1.96  .079  0.27, to solve equations 14.20 to 14.24. We apply this proce-
dure to data set III, which reported correlations from
and twenty studies between overall student ratings of the in-
__ structor and student achievement. Equation 14.20 yields
Z  d*/(v*1/2)  .114/0.0791.44, a sample estimate of the variance of
which does not exceed 1.96, so we do not reject the hy- s2(z)  0.0604.
pothesis that the average effect size in the population is
k
zero. These results suggest that though the central ten-
We then estimate (1/k) 2(zi  i) by averaging the vi as
dency over the nineteen studies is rather small at .114, i 1 k
population effect sizes do vary significantly from that ten- defined by 14.8 to obtain (1/k)vi  0.0640, and solve
dency. Note the much wider confidence interval relative i 1
equation 14.22 for a variance component estimate of 2
to the fixed-effects model analysis, reflecting the greater
0.0604  0.0640  0.0036, Q  20.94 (the same Q for
inherent variability of effect sizes under the random-
homogeneity under a fixed effects model), which at
effects model.
k 1  19 df is not significant ( p  0.34). Because this
Finally, the square root of the variance component in
variance component is negative, we set it equal to zero by
this data is 0.29, which we can take (under the untestable
convention. However, because Q is not significant, we
assumption of a normal distribution of effect sizes in the
prefer the weighted variance method (equation 14.23).
population) as the standard deviation of the population ef-
This yields c  530  27334/530  478.43, and then the
fect sizes. Multiplying it by 1.96 yields a 95 percent con-
variance component is 2[20.94 19]/478.43  0.0041,
fidence interval bounded by 0.45 to 0.67, suggesting that
also not significantly different from zero by the same Q
most effect sizes in the population probably fall into this
test. When the variance component is not significantly
range. Indeed, this mirrors the data in the sample, with all
different from zero, we will not recompute average effect
but two study effect sizes falling within this range.
sizes, standard errors, and confidence intervals; but the
However, in this data set Q  35.83  3(k  1), so
reader should recall our previous comments that some
following Friedmans guidelines, we would prefer the
analysts would choose to compute all these random effect
weighted variance model of equation 14.23. This yields
statistics even though the variance component is not sig-
c  751.2075  47697.39/751.2075  687.71, nificantly different from zero.
14.3.2.2 Random EffectsCorrelation Coefficients
and then the variance component is Directly If we choose to combine correlation coeffi-
cients, the conditional variance of ri is defined in 14.20 to
2  [(35.83  18)/687.71]  0.026,
14.24. We apply this procedure to data set III. Equation
274 STATISTICALLY COMBINING EFFECT SIZES

14.20 yields a sample estimate of the variance of the incidence data in table 14.1 Equation 14.20 yields a
s2(r) (1/19)[3.4884  (7.28)2/20]  0.0441. Using equa- sample estimate of variance of s2(l )  (1/4)[6.604 
tion 14.22, we then obtain a variance component of 2  (4.72)2/5]  0.537. Using equation 14.22, we then obtain
0.0441 0.0374  0.0067, which is not significantly dif- a variance component estimate of 2  .537  0.032 
ferent from zero Q  25.631, p  0.14. Again, because Q 0.505, which at k 1  4 df is significantly different
is not significant, we prefer the weighted variance method from zero, Q  77.2626. Recomputing the average _ effect-
of 14.23, which yields c  847.18  (60444.68/847.18)  size statistics yields an average log odds ratio l*  8.75/
775.83 and a variance component of 2  (25.63119)/ 9.32 0.94, a variance v* 1/9.32  0.107, standard error 
775.83  0.0085, also nonsignificant by the same Q test. 0.328, and a 95 percent confidence interval bounded by
L 0.94 1.96  0.328)  0.30 and U  0.94 1.96 
14.3.3 Random EffectsFourfold Tables 0.328) 1.58. As in section 14.2.3, we then reconvert
the common log odds ratios __
into the_ more interpretable
14.3.3.1 Random EffectsDifferences Between Pro- common odds ratio as o*  antilog( l*)  2.56, and L 
portions In the case of rate differences Di  pi1  pi2, the antilog (L)  1.35 and U  antilog(U)  4.85 are the
conditional variance is defined in 14.12. As an example, bounds of the 95 percent confidence interval.
we apply equations 14.20 to 14.24 to the incidence data Using equation 14.23, the weighted variance method
in table 14.1. Equation 14.20 yields a sample estimate of again yields similar results, with c  230.94  (17566.09/
the variance of s2(D)  (1/4)[0.0408  (0.3065)2/5]  230.94)  154.88, which then yields a variance compo-
0.0055. Using equation 14.22, we then obtain a variance nent estimate of 2  (77.264)/154.880.473, also sig-
component of 2  0.0055  0.0002  0.0053, which at nificant by the same Q test. Here we would again prefer
k 1  4 df is significantly different from zero, Q  the previous estimate from (14.22) because Q  77.26
562.296. Recomputing the aggregate__effect size statistics 3(k 1). However, to complete the example, recomputing
yields an average rate difference D*  53.00/912.49  the average_ effect-size statistics yields an average log
0.058, a variance v*  1/912.49  0.0011, standard error  odds ratio l*  9.30/9.90  0.94, a variance v* 1/9.90 
0.033, with a 95 percent confidence interval bounded by 1.02, standard error  0.319, and a 95 percent confidence
L  0.058  1.96  0.033  0.0069 and U  0.058  interval bounded by L  0.94 1.96  0.319  0.32 and
1.96  0.033  0.12. U  0.94 1.96  0.3191.56. As in section 14.2.3, we
Using equation 14.23, the weighted variance method then reconvert the common log odds __
ratio into _the more
yields c  393167.45, which then yields a variance com- interpretable common odds ratio as o*antilog(l*)2.56,
ponent estimate of 2  (562.296  4)/c  .0014, also sig- and the 95 percent confidence interval is then bounded by
nificant by the same Q test. Recomputing the average ef- L  antilog(L)  1.37 and U  antilog(U)  4.73.
fect size statistics
__ yield an average difference between
proportions of D.166.75/3170.42.053, a variance v.  14.3.4 Combining Differences Between
1/3170.42  .0003, a standard error  .018, and a 95 per- Raw Means
cent confidence interval bounded by L  .053  (1.96)
(.018)  .018 and U  .053  (1.96)(.018)  .087. The Equation 14.18 gives the conditional variance for differ-
weighted and unweighted methods yield very similar re- ences between two independent means. Using that esti-
sults; however, if we follow Friedmans guidelines (2000), mate in equations 14.20 to 14.24 yields the random-effects
we would prefer the estimate from equation 14.22 be- solution. For the data in table 14.4, equation 14.20 yields
cause Q  562.296 3(k 1). This preference makes a a sample estimate of the variance of s2(T)  (1/7)[15.59 
difference here because the 95 percent confidence interval (8.42)2/8]  0.96. Using equation 14.22, we then obtain a
includes zero under 14.22 but excludes it under 14.23. variance component of 2  0.96 1.01 0.05, which
14.3.3.2 Random EffectsOdds and Log Odds by convention we round to zero; since Q  13.92, and
Ratios The methods outlined in this section can be used p  0.053, however, this is very close to significance.
to test random effects models in both cases discussed Using equation (14.23), the weighted variance method
by Fleisscombining log odds ratios, and combining yields c  27.24  (138.75/27.24)  22.15, which then
odds ratios using Mantel-Haenszel methods. We use 2  (13.92  7)/22.15  0.31, which is preferable
yields
equation 14.13 to estimate the conditional variance of the according the Friedmans rules. Recomputing the aver-
log odds ratio and then use equations 14.20 to 14.24 with age effect-size statistics under the random-effects model
COMBINING ESTIMATES OF EFFECT SIZE 275

for the unweighted__variance estimate yields an average and the results so irregular, that it must be doubtful how
mean difference of T*  18.41/37.94  0.49 days, with much weight is to be attributed to the different results
v* 1/37.94  0.027, standard error  0.16, and a 95 per- (1904b, 1243). He went on to outline a host of possible
cent confidence interval bounded by L  0.49 1.96  sources of heterogeneity. For example, some of the data
0.16  0.80 and U  0.49 1.96  0.16 0.13. Re- came from army units which for a variety of reasons he
computing these same
__ statistics under the weighted vari- thought seem hardly comparable (1243); for example,
ance model yields T*  7.66/11.25  0.68 days, with the efficiency of nursing . . . must vary largely from hos-
v* 1/11.25  0.09, standard error  0.30, and a 95 per- pital to hospital under military necessities (1904a, 1663).
cent confidence interval bounded by L  0.68 1.96  He also noted that the samples differed in ways that might
0.30  1.27 and U  0.68 1.96  0.30  0.09. affect study outcome: the single regiments were very
rapidly changing their environment . . . the Ladysmith
garrison least, and Methuens column more mobile than
14.4 COMPUTER PROGRAMS the Ladysmith garrison, but far less than the single regi-
ments (1904b, 1244). Because it is far more easy to
Since the first edition of this handbook appeared more
be cautious under a constant environment than when the
than ten years ago, computer programs for doing meta-
environment is nightly changing, it might be that the
analysis have proliferated, including those that will do
apparent correlation between inoculation and immunity
the computations described in this chapter. The programs
arose from the more cautious and careful men volunteer-
range from freeware to shareware to commercial products,
ing for inoculation (1904b, 1244). Thus he called for
with the latter including both programs devoted exclu-
better-designed experimental inquiry, using volunteers,
sively to meta-analysis and additions to standard compre-
but while keeping a register of all men who volunteered,
hensive statistical packages. The first author maintains a
only to inoculate every second volunteer (1245); recall
web page with links to many such programs, along with
that Fishers work on randomized experiments had not
a brief description of each (http://faculty.ucmerced.edu/
yet been published (1925, 1926).
wshadish/Meta-Analysis%20Software.htm).
Wright, too, cited various sources of heterogeneity,
noting that the validity of typhoid diagnosis was undoubt-
14.5 CONCLUSIONS edly highly variable over units, partly because under or-
dinary circumstances the only practically certain criterion
We began this chapter with a discussion of the debate of typhoid fever is to be found in the intestinal lesions
between Pearson and Wright about the effectiveness of which are disclosed by post-mortem examination (1904a,
the typhoid vaccine. Hence it is fitting to return to this 1490). Such autopsies were not routine in all units, so
topic as we conclude. If we were to apply the methods measurements of fatality in some units were probably
outlined in section 14.2.2.1 to the data in table 14.1, we more valid than in others. Moreover, autopsy diagnosis is
would find that the typhoid vaccine is even less effective definitive, whereas physician diagnosis of incidence of
than Pearson concluded. For example, the weighted aver- typhoid was very unreliable. For these and other reasons,
age tetrachoric
__
correlation between immunity and inocu- Wright questioned the validity of measurement in these
lation is   0.132(Z  56.41) and the weighted average studies, chiding Pearson for neglecting to make inquiry
tetrachoric correlation between inoculation __
and survival into the trustworthiness of . . . the comparative value of a
among those who contract typhoid is   0.146(Z  death-rate and an incidence-rate of the efficacy of an in-
17.199). Both these averages are smaller than the un- oculation process (1904b, 1727).
weighted averages reported by Pearson of 0.23 and 0.19, Methods for exploring such sources of heterogeneity
respectively. It is ironic in retrospect, then, that Wright are the topic of the next several chapters of this book.
was right: subsequent developments showed that the vac- Unfortunately, neither Pearson nor Wright provided more
cine works very well indeed. precise information about which studies were associated
Yet this possibility was not lost on Pearson. Anticipat- with which sources of heterogeneity in the data they ar-
ing informally the results of the formal Q tests, which re- gued about. Hence, use of the Pearson-Wright example
ject homogeneity of effect size for these data (Q  1740.36, ends here. But the issues are timeless. Pearson and Wright
and Q  76.833, respectively), Pearson said early in his understood them over a century ago; they lacked only the
report that the material appears to be so heterogeneous, methods to pursue them.
276 STATISTICALLY COMBINING EFFECT SIZES

Guo, Jianhua, Yanping Ma, Ning-Zhong Shi, and Tai Shing


14.6 REFERENCES
Lau. 2004. Testing for Homogeneity of Relative Differen-
Berlin, Jesse A., Nan M. Laird, Henry S. Sacks, and Tho- ces under Inverse Sampling. Computational Statistics and
mas C. Chalmers. 1989. A Comparison of Statistical Meth- Data Analysis 44(4): 513624.
ods for Combining Event Rates from Clinical Trials. Sta- Haddock, C. Keith, David Rindskopf, and William R. Sha-
tistics in Medicine 8(2): 14151. dish. 1998. Using Odds Ratios as Effect Sizes for Meta-
Blitstein, Jonathan L., David M. Murray, Peter J. Hannan Analysis of Dichotomous Data: A primer on Methods and
and William R Shadish,. 2005. Increasing the Degrees of Issues. Psychological Methods, 3(3): 33953.
Freedom in Existing Group Randomized Trials: The df* Hardy, Rebecca J., and Simon G. Thompson. 1998. Detec-
Approach. Evaluation Review 29(3): 24167. ting and Describing Heterogeneity in Meta-Analysis. Sta-
Bohning, Dankmar, Uwe Malzahn, Ekkehart Dietz, Peter tistics in Medicine 17(8): 84156.
Schlattmann, Chukiat Viwatwongkasem, and Annibale Hartung, Joachim, and Guido Knapp. 2001. On Tests of the
Biggeri. 2002. Some General Points in Estimating Hetero- Overall Treatment Effect in Meta-Analysis with Normally
geneity Variance with the DerSimonian-Laird Estimator. Distributed Responses. Statistics in Medicine 20(12):
Biostatistics 3(4): 44557. 177182.
Boissel, Jean-Pierre, Jean Blanchard, Edouard Panak, Jean- Harwell, Michael. 1997. An Empirical Study of Hedges
Claude Peyrieux, and Henry Sacks. 1989. Considerations homogeneity Test. Psychological Methods 2(2): 21931.
for the Meta-Analysis of Randomized Clinical Trials. Hedges, Larry V. 1983. A Random Effects Model for Effect
Controlled Clinical Trials 10(3): 25481. Sizes. Psychological Bulletin 93(2): 33895.
Bollen, Kenneth A. 1989. Structural Equations with Latent Hedges, Larry V., and Ingram Olkin. 1985. Statistical Methods
Variables. New York: John Wiley & Sons. for Meta-Analysis. Orlando, Fl.: Academic Press.
Chalmers, Thomas C., Henry Smith Jr., Bradley Blackburn, Hedges, Larry V., and Therese D Pigott. 2001. The Power of
Bernard Silverman, Biruta Schroeder, Dinah Reitman, and Statistical Tests in Meta-Analysis. Psychological Methods
Alexander Ambroz. 1981. A Method for Assessing the 6(3): 20317.
Quality of a Randomized Control Trial. Controlled Clini- Hedges, Larry V., and Jack L. Vevea. 1998. Fixed- and
cal Trials 2(1): 3149. Random-Effects Models in Meta-Analysis. Psychological
Cowlesm Michael. 1989. Statistics in Psychology: An Histo- Methods 4(3): 486504.
rical Perspective. Hillsdale, N.J.: Lawrence Erlbaum. Heinsman, Donna T., and William R. Shadish. 1996.
DerSimonian, Rebecca, and Nan Laird. 1986. Meta-Ana- Assignment Methods in Experimentation: When do Non-
lysis in Clinical Trials. Controlled Clinical Trials 7(3): randomized Experiments Approximate the Answers from
17788. Randomized Experiments? Psychological Methods 1(2):
Field, Andy P. 2005. Is the Meta-Analysis of Correlation 15469.
Coefficients Accurate when Population Correlations Vary? Higgins, Julian P. T., and Simon G. Thompson. 2002. Quan-
Psychological Methods 10(4): 44467. tifying Heterogeneity in a Meta-Analysis. Statistics in Me-
Fisher, Ronald A. 1925. Statistical Methods for Research dicine 21(11): 153958.
Workers. Edinburgh: Oliver and Boyd. Higgins, Julian P. T., Simon G. Thomson, Jonathan J. Deeks,
Fisher, Ronald A. 1926. The Arrangement of Field Experi- and Douglas G. Altman. 2003. Measuring Inconsistency
ments. Journal of the Ministry of Agriculture of Great Bri- in Meta-Analyses. British Medical Journal 327(6 Septem-
tain 33: 50513. ber): 55760.
Fleiss, Joseph L. 1981. Statistical Methods for Rates and Huedo-Medina, Tania B., Julio Sanchez-Meca, J., Fulgencio
Proportions, 2nd ed.. New York: John Wiley & Sons. Marin-Martinez, and Juan Botella. 2006. Assessing Hete-
Fleiss, Joseph L., and Alan J. Gross. 1991. Meta-Analysis rogeneity in Meta-Analysis: Q Statistic or I 2 Index? Psy-
in Epidemiology, with Special Reference to Studies of the chological Methods 11(2): 193206.
Association Between Exposure to Environmental Tobacco Hunter, Jack E., and Frank L. Schmidt. 2004. Methods of
Smoke and Lung Cancer: A Critique. Journal of Clinical Meta-Analysis: Correcting Errors and Bias in Research
Epidemiology 44(2): 12739. Findings 2nd ed. Thousand Oaks, Calif.: Sage Publications.
Friedman, Lynn. 2000. Estimators of Random Effects Va- Jackson, Dan. 2006. The Power of the Standard Test for the
riance Components in Meta-Analysis. Journal of Educa- Presence of Heterogeneity in Meta-Analysis. Statistics in
tional and Behavioral Statistics 25(1): 112. Medicine 25(15): 268899.
COMBINING ESTIMATES OF EFFECT SIZE 277

Jones, Michael P., Thomas W. OGorman, T. W., Jon H. . 1904b. Report on Certain Enteric Fever Inocula-
Lemke, J. H., and Robert F. Woolson. 1989. A Monte tion Statistics. British Medical Journal 2: 124346.
Carlo Investigation of Homogeneity Tests of the Odds Ratio Raudenbush, Stephen W., and Anthony S. Bryk. 2001. Hie-
under Various Sample Size Configurations. Biometrics rarchical Linear Models, 2nd ed. Thousand Oaks, Calif.:
45(1): 17181. Sage Publications.
Jni, Peter, Anne Witschi, Ralph Bloch, and Matthias Egger. Reis, Isildinha M., Karim F. Kirji, and Abdelmonem A. Afifi.
1999. The Hazards of Scoring the Quality of Clinical 1999. Exact and Asymptotic Tests for Homogeneity in Se-
Trials for Meta-Analysis. Journal of the American Medi- veral 2  2 Tables. Statistics in Medicine 18(8): 893906.
cal Association, 282(11): 105460. Roberts, Royston M. 1989. Serendipity: Accidental Discove-
Knapp, Guido, Brad J. Biggerstaff and Joachim Hartung. ries in Science. New York: John Wiley & Sons.
2006. Assessing the Amount of Heterogeneity in Ran- Robins, James, Norman Breslow, and Sander Greenland.
dom-Effects Meta-Analysis. Biometrical Journal 48(2): 1986. Estimators of the Mantel-Haenszel Variance Consis-
27185. tent in Both Sparse Data and Large-Strata Limiting Mo-
Kulinskaya, E., M. B. Dollinger, E. Knight, and H. Gao. dels. Biometrics 42(2): 31123.
2004. A Welch-type test for homogeneity of contrasts Robins, James, Sander Greenland, and Norman Breslow.
under heteroscedasticity with application to meta-analysis. 1986. A General Estimator for the Variance of the Mantel-
Statistics in Medicine 23(23): 365570. Haenszel Odds Ratio. American Journal of Epidemiology
Kuritz. Stephen J., J. Richard Landis, and Gary G. Koch. 124: 71923.
1988. A General Overview of Mantel-Haenszel Methods: Shadish, William R., and Kevin Ragsdale. 1996. Random
Applications and Recent Developments. Annual Review of Versus Nonrandom Assignment in Psychotherapy Experi-
Public Health 9: 12360. ments: Do You Get the Same Answer? Journal of Consul-
Law, Kenneth S. 1995. Use of Fishers Z in Hunter-Sch- ting and Clinical Psychology 64(6): 12901305.
midtType Meta-Analyses. Journal of Educational and Spector, Paul E., and Edward L. Levine. 1987. Meta-Ana-
Behavioral Statistics 20(3): 287306. lysis for Integrating Study Outcomes: A Monte Carlo Study
Lipsey, Mark W., and David B. Wilson. 1993. The Efficacy of its Susceptibility to Type I and Type II Errors. Journal
of Psychological, Educational, and Behavioral Treatment: of Applied Psychology 72(1): 39.
Confirmation from Meta-Analysis. American Psycholo- Susser, Mervyn. 1977. Judgment and Causal Inference:
gist 48(2): 11811209. Criteria in Epidemiologic Studies. American Journal of
Mumford, Emily, Herbert J. Schlesinger, Gene V. Glass, Epidemiology 105(1): 115.
Cathleen Patrick, and Timothy Cuerdon. 1984. A New Takkouche, Bahi, Carmen Cadarso-Suarez, and Donna
Look at Evidence about Reduced Cost of Medical Utiliza- Spiegelman. 1999. Evaluation of Old and New Tests of
tion Following Mental Health Treatment. American Jour- Heterogeneity in Epidemiological Meta-Analysis. Ameri-
nal of Psychiatry 141(10): 114558. can Journal of Epidemiology 150(2): 20615.
Paul, Sudhir, and Alan Donner. 1989. A Comparison of Wright, Almroth E. 1904a. Antityphoid Inoculation. Bri-
Tests of Homogeneity of Odds Ratios in k 2  2 Tables. tish Medical Journal 2(JulDec): 148991.
Statistics in Medicine 8(12): 145568. . 1904b. Antityphoid Inoculation. British Medical
Paul, Sudhir, and S. Thedchanamoorthy. 1997. Likelihood- Journal 2(JulDec): 1727.
Based Confidence Limits for the Common Odds Ratio. Viechtbauer, Wolfgang. 2005. Bias and Efficiency of
Journal of Statistical Planning and Inference 64(1): 8392. Meta-Analytic Estimators in the Random-Effects Model.
Pearson, Karl. 1904a. Antityphoid Inoculation. British Journal of Educational and Behavioral Statistics 30(3):
Medical Journal 2(2296): 166768. 26193.
15
ANALYZING EFFECT SIZES: FIXED-EFFECTS MODELS

SPYROS KONSTANTOPOULOS LARRY V. HEDGES


Boston College Northwestern University

CONTENTS
15.1 Introduction 280
15.2 Analysis of Variance for Effect Sizes 280
15.2.1 The One-Factor Model 280
15.2.1.1 Notation 281
15.2.1.2 Means 281
15.2.1.3 Tests of Heterogeneity 282
15.2.1.4 Computing the One-Factor Analysis 282
15.2.1.5 Standard Errors of Means 285
15.2.1.6 Tests and Confidence Intervals 285
15.2.1.7 Comparisons or Contrasts Among Mean Effects 286
15.2.2 Multifactor Models 288
15.2.2.1 Computation of Omnibus Test Statistics 288
15.2.2.2 Multifactor Models with Three or More Factors 288
15.3 Multiple Regression Analysis for Effect Sizes 289
15.3.1 Models and Notation 289
15.3.2 Estimation and Significance Tests 289
15.3.3 Omnibus Tests 290
15.3.4 Collinearity 291
15.4 Quantifying Explained Variance 292
15.5 Conclusion 293
15.6 References 293

279
280 STATISTICALLY COMBINING EFFECT SIZES

15.1 INTRODUCTION is explained by the data analysis model. These tests can
be conceived as tests of model specification. If a fixed-
A central question in research synthesis is whether meth- effects model explains all of the variation in effect-size
odological, contextual, or substantive differences in re- parameters, the (fixed-effect) model is unquestionably
search studies are related to variation in effect-size pa- appropriate. Models that are well specified can provide
rameters. a strong basis for inference about effect sizes in fixed-
Both fixed- and random-effects statistical methods are effects models, but are not essential for inference from
available for studying the variation in effects. The choice them. If differences between studies that lead to differ-
of which to use is sometimes a contentious issue in both ences in effects are not regarded as random (for exam-
meta-analysis as well as primary analysis of data. The ple, if they are regarded as consequences of purposeful
choice of statistical procedures should primarily be deter- design decisions) then fixed effects methods may be ap-
mined by the kinds of inference the synthesist wishes to propriate for the analysis. Similarly fixed-effects analy-
make. Two different inference models are available, some- ses are appropriate if the inferences desired are regarded
times called conditional and unconditional inference (see as conditionalapplying only to studies like those under
Hedges and Vevea 1998). The conditional model attempts examination.
to make inference about the relation between covariates
and the effect-size parameters in the studies that are ob-
served. In contrast, the unconditional model attempts to 15.2 ANALYSIS OF VARIANCE FOR EFFECT SIZES
make inferences about the relation between covariates and
One of the most common situations in research synthesis
the effect-size parameters in the population of studies
arises when the effect sizes can be sorted into indepen-
from which the observed studies are a representative sam-
dent groups according to one or more characteristics of
ple. Fixed-effects statistical procedures are well suited to
the studies generating them. The analytic questions are
drawing conditional inferences about the observed studies
whether the groups (average) population effect sizes vary
(see, for example, Hedges and Vevea 1998). Random- or
and whether the groups are internally homogeneous, that
mixed-effects statistical procedures are well suited to
is, whether the effect sizes vary within the groups. Alter-
drawing unconditional inferences (inferences about the
natively, we could describe the situation as one of explor-
population of studies from which the observed studies are
ing the relationship between a categorical independent
a sample. Fixed-effects statistical procedures may also be
variable (the grouping variable) and effect size.
a reasonable choice when the number of studies is too
Many experimental social scientists have encountered
small to support the effective use of mixed- or random-
the statistical method of analysis of variance (ANOVA) as
effects models.
a way of examining the relationship between categorical
In this chapter, we present two general classes of fixed-
independent variables and continuous dependent variables
effects models. One class is appropriate when the inde-
in primary research. Here we describe an analogue to the
pendent (study characteristic) variables are categorical.
analysis of variance for effect sizes in the fixed-effect
This class is analogous to the analysis of variance, but is
case (for technical details, see Hedges 1982b; Hedges and
adapted to the special characteristics of effect-size esti-
Olkin 1985).
mates. The second class is appropriate for either discrete
or continuous independent (study characteristic) vari-
ables and therefore technically includes the first class as a 15.2.1 The One-Factor Model
special case. This second class is analogous to multiple
regression analysis for effect sizes. In both cases, we de- Situations frequently arise in which we wish to determine
scribe the models along with procedures for estimation whether a particular discrete characteristic of studies is
and hypothesis testing. Although some formulas for hand related to effect size. For example, we may want to know
computation are given, we stress computation using widely if the type of treatment is related to its effect or if all
available packaged computer programs. variations of the treatment produce essentially the same
Tests of goodness of fit are given for each fixed-effect effect. The analogue to one-factor analysis of variance for
model. They test the notion that there is no more variabil- effect sizes is designed to answer just such questions. We
ity in the observed effect sizes than would be expected if present this analysis in somewhat greater detail than its
all (100 percent) of the variation in effect size parameters multifactor extensions both because it illustrates the logic
ANALYZING EFFECT SIZES: FIXED-EFFECTS MODELS 281

of more complex analyses and because it is among the Table 15.1 Effect Size Estimates and Sampling
simplest and most frequently used methods for analyzing Variances for p Groups of Studies
variation in effect sizes.
15.2.1.1 Notation In the discussion of the one-factor Effect Size Estimates Variances
model, we use a notation emphasizing that the indepen-
dent effect size estimates fall into p groups, defined a Group 1
Study 1 T11 v11
priori by the independent (grouping) variable. Suppose Study 2 T12 v12
that there are p distinct groups of effects with m1 effects . . .
in the first group, m2 effects in the second group, . . . , and . . .
mp effects in the pth group. Denote the jth effect parame- . . .
ter in the ith group by ij and its estimate by Tij with (con- Study m1 T1m v1m
variance vij. That is, Tij estimates ij with standard
1 1
ditional)
__ Group 2
error vij . In most cases, vij will actually be an estimated Study 1 T21 v21
variance that is a function of the within-study sample size Study 2 T22 v22
and the effect-size estimate in study j. However, unless . . .
the within-study sample size is exceptionally small, we . . .
can treat vij as known. Therefore, in the rest of this chap- . . .
ter we assume that vij is known for each study. The sample Study m2 T2m 2
v2m 2

data from the collection of studies can be represented as Group p


in table 15.1. Study 1 Tp1 vp1
15.2.1.2 Means Using the dot notation from the anal- Study 2 Tp2 vp2
ysis of variance, . . .
__ define the group mean effect estimate for
the ith group Ti by . . .
mi
. . .

__ w T ij ij
Study mp Tpm p
vpm p

j 1
T i  ______
m ,
i
i  1, . . . , p (15.1) source: Authors compilation.
w ij
j 1

where the weight wij is simply the reciprocal of the vari- __


ance of Tij, Thus, Ti is simply the weighted mean that would be
computed by__ applying formula (15.1) to the studies in
wij  1/vij. (15.2) group i and T is the weighted mean that would be ob-
__ tained by applying formula (15.1) to all of the studies. If
The grand weighted mean T is
p mi
all of the studies in group i estimate a common effect-size __
w T ij ij
parameter i, that is if i1  i2  . . .  im  i, then Ti i
__ i 1 j 1 estimates i. If the studies within the ith group do not es-
T  ________
p m
. (15.3) __
i
timate a common effect parameter, then Ti estimates the
w
i 1 j 1
ij
weighted mean of the effect-size parameters ij given by
__
The grand mean T could
__ also be__ seen as the weighted
mi

mean of the group means T1, . . . , Tp __ w


ij ij
j 1
______
p __
i  m i
, i 1, . . . , p. (15.4)
__ w T i i wij
j 1
T  ______
i 1
p
,
w i Similarly, if all of the
__ studies in the collection estimate
j 1
a common parameter , that is, if
where the weight wi is just the sum of the weights for the
__
ith group 11  . . .  1m  21  . . .  pm  ,
1 p
__ __
wi  wi1  . . .  wim . i
then T estimates  .
282 STATISTICALLY COMBINING EFFECT SIZES
__
If the studies do not all estimate the parameter,__then T null hypothesis of no variation across group mean effect
can be seen as an estimate of a weighted mean  of the sizes is true, QB has a chi-square distribution with ( p 1)
effect parameters given by degrees of freedom. Hence we test H0 by comparing the
p mi obtained value of QB with the upper tail critical values
__ w
ij ij
i 1 j 1
of the chi-square distribution with ( p  1) degrees of
  ________ . (15.5) freedom. If QB exceeds the 100(1  ) percent point of
p mi

ij
w
i 1 j 1
the chi-square distribution (for example, C.05  18.31 for
__ 10 degrees of freedom and   0.05), H0 is rejected at
Alternatively,
__  can be viewed as a weighted mean of level  and between group differences are significant.
the i: This test is analogous to the omnibus F-test for varia-
p __ tion in group means in a one-way analysis of variance in
__ w  i i a primary research study. It differs in that QB, unlike the
  ______
i 1
p
. 282

F-test, incorporates an estimate of unsystematic error in


w
j 1
i the form of the weights. Thus no separate error term (such
as the mean square within groups as in the typical analy-
where wi is just the sum of the weights __ wij for the ith sis of variance) is needed and the sum of squares can be
group as in the alternate expression for T above. used directly as a test statistic.
15.2.1.3 Tests of Heterogeneity In the analysis of The test based on equation 15.6 depends entirely on
variance, tests for systematic sources of variance are con- theory that is based on the approximate sampling distri-
structed from sums of squared deviations from means. butions of the Tij, and on the fact that any differences
That the effects attributable to different sources of vari- among the ij are not random.
ance partition the sums of squares leads to the interpre- An Omnibus Test for Within-Group Variation in Effects.
tation that the total variation about the grand mean is To test the hypothesis that there is no variation of popu-
partitioned into parts that arise from between-group and lation effect sizes within the groups of studies, that is
within-group sources. The analysis of variance for effect to test
sizes has a similar interpretation. The total heterogeneity __
statistic Q  QT (the weighted total sum of squares about 11  . . .  1m   1
1
__
the grand mean) is partitioned into a between-groups-of- 21  . . .  2m   2
studies part QB (the weighted sum of squares of group 2

means about the grand mean) and a within-groups-of- H0: . . .


studies part QW (the total of the weighted sum of squares . . .
of the individual effect estimates about the respective
group means). These statistics QB and QW yield direct om- . . .
__
nibus tests of variation across groups in mean effects and p1  . . .  pm   p
p
variation within groups of individual effects.
An Omnibus Test for Between-Groups Differences. To use the within-group homogeneity statistic QW given by
test the hypothesis that there is no variation in group p mi __
mean effect sizes, that is, to test QW  wij(Tij  T i)2 (15.7)
i 1 j1
__ __ __
H0 : 1 2  . . .  p, where the wij are the reciprocals of the vij, the sampling
variances of the Tij. When the null hypothesis of perfect
we use the between group heterogeneity statistic QB de- homogeneity of effect size parameters is true, QW has a
fined by chi-square distribution with (k  p) degrees of freedom
p __ __ where k  m1  m2  . . .  mp is the total number of
QB  wi(T iT )2 (15.6) studies. Therefore within-group homogeneity at signifi-
i 1 cance level  is rejected if the computed value of QW ex-
where wi is the reciprocal of the variance of vi. Note ceeds the 100(1  ) percent point (the upper tail critical
that QB is just the weighted sum of squares of group mean value) of the chi-square distribution with (k  p) degrees
effect sizes about the grand mean effect size. When the of freedom.
ANALYZING EFFECT SIZES: FIXED-EFFECTS MODELS 283

Although QW provides an overall test of within-group Table 15.2 Heterogeneity Summary Table
variability in effects it is actually the sum of p separate
(and independent) within-group heterogeneity statistics, Degrees of
one for each of the p groups of effects. Thus Source Statistic Freedom

QW  QW  QW  . . .  QW
1 2 p
(15.8) Between groups QBET p1
Within groups
where each QW is just the heterogeneity statistic Q given
i Within group 1 QW m11
in formula 14.6 of chapter 14 in this volume. In the nota- m21
1
Within group 2 QW 2
tion used here, QW is given by
i
. . .
mi __ . . .
QW  wij (Tij  T i)2. (15.9) . . .
i
j 1 Within group p QW p
. mp1
These individual within-group statistics are often use- Total within groups QW kp
ful in determining which groups are the major sources of Overall Q k1
within-group heterogeneity and which groups have rela-
tively homogeneous effects. For example in analyses of source: Authors compilation.
the effects of study quality in treatment/control studies, note: Here k  m1  m2  . . .  mp.
study effect estimates might be placed into two groups:
those from quasi-experiments and those from random- to select independent (grouping) variables that explain
ized experiments. The effect sizes within the two groups variation (heterogeneity) so that most of the total hetero-
might be quite heterogeneous overall, leading to a large geneity is between-groups and relatively little remains
QW, but most of that heterogeneity might arise within the within groups of effects. Of course, the grouping variable
group of quasi-experiments so that QW (the statistic for
1
must, in principle, be chosen before examination of the
quasi-experiments) would indicate great heterogeneity, effect sizes to ensure that tests for the significance of
but QW (the statistic for randomized experiments) would
2
group effects do not capitalize on chance.
indicate relative homogeneity (see Hedges 1983a). 15.2.1.4 Computing the One-Factor Analysis Al-
If the effect-size parameters within the ith group of though QB and QW can be computed using a computer
studies are homogeneous, that is if i1  . . .  im , then
i
program for weighted ANOVA, the weighted cell means
QW has the chi-square distribution with mi  1 degrees of
i
and their standard errors generally cannot. Computational
freedom. Thus the test for homogeneity of effects within formulas can greatly simplify direct calculation of QB and
the ith group at significance level  consists of rejecting QW as well as the cell means and their standard errors.
the hypothesis of homogeneity if Qw exceeds the 100i
These formulas are analogous to those used in analyzing
(1  ) percent point of the chi-square distribution with variance and enable all of the statistics to be computed in
(mi 1) degrees of freedom. one pass through the data (for example, by a packaged
It is often convenient to summarize the relationships computer program). The formulas are expressed in terms
among the heterogeneity statistics via a table analogous of totals (sums) across cases of the weights, of the weights
to an ANOVA source table (see table 15.2). times the effect estimates, and of the weights times the
Partitioning Heterogeneity. There is a simple relation- squared effect estimates. Define
ship among the total homogeneity statistic Q given in for-
TWi  j 1wij, TW  i 1TWi,
mi p
mula (15.6) and the between-and within-group fit statis-
tics discussed in this section. This relationship corresponds
TWDi  j 1wijTij , TWD  i 1TWDi,
mi p
to the partitioning of the sums of squares in ordinary
analysis of variance. That is,
TWDSi  j 1wijTij2, TWDS  i1TWDSi,
mi p

Q  Q B  QW .
where wij  1/vij is just the weight for Tij. Then the overall
One interpretation is that the total heterogeneity about heterogeneity statistic Q is
the mean Q is partitioned into between-group heteroge-
neity QB and within-group heterogeneity QW. The ideal is Q  TWDS  (TWD)2/TW.
284 STATISTICALLY COMBINING EFFECT SIZES

Table 15.3 Data for Male-Authorship Example

% Male
Study Authors Group # Items T v w wT wT 2

1 25 1 2 .33 .029 34.775 11.476 3.787


2 25 1 2 .07 .034 29.730 2.081 .146
3 50 2 2 .30 .022 45.466 13.640 4.092
4 100 3 38 .35 .016 62.233 21.782 7.624
5 100 3 30 .70 .066 15.077 10.554 7.388
6 100 3 45 .85 .218 4.586 3.898 3.313
7 100 3 45 .40 .045 22.059 8.824 3.529
8 100 3 45 .48 .069 14.580 6.998 3.359
9 100 3 5 .37 .051 19.664 7.275 2.692
10 100 3 5 .06 .032 31.218 1.873 0.112

source: Authors compilations.

Each of the within-group heterogeneity statistics is group paradigm. The effect sizes were standardized mean
given by differences which were classified into three groups on the
basis of the percentage of the authors of the research re-
QW  TWDSi  (TWDi )2/TWi,
i
i  1, . . . , p. port who were male. Group 1 consisted of two studies
with 25 percent male authorship, group 2 of a single study
The overall within group homogeneity statistic is with 50 percent, and group 3 of seven studies with all
obtained as QW  QW  QW  . . .  QW . The between
1 p
male authorship. The data are given in table 15.3.
groups heterogeneity statistic is obtained as QB  Q  QW.
2

Using formula 14.6 from chapter 14 of this volume),


The weighted overall mean effects and its variance are and the sums given in table 15.3, the overall heterogene-
__ ity statistic Q is
T  TWD / TW, v  1/TW,
Q  36.040  (34.423)2 / 279.388  31.799.
and the weighted group means and their variances are
Using the sums given in table 15.3, the within-group
vi  TWDi / TWi, i  1, . . . , p,
heterogeneity statistics QW , QW , and QW are
1 2 3

and
QW  3.933  (9.395)2/64.505  2.565,
1

vi 1/TWi, i 1, . . . , p. QW  4.092  (13.640)2/45.466  0.0,


2

The omnibus test statistics Q, QB, and QW can also be QW  28.015  (57.458)2/169.417  8.528.
3

computed using a weighted analysis of variance program.


The grouping variable is used as the factor in the weighted The overall within-group heterogeneity statistic is
ANOVA, the effect-size estimates are the observations, therefore
and the weight given to effect size Tij is just wij. The
weighted between-group or model sum of squares is ex- QW  2.565  0.0  8.528  11.093.
actly QB, the weighted within-group or residual sum of
squares is exactly QW, and the corrected total sum of Because 11.093 does not exceed 14.07, the 95 percent
squares is exactly Q. point of the chi-square distribution with 10  3  7 de-
Example. Alice Eagly and Linda Carli (1981), in data grees of freedom, we do not reject the hypothesis that the
set I, reported a synthesis of ten studies of gender differ- effect-size parameters are homogeneous within the groups.
ences in conformity using the so called fictitious norm In fact, a value this large would occur between 10 and 25
ANALYZING EFFECT SIZES: FIXED-EFFECTS MODELS 285

percent of the time due to chance even with perfect homo- Table 15.4 Quantities Used to Compute Male-
geneity of effect parameters. Thus we conclude that the Authorship Example Analysis
effect sizes are relatively homogeneous within groups.
The between group heterogeneity statistic is calcu- w wT wT 2
lated as
Group 1 (25% studies) 64.505 9.395 3.933
QB  Q  QW  31.799 11.093  20.706. Group 2 (50% studies) 45.466 13.640 4.092
Group 3 (100% studies) 169.417 57.458 28.015
Because 20.706 exceeds 5.99, the 95 percent point of Over all Groups 279.388 34.423 36.040
the chi-square distribution with 3  1  2 degrees of free-
source: Authors compilation.
dom, we reject the null hypothesis of no variation in ef-
fect size across studies with different proportions of male
authors. In other words, the relationship between the per-
centage of male authors and effect size is statistically sig- and reject H0 at level  (that is, decide that the effect pa-
nificant. rameter differs from 0) if the absolute value of Zi exceeds
15.2.1.5 Standard Errors of Means The sampling the 100 percent critical value of the standard normal __
variances
__ __v1, . . . , vp of the group mean effect estimates distribution. For example, for a two-sided test that i  0
T1, . . . , Tp are given by the reciprocals of the sums of the at   0.05 level of significance, reject the null hypothe-
weights in each group, that is sis if the absolute value of Z exceeds 1.96. When there is
1 ,
only one group of studies, this test is identical to that de-
vi  _____
mi
i 1, . . . , p. (15.10) scribed in chapter 14 of this volume using the statistic Z
w ij given in 14.5. __
j 1
Confidence intervals for the group mean effect 
__ i
can
Similarly, the sampling variance v of the grand be computed by multiplying the standard error vi by the
weighted mean is given by the reciprocal of the sum of all appropriate two-tailed critical value of the standard nor-
the weights or mal distribution (C  1.96 for   0.05 and 95 percent
confidence intervals) then adding and subtracting
__ this
1
v  _______. (15.11)
p m i amount from the weighted mean effect size__ Ti. Thus the
w
i 1 j1
ij 100(1  ) percent confidence interval for i is given by
__ ___ __ __ ___
__ The standard errors of
__ the group mean effect estimates T i  C vi   i  T i  C vi . (15.13)
Ti and the grand mean T are just the square roots of their
respective sampling variances. Example. We now continue the analysis of the stan-
15.2.1.6
__ Tests__and confidence intervals The group dardized mean differences from studies of gender differ-
means T1, . . . , Tp are normally __distributed__ about the ences in conformity of data set I. The effect-size estimate
respective effect-size parameters 1 , . . . , p that they Tij for each study, its variance vij, the weight wij  1/vij, wij
estimate. The fact that these means are normally distrib- Tij, and wij Tij2 (which will be used later) are given in table
uted with the variances given in equation 15.10 leads to 15.3. Using the sums for each group from table 15.4 __ __ the
rather straightforward procedures for constructing tests weighted
__ mean effect sizes for the three classes T1, T2,
and confidence intervals.__ For example, to test whether the and T3 are given by
ith group mean effect i differs __ from a predefined con- __
stant 0 (for example, to test if i  0  0) by testing the T 1  9.395/64.505  0.15,
null hypothesis __
T 2  13.640/45.466  0.30,
__ __
H0i :  i  0, T 3  57.458/169.417  0.34
use the statistic and the weighted grand mean effect size is
__
T i  0
Zi  ______
___ (15.12) __
vi T  34.423/279.388  0.123.
286 STATISTICALLY COMBINING EFFECT SIZES
__ __ __
The variances v1, v2, and v3 of T1, T2, and T3 are pattern of mean differences that might be present. For ex-
given by ample, the QB statistic might reveal variation in mean ef-
fects when the effects were grouped according to type of
v1  1/64.505  0.015, treatment, but the omnibus statistic gives no insight about
which types of treatment (which groups) were associated
v2  1/45.466  0.022,
with the largest effect size. In other cases, the omnibus
v3  1/169.417  0.006, test statistic may not be significant but we may want to
__ test for a specific a priori difference that the test might not
and the variance v of T is have been powerful enough to detect. In conventional
analysis of variance, contrasts or comparisons are used to
v  1/279.388  0.00358. explore the differences among group means. Contrasts
can be used in precisely the same way to examine pat-
Using formula 15.9 with C.05  1.96, the limits of the terns among group mean effect sizes in meta-analysis. In
95 percent
__ confidence interval for the group mean param- fact, all of the strategies used for selecting contrasts
eter 1 are given by in ANOVA (such as orthogonal polynomials to estimate
_____ trends, Helmert contrasts to discover discrepant groups)
0.15  1.960.015  0.15  0.24. are also applicable in meta-analysis.
__ A contrast parameter is just a linear combination of
Thus the 95 percent confidence interval for 1 is group means
given by __ __
__
  c1  1  . . .  cp  p. (15.14)
0.39   1  0.09.
where the coefficients c1, . . . , cp (called the contrast co-
Because this confidence interval contains zero, or al- efficients) are known constants that satisfy the constraint
ternately, because the test statistic c1  . . .  cp  0 and are chosen so that the value of the
contrast will reflect a particular comparison or pattern of
_____
Z  0.15 / 0.015 1.96, interest. For example, the coefficients c1  1, c2  1,
c3  . . .  cp  0 might be chosen so that the__ value of the
__
we cannot reject the hypothesis that 1  0 at the   0.05 contrast is the difference
__ between the mean 1 of group 1
level of significance. Similarly 95 percent and the mean 2 of group 2. Sometimes we refer to a
__ confi__dence
intervals for the group mean parameters 2 and 3 are contrast among population means as a population con-
given by trast or a contrast parameter to emphasize that it is a func-
tion of population parameters and to distinguish it from
_____
0.59  0.30  1.96 0.022 estimates of the contrast. The contrast parameter speci-
__ _____ fied by coefficients c1, . . . , cp is usually estimated by a
  2  0.30  1.96 0.022  0.01 sample contrast
__ __
and g  c1 T 1  . . .  cp T p . (15.15)
_____
0.19  0.34  1.96 0.006 The estimated contrast g has a normal sampling distri-
__ _____ bution with variance vg given by
  3  0.34  1.96 0.006  0.49.
vg  c12 v1  . . .  cp2 vp . (15.16)
Thus we see that the mean effect size for group 2 is sig-
nificantly less than zero, for group 3 is significantly greater This notation for contrasts suggests that the contrasts
than zero, and for group 1 is not different from zero. compare group mean effects, but they can in fact be used
15.2.1.7 Comparisons or Contrasts Among Mean to compare individual studies (groups consisting of a
Effects Omnibus tests for differences among group single study) or a single study with a group mean. All that
means can reveal that the mean effect parameters are not is required is the appropriate definition of the groups in-
all the same, but are not useful for revealing the specific volved.
ANALYZING EFFECT SIZES: FIXED-EFFECTS MODELS 287

Confidence Intervals and Tests of Significance. Be- procedures to contrasts in meta-analysis (see also Hedges
cause the estimated contrast g has a normal distribution and Olkin 1985).
with known variance vg, confidence intervals and tests of The simplest simultaneous test procedure is called the
statistical significance are relatively easy to construct. Bonferroni method. This exacts a penalty for simultane-
Note, however, that just as with contrasts in ordinary ous testing by requiring a higher level of significance from
analysis of variance, test procedures differ depending on whether the contrasts were planned or were selected using information from the data. Procedures for testing planned comparisons and for constructing nonsimultaneous confidence intervals are given in this section. Procedures for testing post hoc contrasts (contrasts selected using information from the data) and simultaneous confidence intervals are given in the next section.

Planned Comparisons. Confidence intervals for the contrast parameter γ are computed by multiplying the standard error of g, √vg, by the appropriate two-tailed critical value of the standard normal distribution (Cα = 1.96 for α = 0.05 and 95 percent confidence intervals) and adding and subtracting this amount from the estimated contrast g. Thus the 100(1 − α) percent confidence interval for the contrast parameter γ is

g − Cα√vg ≤ γ ≤ g + Cα√vg. (15.17)

Alternatively, a (two-sided) test of the null hypothesis that γ = 0 uses the statistic

X² = g²/vg. (15.18)

If X² exceeds the 100(1 − α) percent point of the chi-square distribution with one degree of freedom, reject the hypothesis that γ = 0 and declare the contrast to be significant at the level of significance α.

Post Hoc Contrasts and Simultaneous Tests. Situations often occur where several contrasts among group means are of interest. If several tests are made at the same nominal significance level α, the chance that at least one of the tests will reject (when all of the relevant null hypotheses are true) is generally greater than α and can be considerably greater if the number of tests is large. Similarly, the probability that tests will reject may also be greater than the nominal significance level for contrasts that are selected because they appear to stand out when examining the data. Simultaneous and post hoc testing procedures are designed to address these problems by ensuring that the probability of at least one type I error is controlled at a preset significance level α. Many simultaneous test procedures have been developed (see Miller 1981). We now discuss the application of two of these procedures.

The first, the Bonferroni method, requires a more stringent significance level for each individual contrast for it to be declared significant in the simultaneous test. If a number L > 1 of contrasts are to be tested simultaneously at level α, the Bonferroni test requires that any contrast be significant at (nonsimultaneous) significance level α/L to be declared significant at level α in the simultaneous analysis. For example, if L = 5 contrasts were tested at simultaneous significance level α = 0.05, any one of the contrasts would have to be individually significant at the 0.01 = 0.05/5 level (that is, X² would have to exceed 6.635, the 99 percent point of the chi-square distribution with df = 1) to be declared significant at the α = 0.05 level by the simultaneous test.

The Bonferroni method can be used as a post hoc test if the number of contrasts L is chosen as the number that could have been conducted. This procedure works well when all contrasts conducted are chosen from a well-defined class. For example, if there are four groups, there are six possible pairwise contrasts, so the Bonferroni method is applied to any pairwise contrast chosen post hoc by treating it as one of six contrasts examined simultaneously. If the number L of comparisons (or possible comparisons) is large, the Bonferroni method can be quite conservative given that it rejects only if a contrast has a very low significance value.

An alternative test procedure, a generalization of the Scheffé method from the analysis of variance, can be used for both post hoc and simultaneous testing (see Hedges and Olkin 1985). It consists of computing the statistic X² given in equation 15.18 for each contrast and rejecting the null hypothesis whenever X² exceeds the 100(1 − α) percent point of the chi-square distribution with L* degrees of freedom, where L* is the smaller of L (the number of contrasts) or p − 1 (the number of groups minus one).

When the number of contrasts (or potential contrasts) is small, simultaneous tests based on the Bonferroni method will usually be more powerful. When the number of contrasts is large, the Scheffé method will usually be more powerful.
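The planned-comparison computations above are simple enough to script. The following minimal sketch, in Python with NumPy and SciPy (assumptions of convenience; the chapter itself does not use these tools), computes a contrast estimate, its variance, the confidence interval of equation 15.17, and the test statistic of equation 15.18. The function names are illustrative.

import numpy as np
from scipy.stats import chi2, norm

def contrast_estimate(c, group_means, group_variances):
    # g = sum of c_i * mean_i; v_g = sum of c_i^2 * v_i
    c = np.asarray(c, dtype=float)
    g = float(np.sum(c * np.asarray(group_means, dtype=float)))
    v_g = float(np.sum(c**2 * np.asarray(group_variances, dtype=float)))
    return g, v_g

def planned_test(g, v_g, alpha=0.05, n_contrasts=1):
    # Confidence interval (eq. 15.17) and chi-square test (eq. 15.18).
    # With n_contrasts > 1, the level is Bonferroni-adjusted to alpha / L.
    c_crit = norm.ppf(1 - (alpha / n_contrasts) / 2)
    half_width = c_crit * np.sqrt(v_g)
    x2 = g**2 / v_g
    reject = x2 > chi2.ppf(1 - alpha / n_contrasts, df=1)
    return (g - half_width, g + half_width), x2, reject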
Example. Continuing the analysis of standardized mean difference data on gender differences in conformity, recall that there are three groups of effects: those in which 25 percent, 50 percent, and 100 percent, respectively, of the authors are male. To contrast the mean of the effects of group 1 (25 percent male authors) with those of group 3 (100 percent male authors), use the contrast coefficients

c1 = −1.0, c2 = 0.0, c3 = 1.0.

The value of the contrast estimate g is

g = −1.0(−0.15) + 0.0(0.30) + 1.0(0.34) = 0.49,

with an estimated variance of

vg = (−1.0)²(0.015) + (0.0)²(0.022) + (1.0)²(0.006) = 0.021.

Hence a 95 percent confidence interval for γ is

0.21 = 0.49 − 1.96√0.021 ≤ γ ≤ 0.49 + 1.96√0.021 = 0.77.

Because this interval does not contain zero, or alternatively, because Z = 0.49/√0.021 = 3.38 exceeds 1.96, we reject the hypothesis that γ = 0 and declare the contrast statistically significant at the α = 0.05 level.

Had we picked the contrast above after examining the data, it would have been a post hoc contrast, and testing the contrast using a post hoc test would have been appropriate. Specifically, because there are a total of L = 3 pairwise contrasts among three group means, we might have carried out an α = 0.05 level post hoc test using the Bonferroni method and compared Z = 3.38 with 2.39, the α = 0.05/3 = 0.0167 critical value of the standard normal distribution. Because the obtained value of the test statistic, Z = 3.38, exceeds the critical value 2.39, we would still reject the hypothesis that γ = 0. Alternatively, we might have used the Scheffé method and compared X² = Z² = 3.38² = 11.43 with 5.99, the critical value of the chi-square distribution with p − 1 = 2 degrees of freedom. Because the obtained value of the test statistic, X² = 11.43, exceeds the critical value 5.99, we would, using the Scheffé method, still reject the hypothesis that γ = 0.

15.2.2 Multifactor Models

A multifactor analysis of variance for effect sizes exists and has the same relationship to conventional multifactor analysis of variance as the one-factor analysis for effect sizes has to one-factor analysis of variance. The principal difference is that, because of the weighting involved, multifactor analyses for effect sizes are essentially always unbalanced. The details of the computations are therefore rather involved and the formulas for the omnibus test statistics not intuitively appealing. They are therefore not presented here. Computations of multifactor analyses for effect sizes, like unbalanced analyses of variance in primary research, will almost always be carried out via packaged computer programs. The test statistics for main effects and interactions are the weighted sums of squares, which have chi-square distributions with the conventional degrees of freedom under the corresponding null hypotheses. Procedures for estimating group means and contrasts are exactly the same in the multifactor case as in the one-factor case.

15.2.2.1 Computation of Omnibus Test Statistics Suppose a two-factor model with two categorical variables. In this case, each of the omnibus test statistics is obtained by computing a weighted analysis of variance on the effect estimates (for example, by using SAS, SPSS, or STATA). Let Tijk be the kth effect estimate at the ith level of the first factor and the jth level of the second factor, and let its sampling variance be vijk. Then the weight wijk corresponding to the estimate Tijk is

wijk = 1/vijk.

The omnibus test statistics used in the analysis are the weighted sums of squares. When the corresponding null hypotheses are true, these sums have chi-square distributions with the number of degrees of freedom indicated by the computer program. Thus, for a two-factor analysis, the test statistics for the first factor, the second factor, the interaction, and the within-cell homogeneity are the corresponding weighted sums of squares. The mean squares and F-test statistics are not used in this analysis. Similarly, the cell means and standard deviations reported by the computer program are generally not useful. Thus the ANOVA computer program is used exclusively to compute the weighted sums of squares that are the omnibus test statistics. Weighted means for each cell and for each factor level should be computed separately using the methods discussed in connection with the one-factor analysis.

15.2.2.2 Multifactor Models with Three or More Factors Procedures for multifactor models with more than two factors are analogous to those in two-factor models.
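As a check on the arithmetic of the example above, the same quantities can be reproduced numerically. This sketch makes the same assumptions as the earlier one (Python with NumPy and SciPy) and uses the group means and variances read off the computations in the text.

import numpy as np
from scipy.stats import chi2, norm

c = np.array([-1.0, 0.0, 1.0])              # group 1 versus group 3
means = np.array([-0.15, 0.30, 0.34])       # weighted group mean effects
variances = np.array([0.015, 0.022, 0.006])

g = float(np.sum(c * means))                # 0.49
v_g = float(np.sum(c**2 * variances))       # 0.021
z = g / np.sqrt(v_g)                        # about 3.38

z_bonf = norm.ppf(1 - (0.05 / 3) / 2)       # about 2.39 for L = 3 contrasts
x2_scheffe = chi2.ppf(0.95, df=2)           # 5.99 for L* = p - 1 = 2
print(z > z_bonf, z**2 > x2_scheffe)        # True, True: reject gamma = 0 either way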
Omnibus tests for main effects, interactions, and within-cell variation are carried out using the weighted sums of squares as the test statistics with appropriate degrees of freedom as indicated by the computer program. The interpretation of main effects and interactions is analogous to that in conventional analyses of variance in primary research. As in primary analysis, examining a table of cell means (or a cross-tabulation) before conducting a two-way or multiway ANOVA is recommended, because of the possibility of having empty cells.

15.3 MULTIPLE REGRESSION ANALYSIS FOR EFFECT SIZES

It is often desirable to represent the characteristics of research studies by continuously coded variables or by a combination of discrete and continuous variables. In such cases, the synthesist often wants to determine the relationship between these continuous variables and effect size. One very flexible analytic procedure for investigating these relationships is an analogue to multiple regression analysis for effect sizes (see Hedges 1982c, 1983b; Hedges and Olkin 1985). These methods share the generality and ease of use of conventional multiple regression analysis, and like their conventional counterparts they can be viewed as including ANOVA models as a special case. In the recent medical literature, such methods have been called meta-regression.

15.3.1 Models and Notation

Suppose that we have k independent effect size estimates T1, . . . , Tk with (estimated) sampling variances v1, . . . , vk. The corresponding effect-size parameters are θ1, . . . , θk. Suppose also that there are p known predictor variables X1, . . . , Xp which are believed to be related to the effects through a linear model of the form

θi = β0 + β1xi1 + . . . + βpxip, (15.19)

where xi1, . . . , xip are the values of the predictor variables X1, . . . , Xp for the ith study (that is, xij is the value of Xj for study i), and β0, β1, . . . , βp are unknown regression coefficients.

15.3.2 Estimation and Significance Tests

Estimation is usually carried out using weighted least squares algorithms. The formulas for estimators and test statistics can be expressed most succinctly in matrix notation (see Hedges and Olkin 1985). The analysis can be conducted using standard computer programs such as SAS, SPSS, or STATA that compute weighted multiple regression analyses. The regression should be run with the effect estimates as the dependent variable and the predictor variables as independent variables, with weights defined by the reciprocal of the sampling variances. That is, the weight for Ti is wi = 1/vi.

Standard computer programs for weighted regression analysis produce the correct (asymptotically efficient) estimates b0, b1, . . . , bp of the unstandardized regression coefficients β0, β1, . . . , βp. Note that, unlike the SPSS computer program, we use the symbols β0, β1, . . . , βp to refer to the population values of the unstandardized regression coefficients, not to the standardized sample regression coefficients. Although these programs give the correct estimates of the regression coefficients, the standard errors and significance values are based on a slightly different model than that used for fixed-effects meta-analysis and are incorrect for the meta-analysis model. Calculating the correct significance tests for individual regression coefficients requires some straightforward hand computations from information given in the computer output.

The correct fixed-effects standard error Sj of the estimated coefficient bj is simply

Sj = SEj /√MSERROR, (15.20)

where SEj is the standard error of bj as given by the computer program and MSERROR is the error or residual mean square from the analysis of variance for the regression as given by the computer program. Alternatively, the correct standard errors of b0, b1, . . . , bp are the square roots of the diagonal elements of the inverse of the (X′WX) matrix, which is sometimes called the inverse of the weighted sum of squares and cross-products matrix. Many regression analysis programs, such as SAS Proc GLM, will print this matrix as an option. The (correct) standard errors obtained from both methods are, of course, identical. The methods are simply alternative ways to compute the same thing.

The regression coefficient estimates (the bj's) are normally distributed about their respective parameter values (the βj's) with standard deviations given by the standard errors (the Sj's).
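A compact way to obtain both the weighted least squares estimates and the correct fixed-effects standard errors at once is to work directly with the (X′WX) inverse matrix described above. The following sketch does this in Python with NumPy (an assumption of convenience; the chapter itself works with SAS, SPSS, or STATA output).

import numpy as np

def meta_regression_fixed(T, v, X):
    # T: k effect estimates; v: k sampling variances; X: k-by-p predictor
    # matrix (without the intercept column, which is added here).
    T = np.asarray(T, dtype=float)
    W = np.diag(1.0 / np.asarray(v, dtype=float))   # weights w_i = 1 / v_i
    Xd = np.column_stack([np.ones(len(T)), np.asarray(X, dtype=float)])
    xtwx_inv = np.linalg.inv(Xd.T @ W @ Xd)         # (X'WX) inverse
    b = xtwx_inv @ Xd.T @ W @ T                     # WLS estimates b_0, ..., b_p
    se = np.sqrt(np.diag(xtwx_inv))                 # correct fixed-effects SEs
    return b, se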
A 100(1 − α) percent confidence interval for each βj can thus be obtained by multiplying Sj by the two-tailed critical value Cα of the standard normal distribution (for α = 0.05, Cα = 1.96) and then adding and subtracting this product from bj. Thus the 100(1 − α) percent confidence interval for βj is

bj − CαSj ≤ βj ≤ bj + CαSj. (15.21)

A two-sided test of the null hypothesis that the regression coefficient is zero,

H0: βj = 0,

at significance level α consists of rejecting H0 if

|Zj| = |bj|/Sj (15.22)

exceeds the 100α percent two-tailed critical value Cα of the standard normal distribution.

Example. Consider the example of the standardized mean differences for gender differences in conformity given in data set I. In this analysis, we fit the linear model suggested by Betsy Becker (1986), who explained variation in effect sizes by a predictor variable that was the natural logarithm of the number of items on the conformity measure (see column 4 of table 15.3). This predictor variable is highly correlated with the percentage of male authors used as a predictor in the example given for the categorical model analysis. Using SAS Proc GLM with the effect sizes and weights given in table 15.3, we computed a weighted regression analysis. The estimates of the regression coefficients were b0 = −0.323 for the intercept and b1 = 0.210 for the effect of the number of items. The standard errors of b0 and b1 could be computed in either of two ways. The (X′WX) inverse matrix computed by SAS was

0.01219   −0.00406
−0.00406   0.00191

and hence the standard errors can be computed as

S0 = √0.01219 = 0.110,
S1 = √0.00191 = 0.044.

Alternatively, we could have obtained the standard errors by correcting the standard errors printed by the program (which are incorrect for our purposes). The standard errors printed by the SAS program were SE(b0) = 0.1151 and SE(b1) = 0.0456, and the residual mean square from the analysis of variance for the regression was MSERROR = 1.086. Using formula 15.20 gives

S0 = 0.1151/√1.086 = 0.110,
S1 = 0.0456/√1.086 = 0.044.

A 95 percent confidence interval for the effect β1 of the number of items, using C0.05 = 1.96, S1 = 0.044, and formula 15.21, is given by 0.210 − 1.96 × 0.044 ≤ β1 ≤ 0.210 + 1.96 × 0.044, or

0.12 ≤ β1 ≤ 0.30.

Because the confidence interval does not contain zero, or alternatively, because the statistic Z1 = 0.210/0.044 = 4.77 exceeds 1.96, we reject the hypothesis that there is no relationship between the number of items on the conformity measures and effect size. Thus the number of items on the response measure has a statistically significant relationship to effect size.
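The two routes to the standard errors in this example can be verified directly from the numbers printed in the text; the following few lines are only a numerical check, under the same Python assumptions as the earlier sketches.

import numpy as np

S = np.sqrt([0.01219, 0.00191])                       # from (X'WX) inverse: 0.110, 0.044
S_alt = np.array([0.1151, 0.0456]) / np.sqrt(1.086)   # formula 15.20: same values

b1, S1 = 0.210, 0.044
ci = (b1 - 1.96 * S1, b1 + 1.96 * S1)                 # formula 15.21: about (0.12, 0.30)
z1 = b1 / S1                                          # 4.77, compare with 1.96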
15.3.3 Omnibus Tests

It is sometimes desirable to test hypotheses about groups or blocks of regression coefficients. For example, stepwise regression strategies may involve entering one block of predictor variables (such as a set of variables reflecting methodological characteristics) and then entering another block of predictor variables (such as a set of variables reflecting treatment characteristics) to see if the second block of variables explains any of the variation in effect size not accounted for by the first block of variables. Formally, we need a test of the hypothesis that all of the regression coefficients for predictor variables in the second block are zero.

Suppose that the a predictor variables X1, . . . , Xa have already been entered and we wish to test whether the regression coefficients for a block of b additional predictor variables Xa+1, Xa+2, . . . , Xa+b are simultaneously zero. That is, we wish to test

H0: βa+1 = . . . = βa+b = 0.

The test statistic is the weighted sum of squares for the addition of this block of variables. It can be obtained directly as the difference in the weighted error sum of squares for the model with a predictors and the weighted error sum of squares of the model with (a + b) predictors. It can also be computed from the output of the weighted stepwise regression as

QCHANGE = b FCHANGE MSERROR, (15.23)

where FCHANGE is the value of the F-test statistic for testing the significance of the addition of the block of b predictor variables and MSERROR is the weighted error or residual mean square from the analysis of variance for the regression. The test at significance level α consists of rejecting H0 if QCHANGE exceeds the 100(1 − α) percent point of the chi-square distribution with b degrees of freedom.

If the number k of effects exceeds (p + 1), the number of predictors plus the intercept, then a test of goodness of fit or model specification is possible. The test is formally a test of the null hypothesis that the population effect sizes θ1, . . . , θk are exactly determined by the linear model

θi = β0 + β1xi1 + . . . + βpxip, i = 1, . . . , k,

versus the alternative that some of the variation in the θi is not fully explained by X1, . . . , Xp. The test statistic is the weighted residual sum of squares QE about the regression line, and the test can be viewed as a test for greater than expected residual variation. This statistic is given in the analysis of variance for the regression and is usually called the error or residual sum of squares on computer printouts. The test at significance level α consists of rejecting the null hypothesis of model fit if QE exceeds the 100(1 − α) percent point of the chi-square distribution with (k − p − 1) degrees of freedom.

The tests of homogeneity of effect and tests of homogeneity of effects within groups of independent effects described in connection with the analysis of variance for effect sizes are special cases of the test of model fit given here. That is, the statistic QE reduces to the statistic Q given in formula 15.6 when there are no predictor variables, and QE reduces to the statistic QW given in formula 15.7 when the predictor variables are dummy coded to represent group membership.

Example. Continue the example of the regression analysis of the standardized mean differences for gender differences in conformity, using SAS Proc GLM to compute a weighted regression of effect size on the logarithm of the number of items on the conformity measure. Although we can illustrate the test for the significance of blocks of predictors, there is only one predictor. We start with a = 0 predictors and add b = 1 predictor variable. The weighted sum of squares for the regression in the analysis of variance for the regression gives QCHANGE = 23.11. We could also have computed QCHANGE from the F-test statistic FCHANGE for the R-squared change and the MSERROR for the analysis of variance for the regression. Here FCHANGE = 21.29 and MSERROR = 1.086, so the result using formula 15.23,

QCHANGE = 1(21.29)(1.086) = 23.12,

is essentially identical to that obtained directly (the two values disagree only because of rounding). Comparing 23.12 to 3.84, the 95 percent point of the chi-square distribution with 1 degree of freedom, we reject the hypothesis that the (single) predictor is unrelated to effect size. This is, of course, the same result obtained by the test for the significance of the regression coefficient.

We can also test the goodness of fit of the regression model. The weighted residual sum of squares was computed by SAS Proc GLM as QE = 8.69. Comparing this value to 15.51, the 95 percent point of the chi-square distribution with 10 − 2 = 8 degrees of freedom, we see that we cannot reject the fit of the linear model. In fact, chi-square values as large as 8.69 would occur between 25 and 50 percent of the time due to chance if the model fit exactly.
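Both omnibus tests in this example reduce to chi-square comparisons that are easy to script; the sketch below uses the values reported above and SciPy (an assumption of convenience) only to look up the reference distribution.

from scipy.stats import chi2

# Block test (formula 15.23): Q_CHANGE = b * F_CHANGE * MS_ERROR.
q_change = 1 * 21.29 * 1.086                 # 23.12
print(q_change > chi2.ppf(0.95, df=1))       # True: predictor related to effect size

# Goodness of fit: Q_E against chi-square with k - p - 1 = 10 - 1 - 1 = 8 df.
q_e = 8.69
print(q_e > chi2.ppf(0.95, df=8))            # False: model fit not rejected
print(1 - chi2.cdf(q_e, df=8))               # about 0.37, as the text indicates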
15.3.4 Collinearity

All of the problems that arise in connection with multiple regression analysis can also arise in meta-analysis. Collinearity may degrade the quality of estimates of regression coefficients in primary research studies, wildly influencing their values and increasing their standard errors. The same procedures used to safeguard against excessive collinearity in multiple regression analysis in primary research are useful in meta-analysis. Examination of the correlation matrix of the predictors, and the exclusion of some predictors that are too highly intercorrelated with the others, can often be helpful. In some cases, predictor variables derived from critical study characteristics may be too highly correlated for any meaningful analysis using more than a very few predictors. It is important to recognize, however, that collinearity is a limitation of the data (reflecting little information about the independent relations among variables) and not an inadequacy of the statistical method. Highly collinear predictors based on study characteristics thus imply that the studies simply do not have the array of characteristics that might make it possible to ascertain precisely their joint relationship with study effect size. In our example about gender differences in conformity, the correlation coefficient between percent of male authors and number of items is 0.66. The correlation between percent of male authors and log items is even higher, 0.79.

15.4 QUANTIFYING EXPLAINED VARIATION

Although the QE statistic (or the QW statistic for models with categorical independent variables) is a useful test statistic for assessing whether there is any statistically reliable unexplained variation, it is not a useful descriptive statistic for quantifying the amount of unexplained variation. Quantifying the amount of unexplained variation typically involves imposing a model of randomness on this unexplained variation. This is equivalent to imposing a mixed model on the data to describe unexplained variation. The details of such models are beyond the scope of this chapter (for a discussion, see chapter 16 of this volume).

A descriptive statistic RB, called the Birge ratio, has long been used in the physical sciences and was recently proposed as a descriptive statistic for quantifying unexplained variation in medical meta-analysis with no covariates, where it was called H² (see Birge 1932; Higgins and Thompson 2002). It is the ratio of QE (or QW) to its degrees of freedom, that is, RB = QE/(k − p − 1) (for a regression model with intercept) or RB = QW/(k − p) (for a categorical model). The expected value of RB is exactly 1 when the effect parameters are determined exactly by the linear model. When the model does not fit exactly, RB tends to be larger than one. The Birge ratio has the crude interpretation that it estimates the ratio of the between-study variation in effects to the variation in effects due to (within-study) sampling error. Thus a Birge ratio of 1.5 suggests 50 percent more between-study variation than might be expected given the within-study sampling variance.

The squared multiple correlation between the observed effect sizes and the predictor variables is sometimes used as a descriptive statistic. However, the multiple correlation may be misinterpreted in this context because the maximum value of the population multiple correlation is always less than, and can be much less than, one. The reason is that the squared multiple correlation is a measure of variance (in the observed effect-size estimates) accounted for by the predictors. But there are two sources of variation in the effect-size estimates, between-study (systematic) effects and within-study (nonsystematic or sampling) effects. We might write this partitioning of variation symbolically as

Var[T] = τ² + v,

where Var[T] is the total variance, τ² is the between-study variance in the effect-size parameters, and v is the within-study variance (the variance of the sampling errors). Only between-study effects are systematic, and therefore only they can be explained using predictor variables. Variance due to within-study sampling errors cannot be explained. Consequently, the maximum proportion of variance that could be explained is determined by the proportion of total variance attributable to between-study effects. Thus the maximum possible value of the squared multiple correlation could be expressed (loosely) as

τ²/(τ² + v) = τ²/Var[T].

Clearly, this ratio can be quite small when the between-studies variance is small relative to the within-study variance. For example, if the between-study variance (component) τ² is 50 percent of the (average) within-study variance, the maximum squared multiple correlation would be

0.5/(0.5 + 1.0) = 0.33.

In this example, predictors that yielded an R² of 0.30 would have explained 90 percent of the explainable variance even though they explain only 30 percent of the total variance in effect estimates.

A better measure of explained variance than the conventional R² would be based on a comparison of the between-studies variance in the effect-size parameters in a model with no predictors and that in a model with predictors. If τ0² is the variance in the effect-size parameters in a model with no predictors and τ1² is the variance in the effect-size parameters in a model with predictors, then the ratio

P²MA = (τ0² − τ1²)/τ0² = 1 − τ1²/τ0²

of the explained variance to the total variance in the effect-size parameters is an analogue to the usual concept of squared multiple correlation in the meta-analytic context. Such a concept is widely used in the general mixed model in primary data analysis (see Bryk and Raudenbush 1992, 65).
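Both descriptive statistics in this section take one line each to compute. In the sketch below, the Birge-ratio example uses the QE value from the conformity analysis above, while the τ0² and τ1² values passed to the P²MA function are purely hypothetical illustrations, not estimates from the chapter.

def birge_ratio(q_e, k, p):
    # R_B = Q_E / (k - p - 1) for a regression model with an intercept.
    return q_e / (k - p - 1)

def p2_ma(tau2_0, tau2_1):
    # Proportion of explainable variance explained: 1 - tau1^2 / tau0^2.
    return 1.0 - tau2_1 / tau2_0

print(birge_ratio(q_e=8.69, k=10, p=1))      # about 1.09
print(p2_ma(tau2_0=0.020, tau2_1=0.002))     # 0.90 under these assumed values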
The parameter P²MA is interpretable as the proportion of explainable variance that is explained in the meta-analytic model.

15.5 CONCLUSION

Fixed-effects approaches to meta-analysis provide a variety of techniques for statistically analyzing effect sizes. These techniques are analogous to fixed-effects statistical methods commonly used in the analysis of primary data: the analysis of variance and multiple regression analysis. Consequently, familiar analysis strategies, such as contrasts from analysis of variance, or coding methods, such as dummy or effect coding from multiple regression analysis, can be used in meta-analysis just as they are in primary analyses. Fixed-effects analyses, however, may not be suitable for all purposes in meta-analysis. One major consideration in the choice to use fixed-effects analyses is the nature of the generalizations desired. Fixed-effects methods are well suited to making conditional inferences about the effects in the studies that have been observed. They are less well suited to making generalizations to populations of studies from which the observed studies are a sample.

15.6 REFERENCES

Becker, Betsy J. 1986. "Influence Again: An Examination of Reviews and Studies of Gender Differences in Social Influence." In The Psychology of Gender: Advances Through Meta-Analysis, edited by Janet S. Hyde and Marcia C. Linn. Baltimore, Md.: Johns Hopkins University Press.
Birge, Raymond T. 1932. "The Calculation of Errors by the Method of Least Squares." Physical Review 40(2): 207–27.
Bryk, Anthony S., and Steven S. Raudenbush. 1992. Hierarchical Linear Models. Thousand Oaks, Calif.: Sage Publications.
Eagly, Alice H., and Linda L. Carli. 1981. "Sex of Researchers and Sex Typed Communication as Determinants of Sex Differences in Influencability: A Meta-Analysis of Social Influence Studies." Psychological Bulletin 90(1): 1–20.
Hedges, Larry V. 1982b. "Fitting Categorical Models to Effect Sizes from a Series of Experiments." Journal of Educational Statistics 7(2): 119–37.
———. 1982c. "Fitting Continuous Models to Effect Size Data." Journal of Educational Statistics 7(4): 245–70.
———. 1983a. "A Random Effects Model for Effect Sizes." Psychological Bulletin 93(2): 388–95.
———. 1983b. "Combining Independent Estimators in Research Synthesis." British Journal of Mathematical and Statistical Psychology 36: 123–31.
Hedges, Larry V., and Ingram Olkin. 1985. Statistical Methods for Meta-Analysis. Orlando, Fl.: Academic Press.
Hedges, Larry V., and Jack L. Vevea. 1998. "Fixed and Random Effects Models in Meta-Analysis." Psychological Methods 3(4): 486–504.
Higgins, Julian P.T., and Simon G. Thompson. 2002. "Quantifying Heterogeneity in Meta-Analysis." Statistics in Medicine 21(11): 1539–58.
Miller, Rupert G., Jr. 1981. Simultaneous Statistical Inference, 2nd ed. New York: Springer-Verlag.
16
ANALYZING EFFECT SIZES: RANDOM-EFFECTS MODELS

STEPHEN W. RAUDENBUSH
University of Chicago
CONTENTS

16.1 Introduction 296
16.2 Rationale 297
16.2.1 The Classical View 297
16.2.2 The Bayesian View 298
16.3 The Model 298
16.3.1 Level-1 Model 298
16.3.2 Level-2 Model 299
16.3.3 Mixed Model 299
16.4 Data 299
16.5 The Logic of Inference 300
16.5.1 Do the Effect Sizes Vary? 300
16.5.2 How Large Is the Variation in True Effect Sizes? 302
16.5.2.1 A Variation to Signal Ratio 302
16.5.2.2 Plausible Value Intervals 302
16.5.2.3 Intra-Class Correlation 302
16.5.3 How Can We Make Valid Inferences About the Average Effect Size When the True Effect Sizes Vary? 303
16.5.4 Why Do Study Effects Vary? 303
16.5.5 How Effective Are Such Models in Accounting for Effect Size Variation? 304
16.5.6 What Is the Best Estimate of the Effect for Each Study? 304
16.5.6.1 Unconditional Shrinkage Estimation 304
16.5.6.2 Conditional Shrinkage Estimation 306
16.6 Summary and Conclusions 306
16.6.1 Advantages of the Approach 306
16.6.2 Threats to Valid Statistical Inference 307
16.6.2.1 Uncertainty About the Variance 307
16.6.2.2 Failure of Parametric Assumptions 307
16.6.2.3 Problems of Model Misspecification and Capitalizing on Chance 307
16.6.2.4 Multiple Effect Sizes 308
16.6.2.5 Inferences About Particular Effect Sizes 308
16A Alternative Approaches to Point Estimation 308
16A.1 Full Maximum Likelihood 308
16A.2 Restricted Maximum Likelihood 310
16A.3 Method of Moments 311
16A.3.1 MOM Using OLS Regression at Stage 1 311
16A.3.2 MOM Using WLS with Known Weights 311
16A.4 Results 312
16B Alternative Approaches to Uncertainty Estimation 312
16B.1 Conventional Approach 312
16B.2 A Quasi-F Statistic 312
16B.3 Huber-White Variances 313
16.7 Notes 314
16.8 References 314
16.1 INTRODUCTION

This volume considers the problem of quantitatively summarizing results from a stream of studies, each testing a common hypothesis. In the simplest case, each study yields a single estimate of the impact of some intervention. Such an estimate will deviate from the true effect size as a function of random error because each study uses a finite sample size. What is distinctive about this chapter is that the true effect size itself is regarded as a random variable taking on different values in different studies, based on the belief that differences between the studies generate differences in the true effect sizes. This approach is useful in quantifying the heterogeneity of effects across studies, incorporating such variation into confidence intervals, testing the adequacy of models that explain this variation, and producing accurate estimates of effect size in individual studies.

After discussing the conceptual rationale for the random effects model, this chapter provides a general strategy for answering a series of questions that commonly arise in research synthesis:

1. Does a stream of research produce heterogeneous results? That is, do the true effect sizes vary?
2. If so, how large is this variation?
3. How can we make valid inferences about the average effect size when the true effect sizes vary?
4. Why do study effects vary? Specifically, do observable differences between studies in their target populations, measurement approaches, definitions of the treatment, or historical contexts systematically predict the effect sizes?
5. How effective are such models in accounting for effect size variation? Specifically, how much variation in the true effect sizes does each model explain?
6. Given that the effect sizes do indeed vary, what is the best estimate of the effect in each study?

I illustrate how to address these questions by re-analyzing data from a series of experiments on teacher expectancy effects on pupils' cognitive skill. My aim is to illustrate, in a comparatively simple setting, to a broad audience with a minimal background in applied statistics, the conceptual framework that guides analyses using random effects models and the practical steps typically needed to implement that framework.

Although the conceptual framework guiding the analysis is straightforward, a number of technical issues must
be addressed satisfactorily to ensure the validity of the inferences. To review these issues and recent progress in solving them requires a somewhat more technical presentation. Appendix 16A considers alternative approaches to estimation theory, and appendix 16B considers alternative approaches to uncertainty estimation, that is, the estimation of standard errors, confidence intervals, and hypothesis tests. These appendices together provide re-analyses of the illustrative data under alternative approaches, knowledge of which is essential to those who give technical advice to analysts.

16.2 RATIONALE

A skilled researcher wanting to assess the generalizability of findings from a single study will try to select a sample that represents the target population. The researcher will examine interaction effects, asking, for example, whether a new medical treatment's impact depends on the age of the patient, whether the effectiveness of a new method of instruction depends on student aptitude, whether men are as responsive as women to nonverbal communication. If such interactions are null, that is, if treatment effects do not depend on the age, aptitude, or sex of the participant, a finding is viewed as generalizing across these characteristics. If, on the other hand, such interactions are significant, the researcher can clarify the limits of generalization.

Inevitably, however, the capacity of a single study to clarify the generalizability of a finding is limited. Of necessity, many conditions that might affect a finding will be constant in any given study. The manner in which the treatment is implemented, the geographic or cultural setting, the instrumentation used, and the era during which an experiment is implemented are all potential moderators of study findings. Such moderators can rarely be examined in the context of a single study because these factors rarely vary within a single study.

Across a series of replicated or similar studies, however, such moderators will naturally tend to vary. For example, later in this chapter we reanalyze data from nineteen experiments testing the effects of teacher expectancy on pupil IQ. Several factors widely believed to influence results in this type of study varied across those studies: the timing and duration of the treatment, the conditions of pupil testing, the age of the students, the year the study was conducted, its geographic location and administrative context, the characteristics of the teachers expected to implement the treatment, and the socioeconomic status of the communities providing the setting. For any synthesis, a long list of possible moderators of effect size can be enumerated, many of which cannot be ascertained even by the most careful reading of each study's report.

The multiplicity of potential moderators underlies the concept of a study's true effect size as random. Indeed, the concept of randomness in nature arises from a belief that the outcome of a process cannot be predicted in advance, precisely because the sources of influence on the outcome are both numerous and unidentifiable. Yet two distinctly different statistical rationales, classical and Bayesian, provide quite different justifications for the random effects approach.

16.2.1 The Classical View

We may conceive of the set of studies in a synthesis as constituting a random sample from a larger population of studies. For some syntheses, this assumption is literally true. For example, because Robert Rosenthal and Donald Rubin located far more studies of interpersonal expectancy effects than they could code in any detail with available resources, they drew a random sample of studies for intensive analysis (1978).

One might also view the set of studies under synthesis as constituting a realization from a universe of possible studies, studies that realistically could have been conducted or that might be conducted in the future. Because one wishes to generalize to this larger universe of possible implementations of treatment or observations of a relationship, one conceives of the true effect size of each available study as one item from a random sample of all possible effect sizes. This notion of a hypothetical universe of possible observations is standard in measurement theory. Lee Cronbach and his colleagues put it this way: "A behavioral measurement is a sample from the measurements that might have been made, and interest attaches to the obtained score only because it is representative of the whole collection or universe" (1972, 18).

Once the set of studies under investigation is conceived as a random sample, it is clear that a meta-analysis can readily be viewed as a survey having a two-stage sampling procedure. At the first stage, one obtains a random sample of studies from a larger population of studies and regards these studies as varying in their true effects. At the second stage, within each study, one then obtains a random sample of subjects from a population of subjects.¹

The two-stage sampling design implies a model with two components of variance. The first component (random
effects variance) represents the variance that arises as a result of the sampling of studies, each having a unique effect size. The second component (estimation variance) arises because, for any given study, the effect size estimate is based on a finite sample of subjects.²

16.2.2 The Bayesian View

The Bayesian view avoids the specification of any sampling mechanism as a justification for the random effects model. Instead, the random variation of the true effect sizes represents the investigator's lack of knowledge about the process that generates them. Imagine that one could have observed the true effect size for a very large number, k, of related studies. A Bayesian might say that these studies are exchangeable (De Finetti 1964) in that she or he has a priori no reason to believe that one study's effect size should be any larger than any other study's effect size. The Bayesian investigator represents his or her uncertainty about the heterogeneity of the effect sizes with a prior distribution having a central tendency and a dispersion.

Just as the classical statistician might envision a synthesis as a two-stage sampling process, the Bayesian statistician envisions two components of uncertainty. At one level, there is the uncertainty that a particular true effect size will be near the prior mean, that is, the statistician's a priori belief about the central tendency of all true effect sizes. The degree of uncertainty about the true effect size is expressed quantitatively as a prior variance. At the next level, there is uncertainty that an estimated effect for any given study will be near that study's true effect size. This uncertainty is expressed as a within-study variance, the variability of the estimated effect size around the true effect size for each study.

The notion that the true effect sizes are exchangeable is functionally equivalent to the classical statistician's assumption that they constitute a random sample from a population of studies. However, exchangeability as a basis for a random effects model can be invoked without appealing to the notion of a random sampling mechanism. The Bayesian statistician's way of thinking about the variance of the estimated effect size around the true effect size is generally identical to the thinking of the classical statistician.

Suppose, however, that there is a priori reason to believe that some studies will have larger effects than others. For example, a treatment may have a longer duration in one set of studies than in another, and one may hypothesize that longer durations are associated with larger effect sizes. The exchangeability assumption would then be unwarranted. Both Bayesian and classical statisticians would then be inclined to incorporate such information into the model, as I illustrate in section 16.3.2.

Our discussion reveals that classical and Bayesian approaches to random effects meta-analysis assign different definitions to quantities like the mean and variance of the true random effects, but use those quantities in functionally similar ways. In interpreting the results presented here, you, the reader, are free to use either the classical conception (that the variance of the computed effect sizes represents a two-stage sampling process) or the Bayesian interpretation (that the variance of the computed effects reflects two sources of the investigator's uncertainty). I say more about the Bayesian approach in section 16.6.

16.3 THE MODEL

I represent the random effects model as a two-level hierarchical linear model (Raudenbush and Bryk 1985; 2002, chapter 7). At the first, or micro, level, the estimated effect size for study i varies randomly around the true effect size. This might be regarded as a measurement model in which the estimated effect size measures the true effect size with error. At the second, or macro, level, the true effect sizes vary from study to study as a function of observed study characteristics plus a random effect representing unobserved sources of heterogeneity of effect size. The model can be expanded at both levels: the first level can specify multiple effect sizes, each of which is explained by study characteristics at level 2 (Kalaian and Raudenbush 1996; see chapter 9, this volume). This chapter focuses on the case of a single effect size per study, however.

16.3.1 Level-1 Model

Suppose, then, that we have calculated an effect size estimate Ti of the true effect size θi for each of k studies, i = 1, . . . , k. The true effect size θi might be a mean difference, a standardized mean difference, a correlation (possibly transformed using the Fisher transformation), the natural logarithm of the odds ratio, a log-event rate, or some other meaningful indicator of effect magnitude. The estimate Ti is the corresponding sample statistic, sometimes corrected for bias. Chapters 12 and 13 of this volume present a number of useful measures of effect and their corresponding sample estimators and sampling variances.
If the sample size per study is moderate to large, assuming that Ti is normally distributed about θi with known variance vi is often reasonable. The form of vi depends on the effect indicator of interest, but generally diminishes at a rate proportional to 1/ni, where ni is the sample size in study i. We therefore write the level-1 model as

Ti = θi + ei,  ei ~ N(0, vi), (16.1)

where we assume that the errors of estimation ei are statistically independent, each with a mean of zero and known variance vi. The normality assumption for ei is often justifiable based on the central limit theorem (compare Casella and Berger 2001) as long as the sample size in each study is not too small.

16.3.2 Level-2 Model

We now formulate a prediction model for the true effects as depending on a set of study characteristics plus error:

θi = γ0 + γ1Xi1 + γ2Xi2 + . . . + γpXip + ui,  ui ~ N(0, τ²), (16.2)

where γ0 is the model intercept;

Xi1, . . . , Xip are coded characteristics of studies hypothesized to predict the study effect size θi;

γ1, . . . , γp are regression coefficients capturing the association between study characteristics and effect sizes; and

ui is the random effect of study i, that is, the deviation of study i's true effect size from the value predicted on the basis of the model. Each random effect ui is assumed independent of all other effects, with a mean of zero and variance τ².

16.3.3 Mixed Model

If we combine the two models by substituting the expression for θi in 16.2 into 16.1, we have a mixed effects linear model

Ti = γ0 + γ1Xi1 + γ2Xi2 + . . . + γpXip + ui + ei,  ui + ei ~ N(0, vi*), (16.3)

where vi* = τ² + vi is the total variance in the observed effect size Ti.

Equation 16.3 is identical to the fixed effects regression model with one exception, the addition of the random effect ui. Under the fixed-effects specification, the study characteristics Xi1, . . . , Xip are presumed to account completely for the variation in the true effect sizes, so that τ² = 0 and therefore vi* = vi. In contrast, the random effects specification allows for the possibility that part of the variability in these true effects is unexplainable by the model. The term mixed effects linear model emphasizes that the model has both fixed effects γ1, . . . , γp and random effects ui, i = 1, . . . , k.

16.4 DATA

In their landmark experiment, Robert Rosenthal and Lenore Jacobson tested all of the children in a single elementary school and, ostensibly based on their results, informed teachers that certain children could be expected to make dramatic intellectual growth in the near future (1968). In fact, these expected bloomers had been assigned at random to the experimental high expectancy condition. The remaining children made up the control group. The experimenters found that, by the end of the year, the experimental children had gained significantly more than the control children in IQ, leading to the conclusion that the teachers had treated the experimental children in ways that fulfilled the expectation of rapid cognitive growth, a demonstration of the self-fulfilling prophecy in education. Practitioners, policymakers, and many advocates of educational reform hailed these results as showing that unequal teacher expectations were creating inequality in the cognitive skills of the nation's children (Ryan 1971).

Naturally, these findings drew the attention of skeptical social scientists, some of whom challenged the study's procedures and findings, leading to a series of re-analyses and ultimately to a number of attempts to replicate those findings in new experiments. I was able to locate nineteen experimental studies of the hypothesis that teacher expectations influence pupil IQ (see Raudenbush 1984; Raudenbush and Bryk 1985). The data from that synthesis are reproduced in table 16.1.

The data from each study produced an estimate Ti = (ȲEi − ȲCi)/Si of the unknown true effect size θi = (μEi − μCi)/σi, where ȲEi, ȲCi are the sample mean outcomes of the experimental and control groups, respectively, within study i; μEi, μCi are the corresponding study-specific population means; Si is the pooled, within-treatment sample standard deviation; and σi is the corresponding population standard deviation.
Table 16.1 Effect of Teacher Expectancy on Pupil IQ

Study   Ti = effect size estimate   vi = sampling variance of Ti   Xi = weeks of prior teacher-student contact
1        0.03    0.015625    2
2        0.12    0.021609    3
3       −0.14    0.027889    3
4        1.18    0.139129    0
5        0.26    0.136161    0
6       −0.06    0.010609    3
7       −0.02    0.010609    3
8       −0.32    0.048400    3
9        0.27    0.026896    0
10       0.87    0.063001    1
11       0.54    0.091204    0
12       0.18    0.049729    0
13      −0.02    0.083521    1
14       0.23    0.084100    2
15      −0.18    0.025281    3
16      −0.06    0.027889    3
17       0.30    0.019321    1
18       0.07    0.008836    2
19      −0.07    0.030276    3

source: Raudenbush and Bryk 1985.

Figure 16.1 Estimated Effect Size Ti as a Function of Xi. (Scatter plot of effect size against prior contact, with a fitted cubic curve; R squared for the cubic fit = 0.502. Vertical axis: effect size; horizontal axis: prior contact, 0.0 to 3.0.)
source: Author's compilation.
Following Larry Hedges (1981), we define the sampling variance of Ti as an estimate of θi to be closely approximated by

vi = (nEi + nCi)/(nEi nCi) + θi²/[2(nEi + nCi)],

where we substitute a consistent estimate (for example, Ti) for the unknown value of θi.

Early commentary on teacher-expectancy experiments suggested that the experimental treatment is difficult to implement if teachers and their students know each other well at the time of the experiment. Essentially, the treatment depends on deception: researchers attempt to convince teachers that certain children can be expected to show dramatic intellectual growth. In fact, those designated are assigned at random to this high expectancy condition, so that if the deception is successful, the only difference between experimental and control children is in the minds of the teachers. Early studies provided anecdotal evidence that the deception will fail if the teachers have had enough previous contact to form their own judgments of the children's capabilities. In those cases, teachers will rely more heavily on their own experience with the children than on test information supplied by outside experts.

For these reasons, we hypothesize that the length of time that teachers and children knew each other at the time of the experiment will be negatively related to a study's effect size. Table 16.1 therefore lists Xi = 0, 1, 2, or 3, depending on whether prior teacher-pupil contact occurred for less than one, one, two, or more than two weeks before the experiment.

A simple visual examination of the data provides quite striking support for this hypothesis. Figure 16.1 shows a scatter plot of the estimated effect size Ti as a function of Xi = weeks of prior teacher-student contact. I have superimposed a cubic spline to summarize this relationship, revealing that the relationship is essentially linear.

16.5 THE LOGIC OF INFERENCE

Let us now consider how to apply the logic of the random effects model 16.3 to answer the six questions listed at the beginning of this chapter, using the teacher expectancy data (table 16.1) for illustration. The results discussed below are summarized in table 16.2.

16.5.1 Do the Effect Sizes Vary?

According to the general model (see equation 16.3), the true effect sizes vary for two reasons.
Table 16.2 Statistical Inferences for Three Models

Parameter                    Homogeneous        Simple Random-     Mixed-Effects
                             Effects Model (a)  Effects Model (b)  Model (c)
Intercept
  γ̂0                         0.060              0.083              0.407
  SE(γ̂0)                     (0.036)            (0.052)            (0.087)
  t-ratio                    1.65               1.62               4.68
Coefficient for X = Weeks
  γ̂1                                                               −0.157
  SE(γ̂1)                                                           (0.036)
  t-ratio                                                          −4.90
Random-Effects Variance, τ²
  τ̂²                                            0.019              0.000
  Q                                             35.83              16.57
  df                                            18                 17
  p-value                                       (p < .01)          (p = .50)
Study-Specific Effects, θ
  min, max(Ti) (d)           (−0.14, 1.18)      (−0.14, 1.18)      (−0.14, 1.18)
  min, max(θi*)                                 (−0.03, 0.25)      (−0.07, 0.41)
  95% P.V. interval for θ                       (−0.19, 0.35)      (−0.07, 0.41)

source: Author's compilation.
note: The homogeneous model was estimated by maximum likelihood; the simple random-effects model and the mixed-effects model were estimated using restricted maximum likelihood.
a. θi = γ0
b. θi = γ0 + ui, Var(ui) = τ²
c. θi = γ0 + γ1Xi + ui, Var(ui | Xi) = τ²
d. In all cases Ti = θi + ei, Var(ei) = vi
There are predictable differences between studies, based on study characteristics, X, and unpredictable differences based on study-specific random effects, u. Therefore, if every regression coefficient γ and every random effect u are zero, all studies will have the same true effect size (θi = γ0 for all i), and the only variation in the estimated effect sizes will arise from the sampling errors, e. This is the case of homogeneous effect sizes. Equation 16.3 thus reduces to

Ti = γ0 + ei,  ei ~ N(0, vi). (16.4)

Every effect size Ti is thus an unbiased estimate of the common effect size γ0, having known variance vi. The best estimate³ of γ0 is then

γ̂0 = Σ vi⁻¹Ti / Σ vi⁻¹, (16.5)

where the sums here and below run over i = 1, . . . , k. The reciprocals of the variances vi are known as precisions, and estimate 16.5 is a precision-weighted average. The precision of estimate 16.5 is the sum of these precisions, that is, Precision(γ̂0) = Σ vi⁻¹, so that the sampling variance of 16.5 is

Var(γ̂0) = (Σ vi⁻¹)⁻¹. (16.6)

Table 16.2 shows the results, γ̂0 = .060, se = √Var(γ̂0) = .036. Under the assumption of homogeneity of effect size, we can test the null hypothesis H0: γ0 = 0. Under this hypothesis, all study effect sizes are null. We therefore compute

z = γ̂0 /√Var(γ̂0) = 1.65, (16.7)

providing modest evidence against the null hypothesis. A 95 percent confidence interval is

γ̂0 ± 1.96√Var(γ̂0) = (−0.011, 0.132).

Such an inference is based, however, on the assump- characterize how different the true effect sizes are from
tion that the effect sizes are homogenous. We need to test each other.
this assumption to discover whether the effect sizes vary A more direct measure of heterogeneity is a good esti-
and, if so, why. mate of 2 itself. We therefore expand the simple model
Under the homogeneous model, 2  0, and the sum of 16.4 to include study specific random effects, yielding
squared differences between the effect size estimate Ti
and their estimated mean  0, each divided by vi, will be Ti  0  ui  ei, ei ~ N(0,vi*), vi*  2  vi , (16.10)
distributed as a central chi-squared random variable with
k  1  18 degrees of freedom. However, if 2  0, the a special case of the general model 16.3 with 1, . . . , p
denominators vi will be too small, suggesting rejection of set to zero. According to this model, none of the variation
the null hypothesis 2  0. Thus, following Hedges and in the true random effects is predictable. Instead, all of
Olkin (1983), we can compute the chi-square test of ho- the variation is summarized by a single parameter, 2.
mogeneity This model has been labeled the simple random effects
k model (Hedges 1983) to distinguish it from the mixed
Q0  vi1(Ti  0)2  35.83, df  18, p  .007, (16.8) random effects model 16.3.
i1 Our point estimates, based on restricted maximum like-
suggesting rejection of the null hypothesis, implying that lihood estimation,4 are
the true effect sizes do indeed vary across studies, that is
2  0. The next logical question is how large this varia-  0  0.084,  2  0.019. (16.11)
tion is.
From this simple result (see table 16.2), we can com-
pute two useful descriptive statistics to characterize the
16.5.2 How Large is the Variation in True magnitude of the heterogeneity in the true effect sizes.
Effect Sizes? 16.5.2.2 Plausible Value Intervals Under our model,
the true effect sizes are normally distributed with esti-
_____
16.5.2.1. A Variation to Signal Ratio One simple in- mated mean .084 and standard deviation   0.019 
dicator of the extent to which studies vary is the simple 0.138. Under this estimated model, 95 percent of the true
ratio (Higgins and Thompson 2002): effect sizes would lie (approximately) between the two
numbers:
H  Q0/(k  1)  35.83/18  1.99 (16.9)
 0  1.96 *    .083  1.96 * 0.138.
Under the null hypothesis, H has an expected value of
unity. Our value of H can be therefore regarded as sug-  (0.187, 0.353). (16.12)
gesting that the variation in the estimated effect sizes Ti,
i  1, . . . , k is about twice what would be expected under We thus estimate that about 95 percent of the true effect
the null hypothesis. A nice feature of this statistic is that sizes lie between 0.187 and 0.353. Stephen Raudenbush
it can be used to compare the heterogeneity of effect sizes and Anthony Bryk called this the 95 percent plausible
from research syntheses that use different measures of ef- value interval for  (Raudenbush and Bryk 2002). We will
fect. For example, one synthesis might use an odds ratio revise this estimate after fitting an explanatory model. Our
to describe the effect of a treatment on a binary outcome, interval suggests that the effect sizes are mostly positive
but another study, like ours, might use as standardized though small negative effect sizes are possible as well.
mean difference as a measure of effect of a treatment on 16.5.2.3 Intra-Class Correlation Julian Higgins and
a continuous outcome. The analyst can meaningfully Simone Thompson suggested several variance ratios to
compare H statistics from the two studies. represent the degree of heterogeneity in a set of esti-
Although useful, H as a measure of heterogeneity of mated effect sizes (2002). In a similar vein, Raudenbush
effect size depends on the design of the studies as well as and Bryk (1985) proposed a measure of reliability with
the magnitude of the differences in the true effect sizes. which study is effect size is measured:
In particular, for any value of 2, H will tend to increase
as the sample sizes within studies grow. Thus, H does not i  2 / ( 2  vi ), (16.13)
ANALYZING EFFECT SIZES: RANDOM-EFFECTS MODELS 303
_______
closely related to the intraclass correlation coefficient. Table 16.2 shows the results,  0  .083, se  Vr (  0) 
Note that this reliability varies from study to study as a 0.052. To test the null hypothesis H0 : 0  0, we therefore
function of vi. An overall measure of reliability is compute
k _______
  i /k .37. (16.14) t  0 / Vr (  0)  0.083/0.052  1.62, (16.17)
i1

Thus about 37 percent of the variation in the estimated This result gives very slightly weaker evidence against
effect sizes reflects variation in the true effect sizes. This the null hypothesis than obtained when we assumed
result cautions analysts not to equate variation in the ob- 2  0. Note that the 95 percent confidence interval for
served effect sizes with true effect size heterogeneity. The the average effect size is now
statistic  is similar to H in that it can also be useful in
_______
comparing syntheses using different outcome metrics.
 0  1.96 * Vr(  0)  .083  1.96 * 0.052
However, like H, it will increase with study sample sizes
and therefore should not be regarded as pure measure of  (0.019, 0.185). (16.18)
effect size variation.
a bit wider than the 95 percent confidence interval based
16.5.3 How Can We Make Valid Inferences on the assumption 2  0 (see Table 16.2).
About the Average Effect Size When
the True Effect Sizes Vary?
16.5.4 Why Do Study Effects Vary?
Recall that fitting model 16.5 provided a point estimate of
 0  .060 under the assumption of constant true effect Having discovered significant variability in the true effect
sizes (2  0). In contrast, we estimated  0  0.083 under sizes, the question naturally arises about what characteris-
the assumption 2  0.019. The formula for computing tics of studies might explain this variation. As mentioned
this estimate is in section 16.3, theoretical considerations suggested that
the experimental treatment would not take hold if teach-
/
k k
 0  vi*1Ti v *1
i , vi*   2  vi . (16.15) ers knew their students well before researchers provided
i 1 i 1 them false information about which children were most
The latter estimate is more credible because the evi- likely to exhibit rapid cognitive growth. This reasoning
dence strongly suggested 2  0, in which case the esti- supports an essentially a priori hypothesis that weeks of
mate based on the assumption 2  0, though consistent, contact between teachers and children before the experi-
is not efficient. Thus, our finding that 2  0 encouraged mental manipulation would negatively predict the true
us in our case to modestly increase the point estimate of effect size. Recall that figure 16.1 provided visual em-
the average true effect size. pirical evidence in support of this hypothesis. To test this
Perhaps more consequentially, under the assumption hypothesis, we pose the model
2  0, model 16.7 provided a test of the null hypothesis
H0 : 0  0 with t  1.65. The problem is that the standard Ti  0  1Xi  ui  ei ui  ei  N(0, vi*), (16.19)
error used in this test is likely too small because it fails to
reflect the extra variation in the effect size estimates when where Xi  0,1,2 indicating the corresponding number of
2  0, as the evidence suggests is the case for our data. weeks and Xi  3 indicating three or more weeks; and
Note that estimate 16.15 is a precision-weighted average vi*  2  vi where 2  Var (X) is now the conditional
where the estimated precisions are the reciprocals of the variance of the true effect sizes controlling for the covari-
estimated study specific variances vi*   2  vi. The esti- ate X  weeks of prior contact. Based on estimation the-
mated precision of this estimate is (approximately) sum ory under restricted maximum likelihood (see section
k
of the estimated precisions, Precison(  0)  vi*1, so that 16A), we find
i1
the estimated sampling variance of (16.15) is  0  0.407, se  0.087, t  4.68
1  1  0.157, se  0.036, t  4.90
( ) (16.20)
k
Vr(  0)  vi*1 . (16.16)
i 1   0.000.
2

304 STATISTICALLY COMBINING EFFECT SIZES

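These computations are simple enough to carry out directly. The following is a minimal sketch in Python, not the chapter's software; it assumes numpy arrays T and v holding the nineteen study estimates and their sampling variances from table 16.1 (not reproduced here), together with the restricted maximum likelihood variance estimate tau2.

    import numpy as np

    def random_effects_summary(T, v, tau2):
        """Summary inferences for the simple random effects model."""
        w = 1.0 / (tau2 + v)                    # precisions v_i*^{-1}
        beta0 = np.sum(w * T) / np.sum(w)       # weighted mean, eq. 16.15
        se = np.sqrt(1.0 / np.sum(w))           # from eq. 16.16
        t = beta0 / se                          # eq. 16.17
        ci = (beta0 - 1.96 * se, beta0 + 1.96 * se)   # eq. 16.18
        lam_bar = np.mean(tau2 / (tau2 + v))    # mean reliability, eqs. 16.13-16.14
        return beta0, se, t, ci, lam_bar

With tau2 = 0.019 and the table 16.1 data, this function should reproduce the values just derived: β̂0 = 0.083, se = 0.052, t = 1.62, a confidence interval of about (−0.019, 0.185), and λ̄ = .37.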
16.5.4 Why Do Study Effects Vary?

Having discovered significant variability in the true effect sizes, the question naturally arises about what characteristics of studies might explain this variation. As mentioned in section 16.3, theoretical considerations suggested that the experimental treatment would not take hold if teachers knew their students well before researchers provided them false information about which children were most likely to exhibit rapid cognitive growth. This reasoning supports an essentially a priori hypothesis that weeks of contact between teachers and children before the experimental manipulation would negatively predict the true effect size. Recall that figure 16.1 provided visual empirical evidence in support of this hypothesis. To test this hypothesis, we pose the model

    T_i = \beta_0 + \beta_1 X_i + u_i + e_i, \qquad u_i + e_i \sim N(0, v_i^*),    (16.19)

where Xi = 0, 1, 2 indicates the corresponding number of weeks and Xi = 3 indicates three or more weeks; and vi* = τ² + vi, where τ² = Var(ui) is now the conditional variance of the true effect sizes controlling for the covariate X = weeks of prior contact. Based on estimation theory under restricted maximum likelihood (see section 16A), we find

    \hat{\beta}_0 = 0.407, \quad se = 0.087, \quad t = 4.68,
    \hat{\beta}_1 = -0.157, \quad se = 0.036, \quad t = -4.90,    (16.20)
    \hat{\tau}^2 = 0.000.

These results (see table 16.2) provide quite strong evidence supporting the hypothesis that weeks of prior contact are negatively related with effect sizes. The model suggests that, when X = 0, the expected effect size is 0.41 in standard deviation units. For each week of additional contact, we can expect the study effect size to diminish by about 0.16 standard deviation units until week X = 3, after which the effect size remains near zero.

16.5.5 How Effective Are Such Models in Accounting for Effect Size Variation?

Our point estimate of the conditional variance of the true effect sizes given X is 0.000, rounded to three decimal places. The inference that this variance is near zero is reinforced by testing the null hypothesis H0: τ² = 0 by means of the test of conditional homogeneity

    Q_1 = \sum_{i=1}^{k} v_i^{-1} (T_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2, \qquad df = 17, \quad p > .50,    (16.21)

where β̂0 and β̂1 are estimated based on the assumption τ² = 0. The evidence comfortably supports retention of the null hypothesis that, within studies having the same value of X, the true effect sizes are invariant.

We can compare the conditional variance of the true effect sizes to the total variance of the true effect sizes to compute an approximate variance-explained statistic:

    \frac{\hat{\tau}^2(\text{total}) - \hat{\tau}^2(\text{given } X)}{\hat{\tau}^2(\text{total})} = \frac{0.019 - 0.000}{0.019} = 1.000,    (16.22)

suggesting in our case that essentially all of the variance in the true effect sizes is explained by the model. The reader should be cautioned that this estimate is uncertain, as the null hypothesis τ² = 0, though retained, is not known to be true. Nevertheless, the results suggest that X quite effectively explains the random effects variance.

16.5.6 What Is the Best Estimate of the Effect for Each Study?

The estimate Ti is an unbiased estimate of δi for each study i, with precision equal to the reciprocal of the variance vi. Although certainly a reasonable estimate, Ti would typically be inaccurate when vi is large, reflecting that study i used a small sample size. Our analysis suggests two alternatives.

16.5.6.1 Unconditional Shrinkage Estimation  A plausible alternative is the estimated average value of δi, that is, β̂0, based on the unconditional model 16.10. In particular, if τ² were known to be small, every δi would likely be near β0, indicating that β̂0 might be a much better estimate of each δi than Ti is. Of course, β̂0 would be a bad choice as an estimator of each δi if τ² were known to be large, particularly if Ti is highly reliable. A compromise estimator is the conditional (posterior) mean

    \hat{E}(\delta_i \mid T_i, \hat{\beta}_0, \hat{\tau}^2) = \hat{\delta}_i^* = \hat{\lambda}_i T_i + (1 - \hat{\lambda}_i) \hat{\beta}_0,    (16.23)

where λ̂i is the estimated reliability of Ti as an estimate of δi (defined in equation 16.13). As λ̂i approaches its maximum of 1.0, all of the weight is assigned to the highly reliable study-specific estimate Ti. However, if study i's information is not particularly reliable, more weight is assigned to the central tendency β̂0. An alternative form of δ̂i* is also instructive about its properties. Simple algebra shows that

    \hat{\delta}_i^* = \frac{v_i^{-1} T_i + \hat{\tau}^{-2} \hat{\beta}_0}{v_i^{-1} + \hat{\tau}^{-2}},    (16.24)

a precision-weighted average. We can regard τ̂⁻² as the estimated precision of β̂0 viewed as an estimator of δi. If the precision of the data from study i is high, that is, if vi⁻¹ is large, δ̂i* places large weight on the study-specific estimate Ti. However, if the true effect sizes are concentrated about their central tendency, τ̂⁻² will be large and the central tendency β̂0 will be accorded large weight in composing δ̂i*. Because δ̂i* tends to be closer to the central tendency than Ti does, δ̂i* is known as a shrinkage estimator: it shrinks outlying values of Ti toward the mean (Morris 1983). Such estimators are routinely computed and can be saved in a file using standard software packages for hierarchical linear models.
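The shrinkage computations in equations 16.23 and 16.24 are equally direct. A minimal sketch in Python under the same assumptions as above; T and v are the table 16.1 arrays, and mu is the estimated center toward which each study is shrunk, here the scalar β̂0 (and later, for the conditional model of section 16.5.6.2, the fitted values β̂0 + β̂1Xi).

    import numpy as np

    def eb_shrinkage(T, v, tau2, mu):
        """Empirical Bayes estimates (eq. 16.23/16.24) and approximate
        posterior variances (eq. 16.25). Assumes tau2 > 0; as tau2
        approaches 0, the estimates collapse to mu and the posterior
        variances to 0."""
        lam = tau2 / (tau2 + v)                  # reliabilities, eq. 16.13
        delta_star = lam * T + (1 - lam) * mu    # eq. 16.23
        post_var = 1.0 / (1.0 / v + 1.0 / tau2)  # eq. 16.25
        return delta_star, post_var

For study 4, with T4 = 1.18, v4 = 0.139, τ̂² = 0.019, and β̂0 = 0.083, the reliability is 0.019/0.158 = 0.12 and the shrinkage estimate is 0.12 × 1.18 + 0.88 × 0.083 ≈ 0.21, matching the discussion below.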
This shrinkage can readily be seen in table 16.2. We see that the estimates Ti range between −0.14 and 1.18, seeming to imply that modest negative effects of teacher expectancy (as low as −0.14) and extraordinarily large effects (as large as 1.18 standard deviation units) are found in these nineteen studies. We know, however, that much of this variation is attributable to noise. Recall that the mean value of λ̂i is 0.37, suggesting that only about 37 percent of the variation in the estimates T reflects variation in the underlying true effect size δ. In particular, study 4 provided the estimate T4 = 1.18. However, that study had the largest sampling variance in the entire data set (note v4 = 0.139, and see table 16.1). We can speculate that this apparent outlying positive effect size reflects noise.

The empirical Bayes estimates δ̂i*, i = 1, . . . , k, having experienced shrinkage to the mean proportional to their unreliability, are much more concentrated, displaying a range of 0.03 to 0.25, suggesting that most of the effects are small and positive. These estimates are plotted against the study-specific estimates in figure 16.2. The experience of study 4 is instructive. Although we found T4 = 1.18, we see that δ̂4* = .21 (case 4 is highlighted in figure 16.2). The seemingly outlying value of T4 has been severely shrunk toward the mean. This result is not surprising when we inspect the reliability of T4, that is, λ̂4 = τ̂²/(τ̂² + v4) = 0.019/(0.019 + 0.139) = 0.12, the lowest in the data set.

[Figure 16.2. Empirical Bayes Estimates δ̂i* (vertical axis) Against Study-Specific Estimates Ti (horizontal axis). Source: Author's compilation. Note: Study 4 provided an unusually large study-specific estimate, T4 = 1.18; the empirical Bayes estimate δ̂4* = 0.21 is much smaller, reflecting the fact that T4 is quite unreliable.]

How much uncertainty is associated with the empirical Bayes point estimates δ̂i* of the effect size? Recall from equation 16.24 that δ̂i* is a precision-weighted average. The precision of this estimate is (approximately) the sum of the precisions, that is, Precision(δi | Ti, β̂0, τ̂²) = vi⁻¹ + τ̂⁻², so that the approximate posterior variance is

    Var(\delta_i \mid T_i, \hat{\beta}_0, \hat{\tau}^2) = (v_i^{-1} + \hat{\tau}^{-2})^{-1}.    (16.25)

Figure 16.3 displays the 95 percent posterior confidence intervals for the nineteen effect sizes, rank ordered by their central values. The widest interval, not surprisingly, is associated with study 4.

[Figure 16.3. Bayes 95 Percent Posterior Confidence Intervals, Simple Random-Effects Model. Source: Author's compilation.]

The intervals in figure 16.3 are based on the assumption that the parameter τ² is equal to its estimated value, τ̂², and that β0 = β̂0. Indeed, the term empirical Bayes is rooted in the idea that the parameters of the prior distribution of the true effect size δi are estimated empirically from the data. Basing δ̂i* on the approximation τ² ≈ τ̂² is reasonable when the number of studies, k, is large. In our case, with k = 19, this approximation is under some doubt. Raudenbush and Bryk re-estimated the posterior distribution of δi using a fully Bayesian approach that takes into account the uncertainty in the data about τ² (2002, chapter 13; for an extensive discussion of this issue, see Seltzer 1993).⁵

The reader may have noted that the range of the empirical Bayes estimates (from 0.03 to 0.25) is narrower than the 95 percent plausible value interval (see table 16.2) for δi, given by (−0.23, 0.41). This is a typical result: the empirical Bayes estimates are shrinkage estimators. These shrinkage estimators reduce the expected mean-squared error of the estimates, but in doing so they tend to overshrink, in the sense that the ensemble of empirical Bayes estimates δ̂i*, i = 1, . . . , k, will generally be more condensed than the distribution of the true effect sizes δi, i = 1, . . . , k. This bias toward the mean implies that δ̂i* will underestimate truly large positive values of δi and overestimate truly large negative values of δi.
One suffers some bias to obtain small posterior variance, leading to a small mean-squared error. That is, E[(δ̂i* − δi)² | δi] = Bias²(δ̂i* | δi) + Var(δ̂i* | δi) will be small because the penalty associated with the bias will be offset by the reduction in variance. Indeed, Raudenbush (1988) showed that the ratio of E[(δ̂i* − δi)² | δi] to the mean-squared error of the conventional estimate, that is, E[(Ti − δi)² | δi], is, on average, approximately λi. The lower the reliability of Ti, then, the greater is the benefit of using the empirical Bayes estimator δ̂i*.

16.5.6.2 Conditional Shrinkage Estimation  The empirical Bayes estimators discussed so far and plotted in figures 16.2 and 16.3 are based on the model δi = β0 + ui, ui ~ N(0, τ²). This model assumes that the true effect sizes are exchangeable; that is, we have no prior information to suspect that any one effect size is larger than any other. This assumption of exchangeability is refutable based on earlier theory and the data in figure 16.1, suggesting that the duration of contact between teachers and children prior to the experiment will negatively predict the true effect size. This reasoning encouraged us to estimate the conditional model δi = β0 + β1Xi + ui, ui ~ N(0, τ²), where Xi = weeks of prior contact, and the random effects ui, i = 1, . . . , k, are now conditionally exchangeable. That is, once we take weeks of previous contact into account, the random effects are exchangeable. The results, shown in table 16.2, provided strong support for including Xi in the model. This finding suggests an alternative and presumably superior empirical Bayes estimate of each study's effect size:

    \hat{E}(\delta_i \mid T_i, X_i, \hat{\beta}_0, \hat{\beta}_1, \hat{\tau}^2) = \hat{\delta}_i^* = \hat{\lambda}_i T_i + (1 - \hat{\lambda}_i)(\hat{\beta}_0 + \hat{\beta}_1 X_i).    (16.26)

Equation 16.26 shrinks outlying values of Ti not toward the overall mean but rather toward the predicted value β̂0 + β̂1Xi. The shrinkage will be large when the reliability λ̂i = τ̂²/(τ̂² + vi) is small. Recall that, after controlling for Xi, we found τ̂² = 0.000. Our reliability is now conditional; that is, λi is now the reliability with which we can discriminate between studies that share the same number of Xi = weeks of prior contact. The evidence suggests that such discrimination is not possible based on these data, given that the study effect sizes within levels of Xi appear essentially homogeneous.

The result, then, is δ̂i* = β̂0 + β̂1Xi. The 95 percent posterior confidence intervals for these conditional shrinkage estimators are plotted in figure 16.4. All studies having the same value of Xi share the same value of δ̂i* because our estimate of the posterior variance, given the regression coefficients, is now

    Var(\delta_i \mid T_i, X_i, \hat{\beta}_0, \hat{\beta}_1, \hat{\tau}^2) = (v_i^{-1} + \hat{\tau}^{-2})^{-1} = \hat{\tau}^2 / (1 + \hat{\tau}^2 v_i^{-1}) = 0.000

when τ̂² = 0.000. This result also has implications for our belief about the range of plausible effect sizes. We see in table 16.2 that the empirical Bayes estimates range from −0.07 (when X = 3) to 0.41 (when X = 0). The 95 percent plausible value interval is the same, that is, (−0.07, 0.41).⁶

[Figure 16.4. Bayes 95 Percent Posterior Confidence Intervals, Mixed-Effects Model. Source: Author's compilation.]

16.6 SUMMARY AND CONCLUSIONS

16.6.1 Advantages of the Approach

This chapter has described several potential advantages of using a random effects approach to quantitative research synthesis. First, the random effects conceptualization is consistent with standard scientific aims of generalization: studies under synthesis can be viewed as representative of a larger population or universe of implementations of a treatment or observations of a relationship.
Second, the random effects approach allows a parsimonious summary of results when the number of studies is very large. In such large-sample syntheses, it may be impossible to formulate fully adequate explanatory models for effects of study characteristics on study outcomes. The random effects approach enables the investigator to represent the unexplained variation in the true effect sizes with a single parameter, τ², and to make summary statements about the range of likely effects in a particular area of research even when the sources of discrepancies in study results are poorly understood. In addition, re-estimating the random effects variance after adding explanatory variables makes possible a kind of percentage-of-variance-explained criterion for assessing the adequacy of the explanatory models.

Third, the statistical methods described in this chapter allow the investigator to draw inferences about average effect sizes that account for the extra uncertainty that arises when the studies under synthesis are heterogeneous in their true effects. This uncertainty arises because, even despite concerted attempts at replication, study contexts, treatments, and procedures will inevitably vary in many ways that may influence results.

Fourth, the random effects approach facilitates improved estimation of the effect of each study using empirical Bayes or Bayes shrinkage. To the extent each study's estimated effect size lacks reliability, one can borrow information from the entire ensemble of studies to gain a more accurate estimate for each study.

16.6.2 Threats to Valid Statistical Inference

The problem is that the advantages of the random effects conceptualization are purchased at the price of two extra assumptions. First, the approach we have presented treats the random effects variance τ² as if it were known, though, in fact, it must be estimated from the data. Second, to allow for hypothesis testing, we have assumed that the random effects are normally distributed with constant variance given the predictors in the between-study model.

16.6.2.1 Uncertainty About the Variance  In standard practice and in the methods presented here, estimates of the overall mean effect size or regression coefficients are weighted estimates, and the weights depend on the unknown variance of the random effects. The truth is that this variance is uncertain, especially in small-sample syntheses, so that the weights are in reality random variables that inject uncertainty into inferences. Conventional methods of inference do not account for this source of uncertainty, leading to confidence intervals for regression coefficients and individual study effect sizes that are too short. This problem is most likely to arise when k is small and when the sampling variances vi are highly variable.

Donald Rubin described this problem and approaches to addressing it in the context of parallel randomized experiments (1981). Raudenbush and Bryk described a procedure for sensitivity analysis in research synthesis that allows the investigator to gauge the likely consequence of uncertainty about the variance for inferences about means and regression coefficients in effect size analyses (1985). In appendix 16B.2, I review an approach proposed by Guido Knapp and Joachim Hartung that appears to offer help in ensuring accurate standard errors (2003).

16.6.2.2 Failure of Parametric Assumptions  Standard practice requires the assumption that the random effects are normally distributed with constant variance. This assumption is difficult to assess when the number of studies is small. It should be emphasized that this assumption is in addition to the usual assumption that the error of Ti as an estimate of δi is normal. Hedges investigated this latter assumption intensively (1981). Michael Seltzer developed a robust estimation procedure that allows the analyst to assume that the random effects are t-distributed rather than normal (1993). As the degrees of freedom of the t increase, the results converge to those based on the normality assumption. The robust estimates are resistant to influence by outliers among the random effects. Appendix 16B.3 describes and illustrates a Huber-White robust estimator that is useful in ensuring that standard errors, tests, and confidence intervals are not sensitive to parametric assumptions. However, this approach works best when k is moderately large.

Alternative methods of point estimation have characteristic strengths and limitations. Maximum likelihood, either full or restricted, requires distributional assumptions about the random effects but provides efficient estimates in large samples when these assumptions are met. The method of moments requires assumptions only about the mean and variance of the random effects. However, the estimators, while consistent, are not efficient. I illustrate their application in appendix 16A.
16.6.2.3 Problems of Model Misspecification and Capitalizing on Chance  As in any linear model analysis, estimates of regression coefficients will be biased if confounding variables (exogenous predictors related both to the explanatory variables in the model and to the random effects) are not included in the model. This is the classic problem of underfitting. The opposite problem is overfitting, when the investigator includes a large number of candidate explanatory variables as predictors, some of which may seem to improve prediction strictly as a result of chance. The best solution to both problems seems to be to develop strong a priori theory about study characteristics that might affect outcomes and to include as many studies as possible. John Hunter and Frank Schmidt have warned especially against the overfitting that occurs when investigators carry out an extensive fishing expedition for explanatory variables even when there may be little true variance in the effect sizes (1990). In the absence of strong a priori theory, they advocate simple summaries using a random effects approach where the key parameters are the mean and variance of the true effect sizes. Raudenbush described a method for deriving a priori hypotheses by carefully reading the discussion sections of early studies to find speculations testable on the basis of later studies (1983).

16.6.2.4 Multiple Effect Sizes  Most studies use several outcome variables. Such variables will tend to be correlated, and syntheses that fail to account for the resulting dependency may commit statistical errors (see Hedges and Olkin 1985, chapter 19; Rosenthal and Rubin 1986; Kalaian and Raudenbush 1996).

16.6.2.5 Inferences About Particular Effect Sizes  In some cases, the investigator may wish to conceive of the effect sizes as random and yet make inferences about particular effects. For example, it may be of interest to assess the largest effect size in a series of studies or to identify for more intensive investigation effect sizes that are somehow unusual. The need to make such inferences, however, poses a problem. On the one hand, each study's data may yield an imprecise estimate based on chance alone. On the other, the grand mean effect or the value predicted on the basis of a regression model may produce a poor indicator of the effects of any particular study. Rebecca DerSimonian and Nan Laird (1986), Donald Rubin (1981), and Stephen Raudenbush and Anthony Bryk (1985) have all described methods for combining data from all studies to strengthen inferences about the effects in each particular study and to create a reasonably accurate picture of the effects of a collection of studies. I illustrated these ideas in section 16.5.6.

16A APPENDIX A: ALTERNATIVE APPROACHES TO POINT ESTIMATION

All the results presented so far (see table 16.2) were based on restricted maximum likelihood estimation of the between-study variance, τ², with weighted least squares estimation of the regression coefficients β0, β1, and empirical Bayes estimation of the study-specific effect sizes δi, i = 1, . . . , k. Many would agree that this is a reasonable approach. However, there are alternative approaches, and each has merits as well as vulnerabilities. In this and the next appendix, I consider and compare the alternatives. They differ along two dimensions. First, I consider three approaches to point estimation: full maximum likelihood, restricted maximum likelihood, and the method of moments. I then review three alternatives for computing uncertainty estimates such as standard errors, hypothesis tests, and confidence intervals.

This appendix presents general results for point estimation and illustrates them with application to the teacher expectancy data. Tables 16.3 and 16.4 collect these results in a concise form for the simple random effects model Ti = β0 + ui + ei.

16A.1 Full Maximum Likelihood

Full maximum likelihood (FMLE) is based on the joint likelihood of the regression coefficients and the variance τ². It ensures large-sample efficiency of these inferences based on distributional assumptions about the true effect sizes. The key criticism is that estimates of τ² do not reflect uncertainty about the unknown regression coefficients (the βs). The estimates of τ² will therefore be negatively biased. Note that the FMLE estimates of τ² are smaller than those provided by other methods in table 16.3. This bias may lead to overly liberal hypothesis tests and confidence intervals for the regression parameters. A counter-argument is that FMLE provides estimates of τ² that have smaller mean-squared error than other approaches do, as long as the number of predictors in the model, p + 1, is not too large. FMLE achieves efficiency by reducing sampling variance at the expense of a negative bias that diminishes rapidly as k increases. One FMLE benefit is that any two nested models can be compared using the likelihood ratio statistic. However, the likelihood ratio test is a large-sample approach, and there are some questions about its validity in our case with k = 19.

To simplify the presentation, we express the general model (equation 16.3) in matrix notation:

    T_i = X_i^T \beta + u_i + e_i, \qquad u_i \sim N(0, \tau^2), \quad e_i \sim N(0, v_i),    (16.27)

where Xi = (1, Xi1, . . . , Xip)ᵀ is a column vector of dimension (p + 1) of known characteristics of study i used as predictors of the effect size; β = (β0, . . . , βp)ᵀ is a column vector of dimension p + 1 containing unknown regression coefficients; T = (T1, . . . , Tk)ᵀ is a column vector of dimension k containing the estimated effect sizes; and vi* = vi + τ² is the variance of Ti.
Table 16.3 Statistical Inferences for β0

Method of Estimation     τ̂²        β̂0      SE      t = β̂0/SE    95% Confidence Interval for β0

Fixed Effects MLE
  Conventional(a)        0.000(d)   0.060   0.036   1.65         (−0.016, 0.136)
  Quasi-F(b)             0.000      0.060   0.051   1.17         (−0.047, 0.167)
  Huber-White(c)         0.000      0.060   0.040   1.51         (−0.024, 0.136)
Full MLE
  Conventional(a)        0.012      0.078   0.048   1.64         (−0.016, 0.136)
  Quasi-F(b)             0.012      0.078   0.061   1.27         (−0.047, 0.167)
  Huber-White(c)         0.012      0.078   0.048   1.62         (−0.024, 0.136)
Restricted MLE
  Conventional(a)        0.018      0.083   0.052   1.62         (−0.025, 0.193)
  Quasi-F(b)             0.018      0.083   0.063   1.32         (−0.048, 0.216)
  Huber-White(c)         0.018      0.083   0.050   1.68         (−0.021, 0.189)
Method of Moments
  Conventional(a)        0.026      0.089   0.056   1.60         (−0.028, 0.206)
  Quasi-F(b)             0.026      0.089   0.064   1.40         (−0.044, 0.222)
  Huber-White(c)         0.026      0.089   0.052   1.71         (−0.020, 0.199)

source: Author's compilation.
notes: For the simple random effects model using three methods of point estimation and three methods of estimating uncertainty.
(a) Conventional: \widehat{Var}(\hat{\beta}_0) = \left[ \sum_{i=1}^{k} (\hat{\tau}^2 + v_i)^{-1} \right]^{-1}.
(b) The quasi-F approach is equivalent to assuming \widehat{Var}(\hat{\beta}_0) = \hat{q} = \sum_{i=1}^{k} (\hat{\tau}^2 + v_i)^{-1} (T_i - \hat{\beta}_0)^2 \Big/ \left[ (k - 1) \sum_{i=1}^{k} (\hat{\tau}^2 + v_i)^{-1} \right].
(c) Huber-White: \widehat{Var}(\hat{\beta}_0) = \sum_{i=1}^{k} (\hat{\tau}^2 + v_i)^{-2} (T_i - \hat{\beta}_0)^2 \Big/ \left[ \sum_{i=1}^{k} (\hat{\tau}^2 + v_i)^{-1} \right]^2.
(d) For the fixed effects model, τ² = 0 by assumption.
The log-likelihood for FMLE is the density of the observed effect sizes regarded as a function of the parameters given the sample:

    L_F(\beta, \tau^2; T) = \text{constant} + \sum_{i=1}^{k} \ln(v_i^{*-1}) - \sum_{i=1}^{k} v_i^{*-1} (T_i - X_i^T \beta)^2.    (16.28)

The likelihood is maximized by solving

    dL_F / d\beta = \sum_{i=1}^{k} v_i^{*-1} X_i (T_i - X_i^T \beta) = 0,    (16.29)

    dL_F / d\tau^2 = -\sum_{i=1}^{k} v_i^{*-1} + \sum_{i=1}^{k} v_i^{*-2} (T_i - X_i^T \beta)^2 = 0.    (16.30)

The resulting estimating equations are

    \hat{\beta} = \left( \sum_{i=1}^{k} \hat{v}_i^{*-1} X_i X_i^T \right)^{-1} \sum_{i=1}^{k} \hat{v}_i^{*-1} X_i T_i    (16.31)

and

    \hat{\tau}^2 = \sum_{i=1}^{k} \hat{v}_i^{*-2} \left[ (T_i - X_i^T \hat{\beta})^2 - v_i \right] \Big/ \sum_{i=1}^{k} \hat{v}_i^{*-2}.    (16.32)

These estimating equations have the interesting form of weighted least squares estimates. The weights for estimating the regression coefficients are the precisions v̂i*⁻¹, the reciprocals of E(Ti − Xiᵀβ)², while the weights for estimating the variance are v̂i*⁻², inversely proportional to Var[(Ti − Xiᵀβ)²].
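For fixed weights, equation 16.31 is an ordinary weighted least squares computation. A minimal sketch in Python (not the chapter's software, which was HLM): X is assumed to be the k × (p + 1) design matrix with a leading column of ones (for the teacher expectancy meta-regression, a second column holds weeks of prior contact, coded 0, 1, 2, 3), and T, v, and tau2 are as before.

    import numpy as np

    def wls_beta(T, X, v, tau2):
        """Weighted least squares estimate of beta (eq. 16.31) with
        weights 1/(tau2 + v_i), plus the conventional covariance matrix."""
        w = 1.0 / (tau2 + v)
        XtW = X.T * w                    # scales column i of X.T by w_i
        cov = np.linalg.inv(XtW @ X)     # conventional Var(beta-hat)
        beta = cov @ (XtW @ T)
        return beta, cov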
Table 16.4 Statistical Inferences for τ²

Method of Estimation     τ̂²      Reliability, λ̄(a)    95% Plausible Value Interval for δi(b)

Full MLE                 0.012    .29                  (−0.14, 0.30)
Restricted MLE           0.018    .37                  (−0.19, 0.35)
Method of Moments        0.026    .51                  (−0.23, 0.41)

source: Author's compilation.
notes: For the simple random effects model using three methods of point estimation.
(a) \bar{\lambda} = \sum_{i=1}^{19} \hat{\lambda}_i / 19, \quad \hat{\lambda}_i = \hat{\tau}^2 / (\hat{\tau}^2 + v_i).
(b) 95% plausible value interval for δi = β̂0 ± 1.96τ̂.

Of course, the two equations cannot be solved in closed form. However, it is easy to write a small program that begins with an initial estimate of β, for example using weighted least squares with τ̂² set to zero in computing the weights, so that the weight v̂i*⁻¹ is set initially to vi⁻¹. These estimates are then used to produce an initial estimate of τ² using weights v̂i*⁻² set initially to vi⁻². This initial estimate of τ² is then used in constructing new weights with which to re-estimate β and τ² using 16.31 and 16.32. This procedure continues until convergence. Alternatively, standard statistical programs for mixed models that allow known but varying level-1 variances can be used to estimate this model. I used HLM version 6.0 for the computation of the results for FMLE shown in table 16.3.

For the simple random effects model (table 16.3), the estimating equations 16.31 and 16.32 reduce to

    \hat{\beta}_0 = \sum_{i=1}^{k} \hat{v}_i^{*-1} T_i \Big/ \sum_{i=1}^{k} \hat{v}_i^{*-1}    (16.33)

and

    \hat{\tau}^2 = \sum_{i=1}^{k} \hat{v}_i^{*-2} \left[ (T_i - \hat{\beta}_0)^2 - v_i \right] \Big/ \sum_{i=1}^{k} \hat{v}_i^{*-2}.    (16.34)
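The iteration described above takes only a few lines. A minimal sketch in Python for the simple random effects model, cycling equations 16.33 and 16.34 (the τ² update is truncated at zero, since a variance cannot be negative):

    import numpy as np

    def fmle_simple(T, v, tol=1e-8, max_iter=500):
        """Iterate eqs. 16.33 and 16.34 to convergence; returns (beta0, tau2)."""
        tau2 = 0.0                                 # initial weights are 1/v_i
        for _ in range(max_iter):
            w = 1.0 / (tau2 + v)                   # v_i*^{-1}
            beta0 = np.sum(w * T) / np.sum(w)      # eq. 16.33
            tau2_new = max(np.sum(w**2 * ((T - beta0)**2 - v)) / np.sum(w**2), 0.0)  # eq. 16.34
            if abs(tau2_new - tau2) < tol:
                return beta0, tau2_new
            tau2 = tau2_new
        return beta0, tau2

Applied to the table 16.1 data, this should reproduce the Full MLE row of table 16.3 (β̂0 ≈ 0.078, τ̂² ≈ 0.012) to within numerical tolerance.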
16A.2 Restricted Maximum Likelihood

Restricted maximum likelihood (RMLE) is based on the likelihood of the variance τ² given the data, with the regression coefficients removed through integration (Dempster, Rubin, and Tsutakawa 1981).⁷ The advantage of RMLE versus FMLE is that RMLE estimates of τ² take into account the uncertainty about β, yielding nearly unbiased estimates and possibly improving the standard errors, hypothesis tests, and intervals for β. However, the RMLE estimates of τ² will be slightly less efficient than those based on FMLE unless the number of parameters, p + 1, in β is comparatively large.

The log-likelihood for RMLE is the density of the observed effect sizes regarded as a function of τ² alone, given the sample (Raudenbush and Bryk 1985):

    L_R(\tau^2; T) = \text{constant} + \sum_{i=1}^{k} \ln(v_i^{*-1}) - \ln \left| \sum_{i=1}^{k} v_i^{*-1} X_i X_i^T \right| - \sum_{i=1}^{k} v_i^{*-1} (T_i - X_i^T \hat{\beta})^2.    (16.35)

The log-likelihood is maximized by solving

    dL_R / d\tau^2 = -\sum_{i=1}^{k} v_i^{*-1} + tr \left[ \left( \sum_{i=1}^{k} v_i^{*-1} X_i X_i^T \right)^{-1} \sum_{i=1}^{k} v_i^{*-2} X_i X_i^T \right] + \sum_{i=1}^{k} v_i^{*-2} (T_i - X_i^T \hat{\beta})^2 = 0.    (16.36)

The resulting estimating equation is

    \hat{\tau}^2 = \left\{ \sum_{i=1}^{k} \hat{v}_i^{*-2} \left[ (T_i - X_i^T \hat{\beta})^2 - v_i \right] + tr \left[ \left( \sum_{i=1}^{k} \hat{v}_i^{*-1} X_i X_i^T \right)^{-1} \sum_{i=1}^{k} \hat{v}_i^{*-2} X_i X_i^T \right] \right\} \Big/ \sum_{i=1}^{k} \hat{v}_i^{*-2}.    (16.37)

Note that this estimating equation is equivalent to that for the case of FMLE (equation 16.32) except for the second term in the numerator of 16.37. It is instructive to compare the form of these in the balanced case, that is, vi = v for all i. In that case, the FMLE estimating equation becomes

    \hat{\tau}^2 = \sum_{i=1}^{k} (T_i - X_i^T \hat{\beta})^2 / k - v

and the RMLE estimating equation becomes

    \hat{\tau}^2 = \sum_{i=1}^{k} (T_i - X_i^T \hat{\beta})^2 / (k - p - 1) - v,

where β̂ = (Σ XiXiᵀ)⁻¹ Σ XiTi. Note that the denominator of the RMLE estimator is k − p − 1 rather than k, as it is in the case of the FMLE estimator, revealing how RMLE accounts for the uncertainty associated with having to estimate β. These equations for the balanced case are solvable in closed form.

In the typical case of unbalanced data, the RMLE equation (16.37) for the new τ̂² depends on the current estimate τ̂² because τ̂² is needed for the weights. This problem can easily be solved using an iterative program very similar to that used for FMLE. However, I used HLM version 6.0 for the computation of the results for RMLE shown in table 16.3.

For the simple random effects model, the estimating equation for τ² reduces to

    \hat{\tau}^2 = \sum_{i=1}^{k} \hat{v}_i^{*-2} \left[ (T_i - \hat{\beta}_0)^2 - v_i \right] \Big/ \sum_{i=1}^{k} \hat{v}_i^{*-2} + \left( \sum_{i=1}^{k} \hat{v}_i^{*-1} \right)^{-1},    (16.38)

where β̂0 = Σ v̂i*⁻¹Ti / Σ v̂i*⁻¹.

16A.3 Method of Moments

The key benefit of a likelihood-based approach is that estimates will be efficient and hypothesis tests powerful in large samples. To obtain these benefits, however, one must assume a distribution for the random effects ui, i = 1, . . . , k. We have assumed normality of these random effects throughout.

A popular alternative approach is the method of moments (MOM), which requires assumptions about the first two moments of ui but not about the probability density. For this purpose, we assume that ui, i = 1, . . . , k, are mutually independent with zero means and common variance τ². The approach will ensure consistent but not necessarily efficient estimates of the parameters.

MOM uses a two-step approach. One begins with a comparatively simple regression to estimate β. One then equates the residual sum of squares from this regression to its expected value and solves for τ². This estimate of τ² is then used to construct weights v̂i*⁻¹ = (τ̂² + vi)⁻¹ to be used in a second weighted least squares regression. I considered two approaches to the first step, the computation of the simple regression: ordinary least squares (OLS) and weighted least squares using the known vi⁻¹ as weights (WLS, known weights).

16A.3.1 MOM Using OLS Regression at Stage 1  The OLS estimates for the general model are

    \hat{\beta}_{OLS} = \left( \sum_{i=1}^{k} X_i X_i^T \right)^{-1} \sum_{i=1}^{k} X_i T_i,    (16.39)

and the expected residual sum of squares is

    E(RSS) = E \sum_{i=1}^{k} (T_i - X_i^T \hat{\beta}_{OLS})^2 = \sum_{i=1}^{k} v_i - tr(S) + (k - p - 1)\tau^2,    (16.40)

yielding the estimating equation

    \hat{\tau}^2 = \left[ RSS - \sum_{i=1}^{k} v_i + tr(S) \right] \Big/ (k - p - 1),    (16.41)

where

    S = \left( \sum_{i=1}^{k} X_i X_i^T \right)^{-1} \sum_{i=1}^{k} v_i X_i X_i^T.

In the case of the simple random effects model, 16.41 reduces to

    \hat{\tau}^2 = \sum_{i=1}^{k} (T_i - \bar{T})^2 / (k - 1) - \bar{v},

where T̄ = Σ Ti/k and v̄ = Σ vi/k.

16A.3.2 MOM Using WLS with Known Weights  The WLS estimates with weights vi⁻¹ are

    \hat{\beta}_{WLS} = \left( \sum_{i=1}^{k} v_i^{-1} X_i X_i^T \right)^{-1} \sum_{i=1}^{k} v_i^{-1} X_i T_i,    (16.42)

and the expected weighted residual sum of squares is

    E(WRSS) = E \sum_{i=1}^{k} v_i^{-1} (T_i - X_i^T \hat{\beta}_{WLS})^2 = \tau^2 \, tr(M) + k - p - 1,

yielding the estimating equation

    \hat{\tau}^2 = \left[ WRSS - (k - p - 1) \right] \Big/ tr(M),    (16.43)

where

    tr(M) = \sum_{i=1}^{k} v_i^{-1} - tr \left[ \left( \sum_{i=1}^{k} v_i^{-1} X_i X_i^T \right)^{-1} \sum_{i=1}^{k} v_i^{-2} X_i X_i^T \right].

In the case of the simple random effects model, 16.43 reduces to

    \hat{\tau}^2 = \left[ \sum_{i=1}^{k} v_i^{-1} (T_i - \bar{T})^2 - (k - 1) \right] \Big/ tr(M),

where tr(M) = Σ vi⁻¹ − Σ vi⁻² / Σ vi⁻¹ and T̄ = Σ vi⁻¹Ti / Σ vi⁻¹.
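The simple-model form of equation 16.43 is noniterative, which is part of its appeal; it coincides with the widely used estimator of Rebecca DerSimonian and Nan Laird (1986), cited in section 16.6.2.5. A minimal sketch in Python under the same data assumptions as before:

    import numpy as np

    def mom_wls_simple(T, v):
        """Method-of-moments tau2 with known weights 1/v_i (eq. 16.43
        specialized to the simple random effects model)."""
        w = 1.0 / v
        T_bar = np.sum(w * T) / np.sum(w)      # precision-weighted mean
        Q = np.sum(w * (T - T_bar)**2)         # weighted residual sum of squares
        tr_M = np.sum(w) - np.sum(w**2) / np.sum(w)
        return max((Q - (len(T) - 1)) / tr_M, 0.0)

For the teacher expectancy data, the chapter reports τ̂² = 0.026 under this approach (tables 16.3 and 16.4).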
16A.4 Results

The results for the simple random effects model based on the three approaches, FMLE, RMLE, and MOM, are provided in table 16.3. We focus on the results for the simple random effects model, for which there is clear evidence against the null hypothesis H0: τ² = 0. Recall that when Xi = weeks of prior contact is included in the model, we can comfortably retain H0: τ² = 0, and the differences among FMLE, RMLE, and MOM are no longer of great interest.

For MOM, the OLS approach produced an unrealistically large mean effect size as well as an unrealistically large estimate of τ². This problem seems to reflect the fact that the illustrative data have widely varying sampling variances vi and some outlying values of Ti associated with studies having large vi. In this setting, the OLS results, which weight each study's results equally in computing the initial estimates, appear suboptimal. As a result, the MOM results in table 16.3 are those based on WLS with known weights, vi⁻¹. This experience suggests caution in using the OLS approach despite its presence in the literature (for example, Hedges 1983).

I focus here on the conventional approach to estimating standard errors because the two other approaches (quasi-F and Huber-White) will be considered in appendix 16B.

As expected, the point estimate of τ² is larger under RMLE (τ̂² = 0.019) than under FMLE (τ̂² = 0.012). As a result, the conventional tests of the null hypothesis H0: β0 = 0 are a bit more conservative under RMLE than under FMLE, and the conventional 95 percent confidence intervals are a bit wider under RMLE than under FMLE. MOM based on WLS with known weights produces a larger point estimate of τ² (τ̂² = 0.026) than either RMLE or FMLE and therefore produces somewhat more conservative inferences about β0.

The main conclusion to be drawn from the comparison is that the conventional inferences about β0 are quite similar across FMLE, RMLE, and MOM for these data. What differs across the three methods are the inferences about τ² itself and, relatedly, about the range of plausible values of the true effect sizes δi. These differences are apparent in table 16.4. Note the wider 95 percent plausible value interval for δi under MOM than under RMLE, and the wider plausible value interval under RMLE than under FMLE. The reliabilities differ as well. Under MOM, about 51 percent of the variation in the observed effect sizes appears to reflect variation in the true effect sizes. The corresponding values for RMLE and FMLE are 37 percent and 29 percent, respectively.

16B APPENDIX B: ALTERNATIVE APPROACHES TO UNCERTAINTY ESTIMATION

This appendix describes the conventional approach to computing hypothesis tests and confidence intervals based on large-sample theory. It then reviews two alternative approaches: a quasi-F statistic and a Huber-White robust variance estimator. I illustrate the application of these in the case of the simple random effects model. The results are presented in table 16.3.

16B.1 Conventional Approach

Inferences about the central tendency of the true effect sizes, captured in the regression coefficients β, are conventionally based on the large-sample variance of β̂. As k → ∞, the estimated weights v̂i*⁻¹ = (vi + τ̂²)⁻¹ converge to the true weights vi*⁻¹ = (vi + τ²)⁻¹, and the estimated variance of the regression coefficients converges to

    Var(\hat{\beta} \mid \hat{\tau}^2 = \tau^2) = \left( \sum_{i=1}^{k} v_i^{*-1} X_i X_i^T \right)^{-1}.

The conventional inferences then treat the square roots of the diagonal elements of V̂ar(β̂ | τ̂² = τ²) as known standard errors in computing hypothesis tests and confidence intervals.

For example, in the case of the simple random effects model, section 16.5 used the ratio β̂0/√Var(β̂0 | τ̂² = τ²) to test the null hypothesis H0: β0 = 0, where Var(β̂0 | τ̂² = τ²) = (Σ v̂i*⁻¹)⁻¹. One difficulty with this approach is that, when k is modest, the assumption that the v̂i*⁻¹ = (vi + τ̂²)⁻¹ have converged to the true weights vi*⁻¹ = (vi + τ²)⁻¹, and therefore that Var(β̂ | τ̂² = τ²) = Var(β̂), will be difficult to defend, leading to overly liberal tests of H0: β0 = 0 and confidence intervals for β0.
16B.2 A Quasi-F Statistic

To cope with this problem, Joachim Hartung and Guido Knapp proposed an alternative test for the simple random effects model and extended this approach to the case where there is a single between-study predictor X (Hartung and Knapp 2001; Knapp and Hartung 2003). They provided evidence that under certain simulated conditions, their approach produced more accurate tests and intervals than does the conventional approach.

The approach appears to be based on what might be called a quasi-F statistic. For the simple random effects model we compute

    F^*(1, k - p - 1) = \hat{\beta}_0^2 / \hat{q},    (16.44)

where

    \hat{q} = \left[ \sum_{i=1}^{k} \hat{v}_i^{*-1} (T_i - \hat{\beta}_0)^2 \Big/ \sum_{i=1}^{k} \hat{v}_i^{*-1} \right] \Big/ (k - 1),    (16.45)

with

    \hat{\beta}_0 = \sum_{i=1}^{k} \hat{v}_i^{*-1} T_i \Big/ \sum_{i=1}^{k} \hat{v}_i^{*-1}.

We can motivate this approach as follows. We know that, if the weights vi*⁻¹ were known, the estimate

    \hat{\beta}_0 = \sum_{i=1}^{k} v_i^{*-1} T_i \Big/ \sum_{i=1}^{k} v_i^{*-1}

would have variance

    Var(\hat{\beta}_0) = 1 \Big/ \sum_{i=1}^{k} v_i^{*-1}.

Therefore the statistic

    (\hat{\beta}_0 - \beta_0)^2 \sum_{i=1}^{k} v_i^{*-1}

would be distributed as chi-square with one degree of freedom. Moreover, the weighted sum of squared residuals

    \sum_{i=1}^{k} v_i^{*-1} (T_i - \hat{\beta}_0)^2

would be distributed as chi-square with k − 1 degrees of freedom. Hence the ratio

    F(1, k - 1) = \frac{(\hat{\beta}_0 - \beta_0)^2 \sum_{i=1}^{k} v_i^{*-1}}{\sum_{i=1}^{k} v_i^{*-1} (T_i - \hat{\beta}_0)^2 / (k - 1)}    (16.46)

would be distributed as a central F statistic with degrees of freedom (1, k − 1). Equation 16.46 is identical to equation 16.44 with β0 set to 0 and vi* = v̂i*. We can therefore regard 16.44 as a quasi-F, that is, an approximation to F. The likely benefit of this approach stems from the fact that the empirical weighted sums of squared residuals strongly influence the test, and these are somewhat robust to inaccurate estimates of τ².

The approach can be generalized to any linear hypothesis. Define H0: Cᵀβ = 0, where C is a (p + 1) × c contrast matrix, c being the number of linear contrasts of interest among the p + 1 regression parameters. The Knapp-Hartung approach defines the test

    F^*(c, k - p - 1) = \frac{ \hat{\beta}^T C \left[ C^T \left( \sum_{i=1}^{k} \hat{v}_i^{*-1} X_i X_i^T \right)^{-1} C \right]^{-1} C^T \hat{\beta} \, / \, c }{ \sum_{i=1}^{k} \hat{v}_i^{*-1} (T_i - X_i^T \hat{\beta})^2 / (k - p - 1) }.    (16.47)

The results of that approach (table 16.3) show that, regardless of the method of point estimation, the quasi-F produces noticeably more conservative inferences (smaller t-ratios, wider confidence intervals) than does the conventional approach. These may be more realistic than the conventional inferences.
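A minimal sketch in Python of the quasi-F computation for the simple random effects model (equations 16.44 and 16.45), again assuming the T, v, and tau2 arrays used earlier; scipy is used only for the F reference distribution:

    import numpy as np
    from scipy import stats

    def quasi_f_simple(T, v, tau2):
        """Quasi-F test of H0: beta0 = 0 and the implied Knapp-Hartung
        standard error for the simple random effects model."""
        k = len(T)
        w = 1.0 / (tau2 + v)
        beta0 = np.sum(w * T) / np.sum(w)
        q_hat = (np.sum(w * (T - beta0)**2) / np.sum(w)) / (k - 1)   # eq. 16.45
        F = beta0**2 / q_hat                                         # eq. 16.44
        p = stats.f.sf(F, 1, k - 1)
        return F, p, np.sqrt(q_hat)   # sqrt(q_hat) is the SE of footnote b, table 16.3

Under the RMLE weights, √q̂ ≈ 0.063 for these data, the quasi-F standard error reported in table 16.3.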
16B.3 Huber-White Variances

The conventional approach and the quasi-F approach rely on parametric assumptions about the distribution of the random effects. In particular, in deriving tests and confidence intervals, we have assumed that these are normally distributed with constant variance τ². For example, we have assumed that the random effects variance is constant at every X = weeks of prior contact, an assumption that may be false. The Huber-White approach produces standard errors that do not rely on these assumptions (White 1980). Consider the regression of T on X using any known weights wi. The estimator

    \hat{\beta} = \left( \sum_{i=1}^{k} w_i X_i X_i^T \right)^{-1} \sum_{i=1}^{k} w_i X_i T_i

has variance

    Var(\hat{\beta}) = \left( \sum_{i=1}^{k} w_i X_i X_i^T \right)^{-1} \left[ \sum_{i=1}^{k} w_i^2 X_i Var(T_i) X_i^T \right] \left( \sum_{i=1}^{k} w_i X_i X_i^T \right)^{-1}    (16.48)

regardless of the true form of Var(Ti). If in fact wi⁻¹ = Var(Ti) = vi*, (16.48) will reduce to the conventional expression for the variance, that is,

    Var(\hat{\beta}) = \left( \sum_{i=1}^{k} v_i^{*-1} X_i X_i^T \right)^{-1}.

The Huber-White approach avoids this step, substituting the empirical estimator V̂ar(Ti) = (Ti − Xiᵀβ̂)², thus avoiding assumptions about the variance of the random effects and yielding the expression

    \widehat{Var}_{HW}(\hat{\beta}) = \left( \sum_{i=1}^{k} w_i X_i X_i^T \right)^{-1} \left[ \sum_{i=1}^{k} w_i^2 (T_i - X_i^T \hat{\beta})^2 X_i X_i^T \right] \left( \sum_{i=1}^{k} w_i X_i X_i^T \right)^{-1}.    (16.49)

This expression is familiar in the literature on generalized estimating equations as the sandwich estimator, where the conventional expression for the variance is the bread on each side of 16.49 (Liang and Zeger 1986).

The results, shown in table 16.3, suggest that the Huber-White and conventional inferences are very similar for our data, regardless of the method of point estimation. Apparently, parametric assumptions about the variance of the random effects are not strongly affecting our inferences. A problem with the Huber-White approach is that the theory behind it requires large k. Some simulation studies show that these inferences often are sound even for modest k, but more research is needed (Cheong, Fotiu, and Raudenbush 2001).
16.7 NOTES

1. Of course, many studies are not actually based on random samples of subjects from well-defined populations. Nevertheless, the investigators use standard inferential statistical procedures. By doing so, each investigator is implicitly viewing his or her sample as a random sample from a hypothetical population or universe of possible observations that supply the target of any generalizations that arise from the statistical inferences.

2. At either stage, it is conceivable to view the population as either finite or infinite. Of course, it is hard to imagine that an infinite number of studies have been conducted on a particular topic. Nevertheless, in practice, populations at both levels are viewed as infinite. This simplifies analysis because one need not use finite population correction factors. Perhaps more important, the view of populations as infinite is consistent with the notion of generalizing to a universe of possible observations (Cronbach et al. 1972) and is therefore consistent with standard scientific aims.

3. The estimate is best in the sense that it has minimum variance within the class of unbiased estimators.

4. See section 16.6 for a discussion and illustration of alternative estimation procedures.

5. In the present case, we find that the likelihood for τ² is quite symmetric about the mode, so that the empirical Bayes inferences and the fully Bayes inferences are quite similar (for more discussion, see Rubin 1981; Morris 1983).

6. Of course τ² is not known to be zero. Nonzero values of τ² are plausible. We are also conditioning on point estimates of β̂0, β̂1 in computing these plausible value intervals and posterior variances. A fully Bayesian approach would reflect the uncertainty about τ² as well as about β0, β1 in the posterior intervals for δi (Raudenbush and Bryk 2002, chapter 13).

7. Let fF(T | β, τ²) be the density of the data (equivalent to the FMLE likelihood). Formulate a prior distribution p(β | γ, Ω), where γ is the prior mean of β and Ω is the prior variance-covariance matrix. Then

    f_R(T \mid \tau^2) = \lim_{\Omega^{-1} \to 0} \int f_F(T \mid \beta, \tau^2) \, p(\beta \mid \gamma, \Omega) \, d\beta.

The idea is that one is treating β in a fully Bayesian fashion but conditioning on fixed τ². However, there is essentially no prior information about β, so that its prior precision Ω⁻¹ is essentially null. Equation 16.36 involves the log density of the data given only the variance τ², averaging over all possible values of β, so that inferences about τ² fully take into account the uncertainty about β. Alternatively, Raudenbush and Bryk provided a classical interpretation in which fR(T | τ²) = fF(T | β̂, τ²)/g(β̂ | τ²), where β̂ is given by equation 16.31 (1985).
16.8 REFERENCES

Casella, George, and Roger Berger. 1990. Statistical Inference, 2nd ed. Pacific Grove, Calif.: Wadsworth.
Cheong, Yuk F., Randall P. Fotiu, and Stephen W. Raudenbush. 2001. "Efficiency and Robustness of Alternative Estimators for Two- and Three-Level Models: The Case of NAEP." Journal of Educational and Behavioral Statistics 26(4): 411–29.
Cronbach, Lee J., Goldine C. Gleser, Harinder Nanda, and Nageswari Rajaratnam. 1972. The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles. New York: John Wiley & Sons.
De Finetti, Bruno. 1964. "Foresight: Its Logical Laws, Its Subjective Sources." In Studies in Subjective Probability, edited by H.E. Kyburg Jr. and H.E. Smokler. New York: John Wiley & Sons.
Dempster, Arthur P., Donald B. Rubin, and Robert K. Tsutakawa. 1981. "Estimation in Covariance Components Models." Journal of the American Statistical Association 76(374): 341–53.
DerSimonian, Rebecca, and Nan M. Laird. 1986. "Meta-Analysis in Clinical Trials." Controlled Clinical Trials 7(3): 177–88.
Hartung, Joachim, and Guido Knapp. 2001. "On Tests of the Overall Treatment Effect in Meta-Analysis with Normally Distributed Responses." Statistics in Medicine 20(12): 1771–82.
Hedges, Larry V. 1981. "Distribution Theory for Glass's Estimator of Effect Size and Related Estimators." Journal of Educational Statistics 6(2): 107–28.
———. 1983. "A Random Effects Model for Effect Sizes." Psychological Bulletin 93: 388–95.
Hedges, Larry V., and Ingram Olkin. 1983. "Regression Models in Research Synthesis." American Statistician 37: 137–40.
———. 1985. Statistical Methods for Meta-Analysis. Orlando, Fla.: Academic Press.
Higgins, Julian P. T., and Simon G. Thompson. 2002. "Quantifying Heterogeneity in a Meta-Analysis." Statistics in Medicine 21(11): 1539–58.
Hunter, John E., and Frank L. Schmidt. 1990. Methods of Meta-Analysis: Correcting Error and Bias in Research Findings. Beverly Hills, Calif.: Sage Publications.
Kalaian, Hripsime, and Stephen W. Raudenbush. 1996. "A Multivariate Mixed Linear Model for Meta-Analysis." Psychological Methods 1(3): 227–35.
Knapp, Guido, and Joachim Hartung. 2003. "Improved Tests for a Random Effects Meta-Regression with a Single Covariate." Statistics in Medicine 22(17): 2693–2710.
Liang, Kung-Yee, and Scott L. Zeger. 1986. "Longitudinal Data Analysis Using Generalized Linear Models." Biometrika 73(1): 13–22.
Morris, Carl N. 1983. "Parametric Empirical Bayes Inference: Theory and Applications." Journal of the American Statistical Association 78(381): 47–65.
Raudenbush, Stephen W. 1983. "Utilizing Controversy as a Source of Hypotheses in Meta-Analysis: The Case of Teacher Expectancy on Pupil IQ." In Evaluation Studies Review Annual. Beverly Hills, Calif.: Sage Publications.
———. 1984. "Magnitude of Teacher Expectancy Effects on Pupil IQ as a Function of the Credibility of Expectancy Induction: A Synthesis of Findings from 18 Experiments." Journal of Educational Psychology 76(1): 85–97.
———. 1988. "Educational Applications of Hierarchical Linear Models." Journal of Educational and Behavioral Statistics 13(2): 85–116.
Raudenbush, Stephen W., and Anthony S. Bryk. 1985. "Empirical Bayes Meta-Analysis." Journal of Educational and Behavioral Statistics 10(2): 241–69.
———. 2002. Hierarchical Linear Models: Applications and Data Analysis Methods. Thousand Oaks, Calif.: Sage Publications.
Rosenthal, Robert, and Lenore Jacobson. 1968. Pygmalion in the Classroom. New York: Holt, Rinehart and Winston.
Rosenthal, Robert, and Donald B. Rubin. 1978. "Interpersonal Expectancy Effects: The First 345 Studies." Behavioral and Brain Sciences 3: 377–86.
———. 1986. "Meta-Analytic Procedures for Combining Studies with Multiple Effect Sizes." Psychological Bulletin 99(3): 400–406.
Rubin, Donald B. 1981. "Estimation in Parallel Randomized Experiments." Journal of Educational and Behavioral Statistics 6(4): 377–401.
Ryan, William. 1971. Blaming the Victim. New York: Vintage.
Seltzer, Michael. 1993. "Sensitivity Analysis for Fixed Effects in the Hierarchical Model: A Gibbs Sampling Approach." Journal of Educational and Behavioral Statistics 18(3): 207–35.
White, Halbert. 1980. "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity." Econometrica 48(4): 817–38.
17

CORRECTING FOR THE DISTORTING EFFECTS OF STUDY ARTIFACTS IN META-ANALYSIS

FRANK L. SCHMIDT
University of Iowa

HUY LE
University of Central Florida

IN-SUE OH
University of Iowa

CONTENTS

17.1 Artifacts that Distort Observed Study Results 318
17.1.1 Unsystematic Artifacts 318
17.1.1.1 Sampling Errors 318
17.1.1.2 Data Errors and Outliers 318
17.1.2 Systematic Artifacts 319
17.1.2.1 Single Artifacts 319
17.1.2.2 Multiple Artifacts 321
17.1.2.3 Numerical Illustration 322
17.2 Correcting for Attenuation-Induced Biases 324
17.2.1 Population Correlation: Attenuation and Disattenuation 324
17.2.2 Sample Correlation 324
17.3 Meta-Analysis of Corrected Correlations and Software 324
17.3.1 The Mean Corrected Correlation 325
17.3.2 Corrected Versus Uncorrected Correlations 325
17.3.3 Variance of Corrected Correlations: Procedure 325
17.4 Artifact Distribution Meta-Analysis and Software 327
17.4.1 Mean of the Corrected Correlations 328
17.4.1.1 Meta-Analysis of Attenuated Correlations 328
17.4.1.2 Correcting the Mean Correlation 328
17.4.1.3 The Mean Compound Multiplier 328
17.4.2 Correcting the Standard Deviation 329
17.4.2.1 Variance of Artifact Multiplier 330
17.4.2.2 Decomposition of the Variance of Correlations 331
17.5 Summary 331
17.6 Notes 332
17.7 References 332
17.1 ARTIFACTS THAT DISTORT OBSERVED STUDY RESULTS

Every study has imperfections, many of which bias the results. In some cases we can define precisely what a methodologically ideal study would be like, and thus say that the effect size obtained from any real study will differ to some extent from the value that would have been obtained had the study been methodologically perfect. Although it is important to estimate and eliminate bias in individual studies, it is even more important to remove such errors in research syntheses such as meta-analyses.

Some authors have argued that meta-analysts should not correct for study imperfections because the purpose of meta-analysis is to provide only a description of study findings, not an estimate of what would have been found in methodologically ideal studies. However, the errors and biases that stem from study imperfections are artifactual: they stem from imperfections in our research methods, not from the underlying relationships that are of scientific interest (Rubin 1990). Thus, scientific questions are better addressed by estimates of the results that would have been observed had studies been free of methodological biases (Cook et al. 1992; Hunter and Schmidt 2004, 30–32; Rubin 1990; Schmidt 1992). For example, in correlation research the results most relevant to the evaluation of a scientific theory are those that would be obtained from a study using an infinitely large sample from the relevant population (that is, the population itself) and measures of the independent and dependent variables that are free of measurement error and are perfectly construct valid. Such a study would be expected to provide an exact estimate of the relation between constructs in the population of interest; such an estimate is maximally relevant to the testing and evaluation of scientific theories, and to theory construction. Thus corrections for biases and other errors in study findings due to study imperfections, which we call artifacts, are essential to developing valid cumulative knowledge. The increasing use of estimates from meta-analysis as input into causal modeling procedures (see, for example, Colquitt, LePine, and Noe 2002; Becker and Schram 1994) further underlines the importance of efforts to ensure that meta-analysis findings are free of correctable bias and distortion. In the absence of such corrections, the results of path analyses and other causal modeling procedures are biased in complex and often unpredictable ways.

Most artifacts we are concerned with have been studied in the field of psychometrics. The goal is to develop methods of calibrating each artifact and correcting for its effects. The procedures for correcting for these artifacts can be complex, but software for applying them is available (Schmidt and Le 2004). The procedures summarized here are more fully detailed in another work, where they are presented for both the correlation coefficient and the standardized mean difference (the d value statistic) (Hunter and Schmidt 2004, chapters 2, 3). This chapter focuses on applying them to correlations. The procedures described here can also be applied to other effect size statistics, such as odds ratios and related risk statistics, but these applications have not been fully explicated.

17.1.1 Unsystematic Artifacts

Some artifacts produce a systematic effect on the study effect size and some cause unsystematic (random) effects. Even within a single study, it is sometimes possible to correct for a systematic effect, though doing so usually requires special information. Unsystematic effects usually cannot be corrected in single studies and sometimes may not be correctable even at the level of meta-analysis. The two major unsystematic artifacts are sampling error and data errors.

17.1.1.1 Sampling Errors  It is not possible to correct for the effect of sampling error in a single study. The confidence interval gives an idea of the potential size of the sampling error, but the magnitude of the sampling error in any one study is unknown and hence cannot be corrected. However, the effects of sampling error can be greatly reduced or eliminated in meta-analysis if the number of studies (k) is large enough to produce a large total sample size, because sampling errors are random and average out across studies. If the total sample size in the meta-analysis is not large, one can still correct for the effects of sampling error, though the correction is less precise and some smaller amount of sampling error will remain in the final meta-analysis results (see Hunter and Schmidt 2004, chapter 9). A meta-analysis that corrects only for sampling error and ignores other artifacts is often called a bare-bones meta-analysis.
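A small simulation, not from the chapter, illustrates why sampling errors average out; the true correlation rho, the number of studies k, and the per-study sample size n below are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    rho, k, n = 0.25, 100, 60          # true correlation, studies, N per study

    r = np.empty(k)
    for i in range(k):
        x = rng.standard_normal(n)
        y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)
        r[i] = np.corrcoef(x, y)[0, 1]   # observed study correlation

    print(r.min(), r.max())   # individual studies scatter widely around rho...
    print(r.mean())           # ...but their mean across studies is close to 0.25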
17.1.1.2 Data Errors and Outliers  Bad data in meta-analysis stem from a variety of errors in handling data. Primary data used in a study may be erroneous due to transcription errors, coding errors, and so on; the initial results of the analysis of the primary data in a particular study may be incorrect due to computational errors, transcriptional errors, computer program errors, and so on; the study results as published may have errors caused by transcriptional error by the investigator, by a typist, or by a printer; or a meta-analyst may miscopy a result or make a computational error. Data errors are apparently very common (Gulliksen 1986; Tukey 1960). Sometimes such errors can be detected and eliminated using outlier analysis, but outlier analysis can be problematic in meta-analysis because it is often impossible to distinguish between data errors and large sampling errors, and deletion of data with large sampling errors can bias corrections for sampling error (Hunter and Schmidt 2004, 196–97).

17.1.2 Systematic Artifacts

Many artifacts have a systematic influence on study effect size parameters and their estimates. If such an effect can be quantified, often there is an algebraic formula for the effect of the artifact. Most algebraic formulas can be inverted, producing a correction formula. The resulting correction removes the bias created by the artifact and estimates the effect size that would have been obtained had the researcher carried out a study without the corresponding methodological limitation.

Correction for an artifact requires knowledge about the size of the effect of that artifact. Correction for each new artifact usually requires at least one new piece of information. For example, to correct for the effects of random error of measurement in the dependent variable, we need to know the reliability of the dependent variable in the primary studies. Many primary studies do not present information on the artifacts in the study, but often this information (for example, scale reliability) is available from other sources. Even when artifact information is presented in the study, it is not always of the required type. For example, in correcting for the influence of measurement error, it is important to use the appropriate type of reliability coefficient. Use of an inappropriate coefficient will lead to a correction that is at least somewhat erroneous (Hunter and Schmidt 2004, 99–103), usually an undercorrection.

There are at least ten systematic artifacts that can be corrected if the artifact information is available. The correction can be made within each study individually if the information is available for all, or nearly all, studies individually. If so, then the meta-analysis is performed on these corrected values, as described in section 17.3 of this chapter. If not, the correction can be made at the meta-analysis level if the distribution of artifact values across studies can be estimated, as described in section 17.4.

As pointed out elsewhere in this volume, study effect sizes can be expressed in a variety of ways, with the two most frequently used indices being the correlation coefficient and the standardized mean difference (the d value and variations thereof). For ease of explication, artifact effects and corrections are discussed in this chapter in terms of correlations. The same principles apply to standardized mean differences, though it is often more difficult to make appropriate corrections for artifacts affecting the independent variable in true experiments (see Hunter and Schmidt 2004, chapters 6, 7, and 8).

17.1.2.1 Single Artifacts  Most artifacts attenuate the population correlation ρ. The amount of attenuation depends on the artifact. For each artifact, it is possible to present a conceptual definition that makes it possible to quantify the influence of the artifact on the observed effect size. For example, the reliability of the dependent variable calibrates the extent to which there is random error of measurement in the measure of the dependent variable. The reliability, and hence the artifact parameter that determines the influence of measurement error on effect size, can be empirically estimated. Journal editors should require authors to furnish these artifact values but often do not. Most of the artifacts cause a systematic attenuation of the correlation; that is, the study correlation is lower than the actual correlation by some amount. This attenuation is most easily expressed as a product in which the actual correlation is multiplied by an artifact multiplier, usually denoted a.

We denote the actual (unattenuated) population correlation by ρ and the (attenuated) study population correlation by ρo. Because we cannot conduct the study perfectly (that is, without measurement error), this study imperfection systematically biases the actual correlation parameter downward. Thus the study correlation ρo is smaller than the actual correlation ρ.

We denote by ai the artifact value for the study, expressed in the form of a multiplier. If the artifact parameter is expressed by a multiplier ai, then

    \rho_o = a_i \rho,    (17.1)

where ai is some fraction, 0 < ai ≤ 1. The size of ai depends on the artifact: the greater the error, the smaller the value of ai. In the developments that follow, these artifacts are described as they occur in correlation studies. However, each artifact has a direct analogue in experimental studies (for more detail on these analogues, see Hunter and Schmidt 2004, chapters 6–8).
Attenuation artifacts and the corresponding multiplier Example: supervisor ratings of job performance,
are as follows: a5  0.72 (Viswesvaran, Schmidt, and Ones 1996);
mean interrater reliability of supervisor rating is
1. Random error of measurement in dependent vari-
0.52; the correlation between the supervisor
____ rating
able Y:
and job performance true scores is 0.52  0.72
___ (see also Rothstein 1990).
a1  rYY ,
o  0.72, a 28 percent reduction.
where rYY is the reliability of the measure of Y.
Example: rYY  0.49, implies a1  0.70, o  0.70,
6. Imperfect construct validity of the independent vari-
a 30 percent reduction.
able X. Construct validity is defined as in (5) above:
2. Random error of measurement in independent vari-
able X: a6  the construct validity of X.
___
a2  rXX , Example: use of a perceptual speed measure to mea-
sure general cognitive ability, a6  0.65 (the true
where rXX is the reliability of the measure of X. score correlation between perceptual speed and gen-
Example: rXX  0.81, implies a2  0.90, o  0.90, eral cognitive ability is 0.65).
a 10 percent reduction.
3. Artificial dichotomization of continuous dependent o  0.65, a 35 percent reduction.
variable split into proportions p and q:
7. Range restriction on the independent variable X.
____
a3  biserial constant  (c)/(pq) , Range restriction results from systematic exclusion
of certain scores on X from the sample compared
___
where (x)  ex /2/2 is the unit normal density
2
with the relevant (or reference) population:
function and where c is the unit normal distribution
cut point corresponding to a split of p. That is, a7 depends on the standard deviation (SD) ratio,
c  1( p), where (x) is the unit normal cumulative uX  (SDX study population)/
distribution function (Hunter and Schmidt 1990). (SDX reference population).
Example: Y is split at the median, p  q  0.5,
a3  0.80, o  0.80, a 20 percent reduction. For example, the average value of uX among em-
ployees for general cognitive ability has been found
4. Artificial dichotomization of continuous indepen- to be 0.67.
dent variable split into proportions p and q: Range restriction can be either direct, explicit
____ truncation of the X distribution, or indirect, partial
a4  biserial constant  (c)/(pq) , or incomplete truncation of the X distribution, usu-
ally resulting from selection on some variable cor-
where c is the unit normal distribution cut point cor- related with X. Volunteer bias in study subjects, a
responding to a split of p and (x) is the unit normal form of self-selection can produce indirect range
density function. That is, c    1( p) (Hunter and restriction. Most range restriction in real data is in-
Schmidt 1990). direct (Hunter, Schmidt, and Le 2006). The order in
Example: X is split such that p  0.9 and q  0.1. which corrections are made for range restriction
Then a4  0.60, o  0.60, a 40 percent reduction. and measurement error differs for direct and indi-
5. Imperfect construct validity of the dependent vari- rect range restriction. A major development since
able Y. Construct validity is the correlation of the the first edition of this handbook is the discovery of
dependent variable measure with the actual depen- practical methods of correcting for indirect range
dent variable construct: restriction, both in individual studies and in meta-
analysis (Hunter and Schmidt 2004, chapter 5;
a5  the construct validity of Y. Hunter, Schmidt, and Le 2006). These methods are
CORRECTING FOR THE DISTORTING EFFECTS OF STUDY ARTIFACTS 321

included in the Frank Schmidt and Huy Le (2004) Complication: The size of the multiplier depends
software. Correcting for direct range restriction on the size of .
when the restriction has actually been indirect re- ____________
sults in estimates that are downwardly biased, typi- a8  uY / (uY2 2 1 2) .
cally by about 25 percent. Example: For   0.20 and uY  0.83, a8  0.84,
Complication: For both types of range restriction, o  0.84, a 16 percent reduction.
the size of the multiplier depends on the size of . Note: Correction of the same correlation for
For direct range restriction the formula is: range restriction on both the independent and de-
____________ pendent variables is complicated and requires spe-
a7  uX / (u2X 2 1 2) .
cial formulas (for a discussion of this problem, see
Hunter and Schmidt 2004, 3941).
Example: For   0.20 and uX  0.67, a7  0.68.
o  0.68, a 32 percent reduction. 9. Bias in the correlation coefficient. The correlation
For indirect range restriction the formula is: has a small negative bias:
____________
a7  uT / (u2T 2 1 2) a9  1 (1 2)/(2N  2).

where uT is the ratio of restricted to unrestricted true For sample sizes of twenty or more, bias is smaller
score standard deviations (computed using either than rounding error. Thus, bias is usually trivial in
equation 3.16 in Hunter and Schmidt 2004, 106; or size, as illustrated in the following example.
equation 22 in Hunter, Schmidt, and Le 2006): Example: For   0.20 and N68, a9 0.9997,
o  0.9997, a .03 percent reduction.
uT2  {uX2  (1 rXXa)}/rXXa. 10. Study-caused variation (covariate-caused con-
founds).
Note that in this equation, rXXa is the reliability of Example: Concurrent validation studies con-
the independent variable X estimated in the unre- ducted on ability tests evaluate the job performance
stricted group. John Hunter, Schmidt, and Le used of workers who vary in job experience, whereas ap-
subscript a to denote values estimated in the appli- plicants all start with zero job experience. Job expe-
cant (unrestricted) group, and subscript i for the in- rience correlates with job performance, even holding
cumbent (restricted) group (2006). We use the same ability constant.
notation in this chapter. Solution: Use partial correlation to remove the
Example: For   0.20 and uT  0.56, a7  0.57. effects of unwanted variation in experience.
Specific case: Study done in new plant with very
o  0.57, a 43 percent reduction.
low mean job experience (for example, mean  2
8. Range restriction on the dependent variable Y. Range years). Correlation of experience with ability is zero.
restriction results from systematic exclusion of cer- Correlation of job experience with job performance
tain scores on Y from the sample in comparison to is 0.50. Comparison of the partial correlation to the
_______
the relevant population: zero order correlation shows that a10  1 .502 
0.87. o  0.87, a 13 percent reduction.
a8 depends on the SD ratio, 17.1.2.2 Multiple Artifacts Suppose that the study
uY  (SDY study population)/ correlation is affected by several artifacts with parame-
(SDY reference population). ters, a1, a2, a3, . . . . The first artifact reduces the actual
correlation from  to
Example: Some workers are fired early for poor per-
formance and are hence underrepresented in the in- o1  a1.
cumbent population. Assume that exactly the bottom
20 percent of performers are fired, a condition of di- The second artifact reduces that correlation to
rect range restriction. Then from normal curve cal-
culations, uY  0.83. o2  a2 o1  a2(a1)  a1a2 .
322 STATISTICALLY COMBINING EFFECT SIZES

The third artifact reduces that correlation to The Actual Study. In the actual study, the independent
variable X  score on an imperfect test of general cogni-
o3  a3 o2  a3(a1a2 )  a1a2a3 , and so on. tive ability, and dependent variable Y  rating by one im-
mediate supervisor. The correlation rXY is computed on a
Thus, the joint effect of the artifacts is to multiply the small sample of range-restricted workers (incumbents)
population correlation by all the multipliers. For m arti- hired by the company based on information available at
facts the time of hire.
Impact of Restriction in Range. Range restriction bi-
om  (a1a2a3 . . . am) ases the correlation downward. All studies by the U.S.
Employment Service were conducted in settings in which
or
the General Aptitude Test Battery (GATB) had not been
o  A, (17.2) used to select workers, so this range restriction is indirect.
When range restriction is indirect, its attenuating effects
where A is the compound artifact multiplier equal to the occur before the attenuating effects of measurement error
product of the individual artifact multipliers occur (Hunter, Schmidt, and Le 2006). The average ex-
tent of restriction in range observed for the GATB was
A  a1a2a3 . . . am. found to be (Hunter 1980):
uX  (SDX incumbent population)/
17.1.2.3 A Numerical Illustration We now illustrate
(SDX applicant population)  0.67.
the impact of some of these artifacts using actual data
from a large program of studies of the validity of a per- This observed range restriction uX translates into a true
sonnel selection test. The quantitative impact is large. score uT value of 0.56 based on equation 3.16 in Hunter
Observed correlations can be less than half the size of the and Schmidt (2004) or equation 22 in Hunter, Schmidt,
estimated population correlations based on perfect mea- and Le (2006) presented in the previous section (this cal-
sures and computed on the relevant (unrestricted refer- culation is based on the reliability of the GATB in the
ence) population. We give two illustrations: the impact on unrestricted population, rXX , which is 0.81). Further using
a
the average correlation between general cognitive ability the equation provided earlier in section 17.1.2.1, we ob-
and job performance ratings and the impact of variation tain the value of a7  .70, a 30 percent reduction.
in artifacts across studies. Impact of Measurement Error. The value of rXX, the
A meta-analysis of 425 validation studies conducted reliability of the GATB, is 0.81 in the unrestricted popu-
by the U.S. Employment Service shows that for medium- lation, but in the restricted group this value is reduced to
complexity jobs, the average applicant population corre- 0.58 (computed using equation 27 in Hunter, Schmidt,
lation between true scores on general cognitive ability and Le 2006: rXX 1 (1 rXX )/uX2).
i a
and true scores job performance ratings is 0.73 (Hunter, Consequently, the attenuation multiplier due to____ mea-
Schmidt, and Le 2006).1 We now show how this value is surement error in the independent variable is a2  0.58 
reduced by study artifacts to an observed mean correla- 0.76. The value of rYY, the reliability of the dependent
tion of 0.27. variable (supervisory ratings of job performance), is .50
The Ideal Study. Ideally, each worker would serve in the restricted group (Viswesvaran, Ones, and Schmidt
under a population of judges (raters), so that idiosyncrasy 1996), which translates into the attenuation multiplier of
____
of judgment could be eliminated by averaging ratings 0.71 (a1  0.50  0.71).
across judges. Let P  consensus (average) rating of per- Taken together, the combined attenuating effect due to
formance by a population of judges, and let A  actual indirect range restriction and measurement error in both
cognitive ability. The correlation rAP would then be com- the independent and dependent variable measures is:
puted on an extremely large sample of applicants hired at
random from the applicant population. Hence, there A  a1a2a7  (0.71)(0.76)(0.70)  0.38.
would be no unreliability in either measure, no range re- Hence the expected value of the observed r is:
striction, and virtually no sampling error. The Hunter,
Schmidt, and Le meta-analysis indicates that the obtained rXY  ArAP  a1a2a7rAP
correlation would be 0.73 (2006) .  (0.71)(0.76)(0.70)(0.73)  0.27.2
CORRECTING FOR THE DISTORTING EFFECTS OF STUDY ARTIFACTS 323

The total impact of study limitations was to reduce the Based on the equation, observed reliabilities for the
mean population correlation from 0.73 to 0.27, a reduc- long test will vary from to 0.56 (when uT  0.55) to 0.67
tion of 63 percent. (when uT  0.69); for the short test, reliabilities will be
Now we present numerical examples of the effects of 0.22 (when uT  0.55) and 0.32 (when uT  0.69). The
variation in artifacts across studies. Again, here we as- corresponding correlations would be:
sume a correlation of 0.73 in the applicant population for ____
perfectly measured variables. Even if there was no true 1. High range restriction ratio, long test: rXY  0.67
variation in this value across studies (for example, em- (0.60)  0.49,
____
ployers), variation in artifact values would produce sub- 2. High range restriction ratio, short test: rXY  0.32
stantial variation in observed correlations even in the (0.60)  0.33,
____
absence of sampling error. 3. Low range restriction ratio, long test: rXY  0.56
Indirect Range Restriction. None of the firms whose (0.51)  0.38,
data were available for the study used the GATB in hir- ____
ing. Different firms used a wide variety of hiring meth- 4. Low range restriction ratio, short test: rXY  0.22
ods, including cognitive ability tests other than the GATB. (0.51)  0.24.
Thus, range restriction on the GATB is indirect. Suppose Criterion reliability: one rater versus two raters. Sup-
that the composite hiring dimension used (symbolized as pose that in some studies one supervisor rates job perfor-
S) correlated on average 0.90 with GATB true scores (that mance and in other studies there are two raters. Inter-rater
is, ST  0.90, see Hunter, Schmidt, and Le 2006, figure 1). reliability is 0.50 for one rater and (by the Spearman-
Further assume that some firms are very selective, accept- Browne formula) is 0.67 for ratings based on the average
ing only the top 5 percent on their composite (uS  0.37; of two raters. Criterion reliability would then be either
calculated based on the procedure presented in Schmidt, rYY  0.50 or rYY  0.67. Consequently, the observed cor-
Hunter, and Urry 1976), while some others are relatively relations would be:
lenient, selecting half of the applicants (uS  0.60). Ex-
plicit selection on S results in range restrictions on true 1. High range
____ restriction ratio, long test, two raters:
score T of the test (X), which effect varies from uT  0.55 rXY  0.67 (0.49)  0.40,
(when uS  0.37) to uT  0.69 (when uS  0.60). These 2. High range
____ restriction ratio, long test, one rater:
calculations are based on equation 18 of Hunter, Schmidt, rXY  0.50 (0.49)  0.35,
and Le (2006):
3. High range
____ restriction ratio, short test, two raters:
rXY  0.67 (0.33)  0.27,
u   u   1.
2
T
2 2
STa S
2
STa
4. High range
____ restriction ratio, short test, one rater:
This creates an attenuating effect ranging from a7  rXY  0.50 (0.33)  0.23,
0.69 (31 percent reduction) to a7  0.82 (18 percent re- 5. Low range
____ restriction ratio, long test, two raters:
duction). The correlations would then vary from rXY  0.51 rXY  0.67 (0.38)  0.31,
to rXY  0.60.
6. Low range
____ restriction ratio, long test, one rater:
Predictor Reliability. Suppose that each study is con-
rXY  0.50 (0.38)  0.27,
ducted using either a long or a short ability test. Assume
that the reliability for the long test is rXX  0.81 and that 7. Low range
____ restriction ratio, short test, two raters:
for the short test is 0.49 in the unrestricted population. rXY  0.67 (0.24)  0.20,
Due to indirect range restriction, reliabilities of the tests 8. Low range
____ restriction ratio, short test, one rater:
in the samples will be reduced. The following equation rXY  0.50 (0.24)  0.17.
(based on equations 25 and 26 in Hunter, Schmidt, and Le
2006) allows calculation of the restricted reliability (rXXi) Thus, variation in artifacts produces variation in study
from the unrestricted reliability (rXXa) and the range re- population correlations. Instead of one population corre-
striction ratio on T (uT): lation of 0.73, we have a distribution of attenuated popu-
lation correlations: 0.17, 0.20, 0.23, 0.27, 0.27, 0.31, 0.35
u2T rXX and 0.40. Each of these values would be the population
rXX  _____________
a

u2 r 1 r
i
T XXa XXa correlation underlying a particular study. To that value
324 STATISTICALLY COMBINING EFFECT SIZES

random sampling error would then be added to yield the The corrected sample correlation is
correlation observed in the study. In this example, we
have assumed a single underlying population value of rc  ro /A, (17.6)
0.73. If there were variation in population correlations
prior to the introduction of these artifacts, that variance where A is the compound artifact multiplier. The sam-
would be increased because the process illustrated here pling error in the corrected correlation is related to the
applies to each value of the population correlation. population correlation by

17.2. CORRECTING FOR ATTENUATION-INDUCED rc  ro /A  ( o  e)/A (17.7)


BIASES  ( o /A)  (e/A)

17.2.1 The Population Correlation: Attenuation    e.


and Disattenuation
That is, the corrected correlation differs from the actual
The population correlation can be exactly corrected for effect size correlation by only sampling error e, where
the effect of any artifact. The exactness of the correction the new sampling error e is given by
follows from the absence of sampling error. Because
e  e/A. (17.8)
o  A,
Because A is less than 1, the sampling error e is larger
we can reverse the equation algebraically to obtain than the sampling error e. This can be seen in the sam-
pling error variance
  o /A. (17.3)
Var (e)  Var (e)/A2. (17.9)
Correcting the attenuated population correlation pro-
duces the value the correlation would have had if it had
However, because the average error e is essentially zero,
been possible to conduct the study without the method-
the average error e is also essentially zero (for a fuller
ological limitations produced by the artifacts. To divide
discussion, see Hunter and Schmidt 2004, chapter 3).
by a fraction is to increase the value. That is, if artifacts
reduce the study population correlation, then the corre-
sponding disattenuated (corrected) correlation must be 17.3 META-ANALYSIS OF CORRECTED
larger than the observed correlation. CORRELATIONS AND SOFTWARE
The meta-analysis methods described in this section and
17.2.2 The Sample Correlation
in section 17.4 are all based on random effects models
The sample correlation can be corrected using the same (see chapter 16 in this volume). These procedures are
formula as for the population correlation. This eliminates implemented in the Schmidt and Le (2004) Windows-
the systematic error in the sample correlation, but it does based software and have been shown in simulation stud-
not eliminate the sampling error. In fact, sampling error is ies to be accurate (for example, see Field 2005; Hall and
increased by the correction. Brannick 2002; Law, Schmidt, and Hunter 1994a; 1994b;
The sample study correlation relates to the (attenuated) Schulze 2004).
population correlation by If study artifacts are reported for each study, then for
each study, we have three numbers. For study i we have
ro  o  e, (17.4)
ri  the ith study correlation;
where e is the sampling error in ro (the observed correla- Ai  the compound artifact multiplier for study i; and
tion). To within a close approximation, the average error Ni  the sample size for study i.
is zero (Hedges 1989) and the sampling error variance is
We then compute for each study the disattenuated cor-
Var (e)  (1 o2)2/(N 1). (17.5) relation: rci  the disattenuated correlation for study i. (If
CORRECTING FOR THE DISTORTING EFFECTS OF STUDY ARTIFACTS 325

some of the studies do not provide some of the artifact 17.3.2 Corrected Versus Uncorrected
information, the usual practice is to fill in this missing Correlations
information with average values from the other studies.)
Two meta-analyses can be computed: one on the biased In some research domains, the artifact values for most
(attenuated) study correlations (the bare-bones meta-anal- individual studies are not presented in those studies. As
ysis) and one on the corrected (unbiased) correlations. a result, some published meta-analyses do not correct
for artifacts. Failure to correct means that the mean un-
corrected correlation will be downwardly biased as an
17.3.1 The Mean Corrected Correlation estimate of the actual (unattenuated or construct level)
As both Frank Schmidt and John Hunter (1977) and Larry correlation. The amount of bias in a meta-analysis of un-
Hedges and Ingram Olkin (1985) have noted, large-sam- corrected correlations will depend on the extent of error
ple studies contain more information than small-sample caused by artifacts in the average study. This average ex-
studies and thus should be given more weight. So studies tent of systematic error is measured by the average com-
are often weighted by sample size (or the inverse of their pound multiplier Ave(A).
sampling error variance, which is nearly equivalent). For To a close statistical approximation, the mean cor-
corrected correlations, Hunter and Schmidt recommended rected correlation Ave(rc) relates to the mean uncorrected
a more complicated weighting formula that takes into ac- correlation Ave(r) in much the same way as does an indi-
count the other artifact values for the studyfor example, vidual corrected correlation. Just as for a single study
the more measurement error there is in the measures of rc  r/A, (17.12)
the variables, the less the information there is in the study
(2004, chapter 3). Thus, a high-reliability study should be so to a close approximation we have for a set of studies
given more weight than a low-reliability study. Hedges
and Olkin advocated a similar weighting procedure (1985, Ave(rc)  Ave(r)/Ave(A). (17.13)
13536).
Hunter and Schmidt (2004) recommended that the Thus, to a close approximation, the difference in find-
weight for study i should be ings of an analysis that does not correct for artifacts and
one that does is the difference between the uncorrected
wi  Ni A2i , (17.10) mean correlation Ave(r) and the corrected mean correla-
tion Ave(rc) (Hunter and Schmidt 2004, chapter 4).
where Ai is the compound artifact multiplier for study i.
The average correlation can be written 17.3.3 Variance of Corrected Correlations:
Procedure
Ave(r)  wiri / wi, (17.11)
The variance of observed correlations greatly overstates
where the variance of population correlations. This is true for
corrected correlations as well as for uncorrected correla-
wi  1 for the unweighted average; tions. From the fact that the corrected correlation is
wi  Ni for the sample size weighted average; and
rci  i  ei, (17.14)
wi  Ni A2i for the full artifact weighted average (applied
to corrected correlations). where rci and i are the corrected sample and population
If the number of studies were infinite (so that sampling correlations, respectively, and ei is the sampling error, we
error would be completely eliminated), the resulting mean have the decomposition of variance
would be the same regardless of which weights were used.
Var (rc)  Var ( )  Var (e).
But for a finite number of studies, meta-analysis does not
totally eliminate sampling error; there is still some sam- Thus, by subtraction, we have an estimate of the de-
pling error left in the mean correlation. Use of the full ar- sired variance
tifact weights described here minimizes sampling error in
the mean corrected correlation. Var ()  Var (rc)  Var (e). (17.15)
326 STATISTICALLY COMBINING EFFECT SIZES

The variance of study corrected correlations is the 5. The meta-analysis of disattenuated correlations has
weighted squared deviation of the ith correlation from the four steps:
mean correlation.
_
If we denote the average corrected cor- a. Compute the mean corrected correlation using
relation by rc then weights wi  Ni A2i : Mean corrected correlation 
Ave(rc).
_
rc  Ave(rc)  wirci / w , i (17.16) b. Compute the variance of corrected correlations
_
using wi  Ni Ai2: Variance of corrected correla-
Var(rc)  wi (rci  r c)2 / w .
i (17.17) tions  Var(rc).
c. Compute the sampling error variance Var (e) by
The sampling error variance is computed by averaging averaging the individual study sampling error
the sampling error variances of the individual studies. variances:
The error variance of the individual study depends on the Var (e)  Ave(vi). (17.20)
size of the uncorrected population correlation. To esti-
mate that number,_ we first compute the average uncor- d. Now compute the estimate of the variance of
rected correlation r. population correlations by correcting for sam-
pling error: Var ()  Var (rc)  Var (e).
_
r  Ave(r)  wiri / w i 6. The final fundamental estimates (the average and
the standard deviation (SD) of ) are
where the wi are the sample sizes Ni.
For study i, we have Ave( )  Ave(rc) (17.21)

_ and
Var (ei)  vi  (1  r 2)2/(Ni  1), (17.18)
______
SD  Var () . (17.22)
and
As mentioned earlier, software is available for conduct-
Var(ei)  vi  Var (ei) /A2i  vi /A2i . ing these calculations (Schmidt and Le 2004). A proce-
dure similar to the one described here has been presented
For simplicity, denote the study sampling error vari- by Nambury Raju and his colleagues (1991; for simpli-
ance Var(ei) by vi. The weighted average error variance fied examples of application of this approach to meta-
for the meta-analysis is the average analysis see Hunter and Schmidt 2004, chapter 3; for
published meta-analyses of this sort, see Carlson et al.
Var(e)  wivi / w i (17.19) 1999; Judge et al. 2001; and Rothstein et al. 1990).
As an example, in the Kevin Carlson et al. study (1999),
where the wi  Ni A2i . a previously developed weighted biodata form was cor-
Procedure. The specific computational procedure in- related with promotion or advancement (with years of
volves six steps: experience controlled) for 7,334 managers in twenty-four
organizations. Thus, there were twenty-four studies, with
1. Given for each study, ri  uncorrected correlation, a mean N per study of 306. The reliability of the depen-
Ai  compound artifact multiplier, and Ni  sample dent variablerate of advancement or promotion rate
size. was estimated at 0.90. The standard deviation (SD) of the
2. Compute for each study, rci  corrected correlation, independent variable was computed in each organization,
and wi  the proper weight to be given to rci. and the SD of applicants (that is, the unrestricted SD) was
known, allowing each correlation to be corrected for range
3. To estimate the effect of sampling error,
_
compute variation. In this meta-analysis, there was no between
the average uncorrected correlation r. This is done studies variation in correlations due to variation in mea-
using weights wi  Ni. surement error in the independent variable-because the
4. For each study compute the sampling error vari- same biodata scale was used in all twenty-four studies.
ance: vi  the sampling error variance. Also, in this meta-analysis, the interest was in the effec-
CORRECTING FOR THE DISTORTING EFFECTS OF STUDY ARTIFACTS 327

tiveness of this particular biodata scale in predicting man- artifact. Meta-analysis can be conducted in such domains,
agerial advancement. Hence, mean  was not corrected though the procedures are more complex.
for unreliability in the independent variable. Simplified examples of application of artifact distribu-
The results were as follows: tion meta-analysis can be found in Hunter and Schmidt
(2004, chapter 4). Many published meta-analyses have
Ave(ri )  0.48, been based on these artifact distribution meta-analysis
methods (see Hunter and Schmidt 2004, chapters 1 and 4).
Ave( )  0.53,
A subset of these have been conducted on correlation
Var ( )  Var (rc)  Var (e) coefficients representing the validities of various kinds
of predictors of job performanceusually tests, but also
 0.00462  0.00230  0.00232
interviews, ratings of education and job experience, as-
sessment centers, and others. The implications of the find-
and
ings of these meta-analyses for personnel selection prac-
______ tices have been quite profound (see Schmidt, Hunter, and
SD  Var ()  0.048.
Pearlman 1980; Pearlman, Schmidt, and Hunter 1980;
Schmidt and Hunter 1981; Schmidt et al. 1985; Schmidt,
Thus, the mean operational validity of this scale across
Hunter, and Raju 1988; Schmidt, Ones, and Hunter 1992;
organization was estimated as 0.53, with a standard devi-
Schmidt and Hunter 1998, 2003; Schmidt, Oh, and Le
ation of 0.048. After correcting for measurement error,
2006; McDaniel, Schmidt, and Hunter 1988; Le and
range variation, and sampling error, there is only a small
Schmidt 2006; McDaniel et al. 1994). Artifact distribu-
apparent variability in correlation across organizations. If
tion meta-analyses have also been conducted in a variety
we assume a normal distribution for , the value at the
of other research areas, such as role conflict, leadership,
tenth percentile is 0.53  1.28  0.048  0.47. Thus, the
effects of goal setting, and work-family conflict. More
conclusion is that the correlation is at least 0.47 in 90 per-
than 100 such nonselection meta-analyses have appeared
cent of these (and comparable) organizations. The value
in the literature to date.
at the ninetieth percentile is 0.59, yielding an 80 percent
The artifact distribution meta-analysis procedures de-
credibility interval of 0.47 to 0.59, indicating that an esti-
scribed in this section are implemented in the Schmidt
mated 80 percent of population values of validity lie in
and Le software (2004), though the methods used in these
this range. Although the computation procedures used are
programs to estimate the standard deviation of the popu-
not described in this chapter, confidence intervals can be
lation corrected correlations (discussed in section 17.4.2)
placed around the mean validity estimate. In this case the
are slightly different from those described here. Although
95 percent confidence interval for the mean is 0.51 to
the methods used in the Schmidt-Le programs for this
0.55. The reader should note that confidence intervals and
purpose are slightly more accurate, as shown in simula-
credibility intervals are different and serve different pur-
tion studies, than those described in the remainder of this
poses (Hunter and Schmidt 2004, 2057). Confidence
chapter (Hunter and Schmidt 2004, chapter 4), they are
intervals refer only to the estimate of the mean, while cred-
also much more complexin fact, too complex to de-
ibility intervals are based on the estimated distribution of
scribe easily in a chapter of this sort. Similar artifact dis-
all of the population correlations. Hence confidence inter-
tribution methods for meta-analysis have been presented
vals are based on the estimated standard error of the mean,
by John Callender and Hobart Osburn (1980) and Nam-
and credibility intervals are based on the estimated stan-
bury Raju and Michael Burke (1983). In all these meth-
dard deviation of the population correlations.
ods, the key assumption in considering artifact distribu-
tions is independence of artifact values across artifacts.
17.4 ARTIFACT DISTRIBUTION META-ANALYSIS This assumption is plausible for the known artifacts in
AND SOFTWARE research domains that have been examined. The basis for
independence is the fact that the resource limitations that
In most contemporary research domains, the artifact val- produce problems with one artifact (for example, range
ues are not provided in many of the studies. Instead, arti- restriction) are generally different and hence independent
fact values are presented only in a subset of studies, usu- of those that produce problems with another (for example,
ally a different but overlapping subset for each individual measurement error in scales used) (Hunter and Schmidt
328 STATISTICALLY COMBINING EFFECT SIZES

2004, chapter 4). Artifact values are also assumed to be The sample attenuated correlation roi for each study is
independent of the true score correlation i. In a computer related to the attenuated population correlation for that
simulation study, Nambury Raju and his colleagues found study by
that violation of these independence assumptions has
minimal effect on meta-analysis results unless the arti- roi  oi  eoi ,
facts are correlated with the i, a seemingly unlikely event
(1998). where eoi  the sampling error in study i (which is un-
We use the following notation for the correlations as- known).
sociated with the ith study: 17.4.1.1 Meta-Analysis of Attenuated Correlations
The meta-analysis uses the additivity of means to pro-
i  the true (unattenuated) study population correla-
duce
tion;
rci  the study sample corrected correlation that can be Ave(roi)  Ave(oi  eoi)  Ave(oi)  Ave(eoi).
computed if artifact information is available for the
study so that corrections can be made; If the number of studies is large, the average sampling
roi  uncorrected (observed) study sample correlation; error will tend to zero and hence
and
Ave(roi)  Ave(oi)  0  Ave(oi ).
oi  uncorrected (attenuated) study population corre-
lation. Thus, the bare-bones estimate of the mean attenuated
In the previous section (section 3), we assumed that ar- study population correlation is the expected mean attenu-
tifact information is available for every (or nearly every) ated study sample correlation.
study individually. Thus, an estimate rci of the true correla- 17.4.1.2 Correcting the Mean Correlation The at-
tion i can be computed for each study and meta-analysis tenuated population correlation for study i is related to the
can be conducted on these estimates. In this section, we disattenuated correlation for study i by oi  Ai i, where
assume that artifact information is missing for many or Ai  the compound artifact multiplier for study i.
most studies. However, we assume that the distribution Thus, the mean attenuated correlation is given by
(or at least the mean and variance) of artifact values can be
estimated for each artifact. The meta-analysis then pro- Ave(oi )  Ave(Ai i). (17.23)
ceeds in two steps:
Because we assume that artifact values are indepen-
A bare bones meta-analysis is conducted, yielding dent of the size of the true correlation, the average of the
estimates of the mean and standard deviation of at- product is the product of the averages:
tenuated study population correlations. A bare bones
meta-analysis is one that corrects only for sampling Ave(Ai i)  Ave(Ai)Ave(i). (17.24)
error.
The mean and standard deviation from the bare bones Hence, the average attenuated correlation is related to
meta-analysis are then corrected for the effects of ar- the average disattenuated correlation by Ave(oi)  Ave(Ai)
tifacts other than sampling error. Ave(i), where Ave(Ai)  the average compound multi-
plier across studies.
17.4.1 Mean of the Corrected Correlations Note that we need not know all the individual study
artifact multipliers; we need only know the average. If the
The attenuated study population correlation oi is related average multiplier is known, then the corrected mean cor-
to the actual study population correlation i by the for- relation is
mula
oi  Ai i, Ave(i )  Ave(roi)/Ave(Ai). (17.25)

where Ai  the compound artifact multiplier for study i 17.4.1.3 The Mean Compound Multiplier To esti-
(which is unknown for most studies). mate the average compound multiplier, it is sufficient to
CORRECTING FOR THE DISTORTING EFFECTS OF STUDY ARTIFACTS 329

be able to estimate the average for each single artifact Callender and Osburn (1980) and is the least complicated
multiplier separately. This follows from the independence method to present. Other methods have also been pro-
of artifacts. To avoid double subscripts, let us denote the posed and used, however. Nambury Raju and Burke
separate artifact multipliers by a, b, c, . . . The compound (1983) and John Hunter, Schmidt, and Le (2006) have
multiplier A is then given by the product presented Taylor Series-based methods for use under
conditions of direct and indirect range restriction, respec-
A  abc . . . tively. Still another procedure is the interactive procedure
(Schmidt, Gast-Rosenberg, and Hunter 1980; Hunter and
Because of the independence of artifacts, the average Schmidt 2004, chapter 4); simulation studies have sug-
product is the product of averages: gested that the interactive procedure (with certain refine-
ments) is slightly more accurate than other procedures
Ave(Ai)  Ave(a)Ave(b)Ave(c) . . . (17.26) (Law, Schmidt, and Hunter 1994a, 1993b; Le and Schmidt
2006), and so it is the procedure incorporated in the
Thus, the steps in estimating the compound multiplier Schmidt and Le software (2004). However, by the usual
are as follows: standards of social science research, all three procedures
are accurate.
1. Consider the separate artifacts.
The meta-analysis of uncorrected correlations provides
a. Consider the first artifact a. For each study that
an estimate of the variance of attenuated study population
includes a measurement of the artifact magni-
correlations. However, these study population correla-
tude, denote the value ai. Average those values to
tions are themselves uncorrected; that is, they are biased
produce Ave(ai)  average of attenuation multi-
downward by the effects of artifacts. Furthermore, the
plier for first artifact.
variation in artifact levels across studies causes the study
b. Consider the second artifact b. For each study
correlations to be attenuated by different amounts in dif-
that includes a measurement of the artifact mag-
ferent studies. This produces variation in the size of the
nitude, denote the value bi. Average those values
study correlations that could be mistaken for variation
to produce Ave(bi)  average of attenuation mul-
due to a real moderator variable, as we saw in our nu-
tiplier for second artifact.
merical example in section 17.1.2.3. Thus the variance of
c. Similarly, consider the other separate artifacts c,
population study correlations computed from a meta-
d, and so on, that produce estimates of the aver-
analysis of uncorrected correlations is affected in two dif-
ages Ave(ci), Ave(di) . . . .
ferent ways. The systematic artifact-induced reduction in
The accuracy of these averages depends on the the magnitude of the study correlations tends to decrease
assumption that the available artifacts are a reason- variability, while at the same time variation in artifact
ably representative sample of all artifacts (for a dis- magnitude tends to increase variability across studies.
cussion of this assumption, see Hunter and Schmidt Both sources of influence must be removed to accurately
2004). Note that even if this assumption is not fully estimate the standard deviation of the disattenuated cor-
met, results will still tend to be more accurate than relations across studies.
those from a meta-analysis that does not correct for Let us begin with notation. The study correlation free
artifacts. of study artifacts (the disattenuated correlation) is de-
2. Compute the product noted i , and the compound artifact attenuation factor for
study i is denoted Ai. The attenuated study correlation, oi
Ave(Ai)  Ave(a)Ave(b)Ave(c)Ave(d) . . . . (17.27) is computed from the disattenuated study correlation by

oi  Ai i . (17.28)
17.4.2 Correcting the Standard Deviation
The study sample correlation roi departs from the disat-
The method described in this section for estimating the tenuated study population correlation oi by sampling
standard deviation of the population true score correla- error ei defined by
tions is the multiplicative method (Hunter and Schmidt
2004, chapter 4). This approach was first introduced by roi  oi  ei  Aii  e. (17.29)
330 STATISTICALLY COMBINING EFFECT SIZES

Consider now a bare-bones meta-analysis on the un- The right-hand side of equation 17.36 has four num-
corrected correlations. We know that the variance of sam- bers:
ple correlations is the variance of population correlations Var (oi ): the population correlation variance from the
added to the sampling error variance. That is, meta-analysis of uncorrected correlations, estimated
Var (roi)  Var (oi)  Var (ei). (17.30) using equation 17.31;
__
 : the mean of disattenuated study population correla-
Because the sampling error variance can be computed tions, estimated using equation 17.25;
by statistical formula, we can subtract it to yield __
A: the mean compound attenuation factor, estimated
Var (oi)  Var (roi)  Var (ei). (17.31) from equation 17.26; and
Var (Ai ): the variance of the compound attenuation fac-
That is, the meta-analysis of uncorrected correlations tor. This quantity has not yet been estimated.
produces an estimate of the variance of attenuated study
population correlations, the actual study correlations after 17.4.2.1 Variance of the Artifact Multiplier How do
they have been reduced in magnitude by the study imper- we compute the variance of the compound attenuation
fections. factor, Ai? We are given the distribution of each compo-
At the end of a meta-analysis of uncorrected (attenu- nent attenuation factor, which must be combined to pro-
ated) correlations, we have the variance of attenuated duce the variance of the compound attenuation factor.
study population correlations Var ( oi), but we want the The key to this computation lies in two facts: (a) that the
variance of actual disattenuated correlations Var ( i). The compound attenuation factor is the product of the compo-
relationship between them is nent attenuation factors and (b) that the attenuation fac-
tors are assumed to be independent. That is, because
Var (oi)  Var (Ai i). (17.32) Ai  aibici . . . , the variance of Ai is

We assume that Ai and i are independent. A formula Var (Ai)  Var (aibi ci . . . ). (17.37)
for the variance of this product is given in Hunter and
Schmidt (2004, chapter 4). Here we simply use this for- The variance of the compound attenuation factor is the
mula. Let us__ denote the average disattenuated study cor- variance of the product of independent component atten-
relation by  __
and denote the average compound attenua- uation factors. The formula for the variance of this prod-
tion factor by A. The variance of the attenuated correlations uct is derived in Hunter and Schmidt (2004, chapter 4,
is given by 14849). Here we simply use the result.
__ __
For each separate artifact, we have a mean and a stan-
Var (Ai i)  A 2Var (i)  2Var (Ai ) dard deviation for that component attenuation factor.
From the mean and the standard deviation, we can com-
 Var (i)Var (Ai). (17.33)
pute the coefficient of variation, which is the standard
Because the third term on the right is negligibly small, deviation divided by the mean. For our purposes here, we
to a close approximation need a symbol for the squared coefficient of variation:
__ __
Var (Ai i)  A 2Var (i)  2Var (Ai). (17.34) cv  [SD/Mean]2. (17.38)

We can then rearrange this equation algebraically to For each artifact, we now compute the squared coeffi-
obtain the desired equation for the variance of actual cient of variation. For the first artifact attenuation factor
study correlations free of artifact effects: a, we compute
__ __
Var (i)  [Var (Ai i )  2Var (Ai)] / A 2. (17.35) cv1  Var (a)/[Ave(a)]2. (17.39)

That is, starting from the meta-analysis of uncorrected For the second artifact attenuation factor b, we com-
correlations, we have pute
__ __
Var (i)  [Var (oi)  2Var (Ai)] / A 2. (17.36) cv2  Var (b)/[Ave(b)]2. (17.40)
CORRECTING FOR THE DISTORTING EFFECTS OF STUDY ARTIFACTS 331

For the third artifact attenuation factor c, we compute and


__
cv3  Var(c)/[Ave(c)]2, (17.41) Var(Ai)  A 2CVT. (17.46)

and so on. Thus, we compute a squared coefficient of That is,


variation for each artifact. These are then summed to __ __ __
form a total Var (roi)  A2Var ( i)   2A2CVT  Var (ei) (17.47)
S1  S2  S3.
CVT  cv1  cv2  cv3  . . . . (17.42)
__ where S1 is the variance in uncorrected correlations pro-
Recalling that A denotes the mean compound attenua-
duced by the variation in actual unattenuated effect size
tion factor, we write the formula for the variance of the
correlations, S2 is the variance in uncorrected correla-
compound attenuation factor (to a close statistical ap-
tions produced by the variation in artifact levels and S3
proximation) as the product
is the variance in uncorrected correlations produced by
__
Var (Ai)  A 2CVT. (17.43) sampling error.
In this decomposition, the term S1 contains the esti-
We now have all the elements needed to estimate the mated variance of effect size correlations. This estimated
variance in actual study correlations Var(i). The final variance is corrected for those artifacts that were cor-
formula is rected in the meta-analysis. For reasons of feasibility, this
__ does not usually include all the artifacts that affect the
__
Var (i)  [Var (oi)  2Var (Ai)] / A 2 study value. Thus, S1 is an upper-bound estimate of the
__ __ __ component of variance in uncorrected correlations due to
 [Var (oi)  2 A 2 CVT ] / A 2. (17.44) real variation in the strength of the relationship and not
due to artifacts of the study design. To the extent that
The square root of this value is the SD. Hence we now
there are uncorrected artifacts, S1 will overestimate the
have two main results of the meta-analysis: the mean of
real variation; it may even greatly overestimate that varia-
the corrected correlations, from equation 17.25; and the
tion. As long as there are uncorrected artifacts, there will
standard deviation of the correction correlations, from
be artifactual variation in study correlations produced by
equation 17.44. Using these values, we can again com-
variation in those uncorrected artifacts.
pute credibility intervals around the mean corrected cor-
relation, as illustrated at the end of section 17.3.3. Also,
using the methods described in Hunter and Schmidt (2004, 17.5 SUMMARY
2057) we can compute confidence intervals around the
The methods presented in this chapter for correcting
mean corrected correlation.
meta-analysis results for sampling error and biases in in-
For data sets for which the meta-analysis methods de-
dividual study results might appear complicated, but the
scribed in section 17.3 can be applied (that is, correlations
reader should be aware that more elaborated and extended
can be corrected individually), the artifact distribution
descriptions of these procedures are presented in Hunter
methods described in this section can also be applied.
and Schmidt (2004) and that Windows-based software
When this is done, the results are virtually identical, as
(Schmidt and Le 2004) is available to apply these meth-
expected. Computer simulation studies also indicate that
ods in meta-analysis. It is also important to bear in mind
artifact distribution meta-analysis methods are quite ac-
that without these corrections, for biases and sampling
curate (for example, see Le and Schmidt 2006).
error the results of meta-analysis do not estimate the con-
17.4.2.2 Decomposition of the Variance of Correla-
struct-level population relationships that are of greatest
tions Inherent in the derivation in section 17.4.2.1 is a
scientific and theoretical interest (Rubin 1990; Schmidt
decomposition of the variance of uncorrected (observed)
and Hunter 2004, chapters 1, 14). Hence, in meta-analysis,
correlations. We now present that decomposition:
these corrections are essential to the development of valid
cumulative knowledge about relations between constructs.
Var (roi)  Var (oi)  Var (ei),
__ __
This is especially important in light of recent develop-
Var (oi)  Var (Ai i)  A 2Var (i)   2Var (Ai), (17.45) ments concerning the use of meta-analytic results in the
332 STATISTICALLY COMBINING EFFECT SIZES

testing of causal models. Meta-analytic results are increas- Hedges, Larry V. 1989. An Unbiased Correction for Sam-
ingly being used as input to path analyses and other causal pling Error in Validity Generalization Studies. Journal of
modeling techniques. Without appropriate corrections for Applied Psychology 74(3): 46977.
measurement error, range restriction, and other artifacts Hedges, Larry, and Ingram Olkin. 1985. Statistical Methods
that bias meta-analysis results, these causal modeling pro- for Meta-Analysis. Orlando, Fl.: Academic Press.
cedures will produce erroneous results. Hunter, John. 1980. Validity Generalization for 12,000 Jobs:
An Application of Synthetic Validity and Validity Generali-
17.6 NOTES zation to the General Aptitude Test Battery (GATB).
Washington, D.C.: U.S. Department of Labor.
1. Table 3 in Hunter, Schmidt, and Le (2006) shows the Hunter, John, and Frank Schmidt. 1990. Dichotomization
value of operational validity for medium complexity of Continuous Variables: The Implications for Meta-Analy-
job is 0.66, which was obtained from the true score sis. Journal of Applied Psychology 75(3): 33449.
correlation of 0.73. . 2004. Methods of Meta-Analysis: Correcting Error
2. The value here is slightly different from .26 shown in and Bias in Research Findings, 2nd ed. Thousand Oaks,
Hunter, Schmidt, and Le (2006) due to rounding of Calif.: Sage Publications.
values beyond the second decimal place. Hunter, John, Frank Schmidt, and Huy Le. 2006. Implica-
tions of Direct and Indirect Range Restriction for Meta-
17.7 REFERENCES Analysis Methods and Findings. Journal of Applied Psy-
chology 91(3): 594612.
Becker, Betsy, and Christine Schram. 1994. Examining Ex- Judge, Timothy, Carl Thoresen, Joyce Bono, and Gregory
planatory Models Through Research Synthesis. In The Patton. 2001. The Job Satisfaction-Job Performance Rela-
Handbook of Research Synthesis, edited by Harris Cooper tionship: A Qualitative and Quantitative Review. Psycho-
and Larry Hedges. New York: Russell Sage Foundation. logical Bulletin 127(3): 376407.
Callender, John, and Hobart Osburn. 1980. Development Law, Kenneth, Frank Schmidt, and John Hunter. 1994a.
and Test of a New Model for Validity Generalization. Nonlinearity of Range Corrections in Meta-Analysis: A
Journal of Applied Psychology 65(5): 54358. Test of an Improved Procedure. Journal of Applied Psy-
Carlson, Kevin, Steve Scullen, Frank Schmidt, Hannah chology 79(3): 42538.
Rothstein, and Frank Erwin. 1999. Generalizable Biogra- Law, Kenneth, Frank Schmidt, and John Hunter 1994b. A
phical Data Validity: Is Multi-Organizational Development Test of Two Refinements in Meta-Analysis Procedures.
and Keying Necessary? Personnel Psychology 52(3): Journal of Applied Psychology 79(6): 97886.
73156. Le, Huy, and Frank Schmidt 2006. Correcting for Indirect
Colquitt, Jason, Jeffrey LePine, and Raymond Noe. 2002. Range Restriction in Meta-Analysis: Testing a New Analy-
Toward an Integrative Theory of Training Motivation: A tic Procedure. Psychological Methods 11(4): 41638.
Meta-Analytic Path Analysis of 20 Years of Research. McDaniel, Michael, Debra Whetzel, Frank Schmidt, and
Journal of Applied Psychology 85(5): 678707. Steven Maurer. 1994. The Validity of Employment Inter-
Cook, Thomas, Harris Cooper, David Cordray, Heidi Hart- views: A Comprehensive Review and Meta-Analysis.
man, Larry Hedges, Richard Light, Thomas Louis, and Fre- Journal of Applied Psychology 79(4): 599616.
drick Mosteller. 1992. Meta-Analysis for Explanation: A McDaniel, Michael, Frank Schmidt, and John Hunter 1988.
Casebook. New York: Russell Sage Foundation. A Meta-Analysis of the Validity of Methods for Rating
Field, Andy. 2005. Is the Meta-Analysis of Correlation Training and Experience in Personnel Selection. Person-
Coefficients Accurate when Population Correlations Vary? nel Psychology 41(2): 283314.
Psychological Methods 10(4): 44467. Pearlman, Kenneth, Frank Schmidt, and John Hunter 1980.
Gulliksen, Harold. 1986. The Increasing Importance of Validity Generalization Results for Tests Used to Predict
Mathematics in Psychological Research (Part 3). The Job Proficiency and Training Criteria in Clerical Occupa-
Score 9: 15. tions. Journal of Applied Psychology 65(4): 373407.
Hall, Steven, and Michael Brannick. 2002. Comparison of Raju, Nambury, and Michael Burke. 1983. Two New Proce-
Two Random Effects Methods of Meta-Analysis. Journal dures for Studying Validity Generalization. Journal of Ap-
of Applied Psychology 87(2): 37789. plied Psychology 68(3): 38295.
CORRECTING FOR THE DISTORTING EFFECTS OF STUDY ARTIFACTS 333

Raju, Nambury, Michael Burke, Jacques Normand, and . 2003. History, Development, Evolution, and Im-
George Langlois. 1991. A New Meta-Analysis Approach. pact of Validity Generalization and Meta-Analysis Methods
Journal of Applied Psychology 76(3): 43246. 19752001. In Validity Generalization: A Critical Re-
Raju, Nambury, Tobin Anselmi, Jodi Goodman, and Adrian view, edited by Kevin Murphy. Mahwah, N.J.: Lawrence
Thomas. 1998. The Effects of Correlated Artifacts and Erlbaum.
True Validity on the Accuracy of Parameter Estimation in Schmidt, Frank, Deniz Ones, and John Hunter. 1992. Person-
Validity Generalization. Personnel Psychology 51(2): nel Selection. Annual Review of Psychology 43: 62770.
45365. Schmidt, Frank, Ilene Gast-Rosenberg, and John Hunter
Rothstein, Hannah, Frank Schmidt, Frank Erwin, William 1980. Validity Generalization Results for Computer Pro-
Owens, and Paul Sparks. 1990. Biographical Data in Em- grammers. Journal of Applied Psychology 65(6): 64361.
ployment Selection: Can Validities Be Made Generaliza- Schmidt, Frank, In-Sue Oh, and Huy Le. 2006. Increasing
ble? Journal of Applied Psychology 75(2): 17584. the Accuracy of Corrections for Range Restriction: Impli-
Rothstein, Hannah. 1990. Interrater Reliability of Job Per- cations for Selection Procedure Validities and Other Re-
formance Ratings: Growth to Asymptote with Increasing search Results. Personnel Psychology 59(2): 281305.
Opportunity to Observe. Journal of Applied Psychology Schmidt, Frank, John Hunter, and Kenneth Pearlman. 1980.
75(3): 32227. Task Difference and Validity of Aptitude Tests in selec-
Rubin, Donald B. 1990. A New Perspective on Meta-Ana- tion: A Red Herring. Journal of Applied Psychology 66(2):
lysis. In The Future of Meta-Analysis, edited by Kenneth 16685.
Wachter and Miron Straf. New York: Russell Sage Foun- Schmidt, Frank, John Hunter, and Nambury Raju. 1988.
dation. Validity Generalization and Situational Specificity: A Se-
Schmidt, Frank. 1992. What Do Data Really Mean? Re- cond Look at the 75% Rule and the Fisher z Transforma-
search Findings, Meta-Analysis, and Cumulative Knowledge tion. Journal of Applied Psychology 73(4): 66572.
in Psychology. American Psychologist 47(10): 117381. Schmidt, Frank, John Hunter, Kenneth Pearlman, and Han-
Schmidt, Frank, and Huy Le. 2004. Software for the Hunter- nah Hirsh. 1985. Forty Questions about Validity Generali-
Schmidt Meta-Analysis Methods. Iowa City: University of zation and Meta-Analysis. Personnel Psychology 38(4):
Iowa, Department of Management and Organizations. 697798.
Schmidt, Frank, and John Hunter. 1977. Development of a Schulze, Ralf. 2004. Meta-Analysis: A Comparison of Ap-
General Solution to the Problem of Validity Generaliza- proaches. Cambridge, Mass.: Hogrefe and Huber.
tion. Journal of Applied Psychology 62(5): 52940. Tukey, John. 1960. A Survey of Sampling from Contamina-
. 1981. Employment Testing: Old Theories and New ted Distributions. In Contributions to Probability and Sta-
Research Findings. American Psychologist 36(10): 1128 tistics, edited by Ingram Olkin. Stanford, Calif.: Stanford
37. University Press.
. 1998. The Validity and Utility of Selection Methods Viswesvaran, Chockalingam, Deniz Ones, and Frank Sch-
in Personnel Psychology: Practical and Theoretical Impli- midt. 1996. Comparative Analysis of the Reliability of Job
cations of 85 Years of Research Findings. Psychological Performance Ratings. Journal of Applied Psychology 81(5):
Bulletin 124(2): 26274. 55774.
18
EFFECT SIZES IN NESTED DESIGNS

LARRY V. HEDGES
Northwestern University

CONTENTS
18.1 Introduction 338
18.2 Designs with Two Levels 339
18.2.1 Model and Notation 339
18.2.1.1 Intraclass Correlation with One Level of Nesting 340
18.2.1.2 Primary Analyses 340
18.2.2 Effect Sizes in Two-Level Designs 340
18.2.3 Estimates of Effect Sizes in Two-Level Designs 341
18.2.3.1 Estimating W 341
18.2.3.2 Estimating B 342
18.2.3.3 Estimating T 343
18.2.3.4 Confidence Intervals for W, B, and T 343
18.2.4 Applications in Two-Level Designs 344
18.2.4.1 Effect Sizes for Individuals 344
18.2.4.2 Effect Sizes for Clusters 345
18.3 Designs with Three Levels 345
18.3.1 Model and Notation 345
18.3.1.1 Intraclass Correlations 347
18.3.1.2 Primary Analyses 347
18.3.2 Effect Sizes 347
18.3.3 Estimates of Effect Sizes in Three-Levels Designs 348
18.3.3.1 Estimating WC 348
18.3.3.2 Estimating WS 349
18.3.3.3 Estimating WT 350
18.3.3.4 Estimating BC 351
18.3.3.5 Estimating BS 351
18.3.3.6 Unequal Sample Sizes when Using dBC and dBS 351
18.3.3.7 Confidence Intervals for WC, WS, WT, BC, and BS 352

337
338 SPECIAL STATISTICAL ISSUES AND PROBLEMS

18.3.4 Applications at Three Levels 352


18.3.4.1 Effect Sizes When Individuals Are the Unit of Analysis 352
18.4 Can a Level of Clustering Safely Be Ignored? 353
18.5 Sources of Intraclass Correlation Data 354
18.6 Conclusions 355
18.7 References 355

18.1 INTRODUCTION bining them across studies in meta-analyses have received


less attention. Brenda Rooney and David Murray called
Studies with nested designs are frequently used to evalu- attention to the problem of effect-size estimation in clus-
ate the effects of social treatments, such as interventions, ter randomized trials and suggested that neither the con-
products, or technologies in education or public health. ventional estimates of effect sizes nor the formulas for
One common nested design assigns entire sitesoften their standard errors were appropriate in this situation
classrooms, schools, clinics, or communitiesto the same (1996). In particular, standard errors of effect-size esti-
treatment group, with different sites assigned to different mates will be profoundly underestimated if conventional
treatments. Experiments with designs of this type are also formulas are used. Allan Donner and Neil Klar suggested
called group randomized or cluster randomized designs that corrections for the effects of clustering should be in-
because sites such as schools or communities correspond troduced in meta-analyses of cluster randomized experi-
to statistical clusters. In experimental design terminology, ments (2002). Larry Hedges studied the problem of effect
these designs are designs involving clusters as nested fac- sizes in cluster randomized designs and computed their
tors. Nested factors are groupings of individuals that occur sampling distributions, including their standard errors
only within one treatment condition, such as schools or (2007, in press).
communities in designs that assign whole schools or com- The standardized mean difference, which is widely
munities to treatments. used as an effect-size index in educational and psycho-
Assigning entire groups to receive the same treatments logical research, is defined as the ratio of a difference
is sometimes done for the convenience of the investigator between treatment and control group means to a stan-
because it is easier to provide the same treatment to all dard deviation. In designs where there is no nesting, that
individuals in a group. In other situations, group assign- is, where there is no statistical clustering, the notion of
ment is used to prevent contamination of one treatment standardized mean difference is often unambiguous. In
group by elements of treatments assigned to others. In designs with two or more levels of nesting, such as clus-
this case, physical separation of groups (for example, dif- ter randomized trials, several possible standard devia-
ferent schools or communities) minimizes contamina- tions could be used to compute standardized mean dif-
tion. In other situations, the treatment is an administrative ferences, so the concept of this effect-size measure is
or group-based manipulation, making it conceptually dif- ambiguous.
ficult to imagine how the treatment could be assigned at This chapter has two goals. One is to provide defini-
an individual level. For example, treatments that involve tions of effect sizes that may be appropriate for cluster
changing teacher behavior cannot easily be assigned to randomized trials. The second is to provide methods to
individual students, nor can administrative or procedural estimate these effect sizes (and obtain standard errors for
reforms in a school or clinic easily be assigned to indi- them) from statistics that are typically given in reports of
viduals within a school or clinic. research (that is, without a reanalysis of the raw data). To
Although the analysis of cluster randomized designs ease the exposition, I consider studies with two levels of
has received considerable attention in the statistical liter- nesting (for example, students within schools or individu-
ature (for example, Donner and Klar 2000; Raudenbush als within communities) first, then consider effect sizes in
and Bryk 2002), the problem of representation of the re- studies with three levels of nesting (for example, students
sults of cluster randomized trials (and the corresponding within classrooms within schools or individuals within
quasi-experiments) in the form of effect sizes and com- plants within companies).
EFFECT SIZES IN NESTED DESIGNS 339
mT mC
18.2 DESIGNS WITH TWO LEVELS __ __ __ __
(Y  Y *T )2  ( Y iC  Y *C )2
T
i

SB2  i______________________
 1 i1
,
The most widely used designs, or at least the designs mT  mC  2
most widely acknowledged in analyses, are designs with __
one level of nestingso-called two-level designs. In such where Y *T is the (unweighted)__mean of the mT cluster means
designs, individuals are sampled by first sampling exist- in the treatment group, and Y *C is the (unweighted) mean
ing groups of individuals, such as classrooms, schools, of the mC cluster means in the control group. That is,
communities, or hospitals, and then individuals within mT __
__
the groups In such designs whole groups are assigned to
treatments.
1
Y  ___
T
YT
mT i  1 i
*
and
18.2.1 Model and Notation __ mC __
mC i
1
Y *C  ___ YC
To adequately define effect sizes, we need to first describe i1
the notation used in this section. Let YijT (i  1, . . . , mT; __
T
Note that when __ cluster sample sizes are unequal, Y *
j  1, . . . , niT ) and YijC (i  1, . . . , mC; j  1, . . . , niC ) need__not equal Y T, the grand mean of the treatment group
be the jth observation in the ith cluster in the treatment __
and Y *C need not equal Y C, the grand mean of the control
and control groups respectively, so that there are mT group. __However, when
clusters in the treatment group and mC clusters in the __ __ __ cluster sample sizes are all
equal, Y *T  Y T and Y *C  Y C.
control group, and a total of M  mT  mC. Thus the sam- Suppose that observations within the treatment and
ple size is control group clusters are normally distributed about
mT
cluster means iT and iC with a common within-cluster
N T  nTi variance W2. That is
i1

in the treatment group, YijT  N(iT, W2 ), i  1, . . . , mT; j  1, . . . , niT


mC
N C  nCi and
i1

in the control group, and the total sample size is YijC  N(iC, W2 ), i  1, . . . , mC; j  1, . . . , niC.
N  NT__ NC. __
Let Y iT (i 1, . . . , mT ) and Y iC (i  1, . . . , mC ) be the Suppose further that the clusters have random effects
means of the ith cluster in the __ treatment
__ and control (for example they are considered a sample from a popula-
groups, respectively, and let Y T and Y C be the overall tion of clusters) so that the cluster means themselves have
(grand) means in the treatment and control groups, re- a normal sampling distribution with means T and C and
spectively. Define the (pooled) within-cluster sample common variance B2. That is
variance SW2 via
nTi nCi
iT  N(T, B2 ), i  1, . . . , mT
mT __ mC __
(Y
ij  Y i )  ( Yij  Y i )
T

i1 j1
T 2

i1 j1
C C 2

and
SW2  __________________________
NM
iC  N(C, B2 ), i  1, . . . , mC.
and the total pooled within-treatment group variance
ST2 via Note that in this formulation, B2 represents true varia-
mT niT
__ mC niC
__ tion of the population means of clusters over and above the
(Y  Y T )2  (YijC  Y C )2
i1 j1
T
ij
i1 j1
variation in sample means that would be expected from
ST2  __________________________ . variation in the sampling of observations into clusters.
N2
These assumptions correspond to the usual assump-
Let SB be the pooled within treatment-groups standard tions that would be made in the analysis of a multi-site
deviation of the cluster means given by trial by a hierarchical linear models analysis, an analysis
340 SPECIAL STATISTICAL ISSUES AND PROBLEMS

of variance (with treatment as a fixed effect and cluster as clusters. Because the number of individuals typically
a nested random effect), or a t-test using the cluster means greatly exceeds the number of clusters, V(  W2 ) will be so
in treatment and control group as the unit of analysis. much smaller than V(  B2) that V(  W2 ) can be considered
In principle there are three different within-treatment negligible.
group standard deviations, B, W, and T, the latter de-
fined by 18.2.2 Effect Sizes in Two-Level Designs
T2  B2  W2 . In designs with one level of nesting (two-level designs),
three possible standardized mean difference parameters
In most educational data when clusters are schools, B2 correspond to the three different standard deviations. The
is considerably smaller than W2 . Obviously, if the between choice of one of these effect sizes should be determined
cluster variance B2 is small, then T2 will be very similar on the basis of the inference of interest to the researcher.
to W2 . If the effect-size measures are to be used in meta-analy-
18.2.1.1 Intraclass Correlation with One Level of sis, an important inference goal may be to estimate pa-
Nesting A parameter that summarizes the relationship rameters comparable with those that can be estimated in
between the three variances is called the intraclass cor- other studies. In such cases, the standard deviation may
relation , which is defined by be chosen to be the same kind of standard deviation used
in the effect sizes of other studies to which the study will
2 2
  _______
B
2  2
 ___
2
B
. (18.1) be compared. I focus on three effect sizes that seem likely
B W T
to be the most widely used.
The intraclass correlation  can be used to obtain one If W  0 (hence   1), one effect-size parameter has
of these variances from any of the others, given that the form
B2  T2 and W2  (1 )T2  (1 )B2 /.
T  C
The intraclass correlation describes the degree of clus- W  _______

W .

(18.2)
tering in the data. Later in this section, I show that the
variance of effect-size estimates in nested designs de- This effect size might be of interest, for example, in a
pends on the intraclass correlation, and how the intraclass meta-analysis where the other studies to which the cur-
correlation can be used to relate different possible effect rent study is compared are typically single site studies. In
sizes that can be computed in multilevel designs to one such studies W may (implicitly) be the effect size esti-
another. mated and hence W might be the effect size most compa-
18.2.1.2 Primary Analyses Many analyses reported rable with that in other studies.
in the literature fail to take the effects of clustering into A second effect-size parameter is of the form
account or do so but by analyzing cluster means as the
T  C
raw data. In either case, the treatment effect (the mean T _______

T .

(18.2)
difference between treatment and control groups), the
standard deviations SB, SW, and ST , and the sample sizes This effect size might be of interest in a meta-analysis
may be reported (or deduced from information that is re- where the other studies are multisite studies or sample
ported). In other cases the analysis reported may have from a broader population but do not include clusters in
taken clustering into account by treating clusters as ran- the sampling design. This would typically imply that they
dom effects. Such an analysis (for example using the pro- used an individual, rather than a cluster, assignment strat-
gram HLM, SAS Proc Mixed, or the STATA routine XT- egy. In such cases, T might be the most comparable with
Mixed) usually yields direct estimates of the treatment the effect sizes in other studies.
effect b and the variance components  B2 and  W2 and their If B  0 (and hence   0), a third possible effect-size
standard errors. This information can be used to calculate parameter would be
both the effect size and its approximate standard error.
T  C
Let  B2 and  W2 be the estimates of B2 and W2 and let V(b), B  _______

B .

(18.4)
V(  B2) and V(  W2 ) be their variances (the square of their
standard errors). Generally, V(  W2 ) depends primarily on This effect size is less likely to be of general interest,
the number of individuals and V(  B2) on the number of but it might be in cases where the treatment effect is de-
EFFECT SIZES IN NESTED DESIGNS 341

fined at the level of clusters and the individual observa- 18.2.3.1 Estimating W We start with estimation of
tions are of interest because the average defines an ag- W, which is the most straightforward. If  1 (so that
gregate property. For example, Anthony Bryk and Barbara W  0 and W is defined), the estimate
Schneider used the concept of trust in schools as an ag- __ __
Y T Y C
gregate property that is measured by taking means of re- dW  _______

S

(18.7)
W
ports from individuals (2000). The effect size B might be
the appropriate effect size to quantify the effect of an in- is a consistent estimator of W , which is approximately
tervention designed to improve school trust. The effect normally distributed about W . When cluster sample sizes
size B may also be of interest in a meta-analysis where are equal the (estimated) variance of dW is
the studies being compared are typically multisite studies  (n 1) d 2

that have been analyzed using cluster means as the unit of ( N N ) ( 1__________
NT  NC
vW  _______
T C 1  ) 2(N  M)
 ________ . W
(18.8)
analysis.
Note that the presence of the factor (1 ) in the de-
Note, however, that though W and T will often be
nominator of the first term is possible because W is de-
similar in size, B will typically be much larger (in the
fined only if   1.
same population) than either W or T, because B is typi-
Note that if   0 so that there is no clustering, equa-
cally considerably smaller than W and T. Note also that
tion 18.8 reduces to the variance of a mean difference
if all of the effect sizes are defined (that is, if 0    1),
divided by a standard deviation with (N  M) degrees of
and  is known, any one of these effect sizes may be ob-
freedom (see chapter 12, this volume). The leading term
tained from any of the others. In particular, both W and T
of the variance in equation 18.8 arises from uncertainty in
can be obtained from B and  because
the mean difference. Note that it is [1 (n 1)]/(1  )]
____
 T as large as would be expected if there were no clustering

W  B _____
1 
 ______
_____
1 
(18.5)
in the sample, that is, if   0. Thus [1  (n 1)]/(1 )]
is a kind of variance inflation factor for the variance of
and the effect-size estimate dW.
__ ____ When cluster sizes are not equal, the estimator dW of W
T  B   W 1  . (18.6) is the same as in the case of equal cluster sample sizes,
but the (estimated) variance of the estimator becomes
 ( 1) d 2
18.2.3 Estimates of Effect Sizes
in Two-Level Designs
(
vW  N
T  NC
_______
N TNC ) ( 1__________
1  )  2(N  M) ,
________W
(18.9)

where
Formulas for effect-size estimates, and particularly their mT mC
variances, are considerably simpler when sample sizes are NC ( niT )2 N T ( niC )2
equal than when they are unequal. Many studies are  ________
i1
 ________
i1 . (18.10)
planned with the intention of having equal sample sizes in N TN NCN
each cluster because this is simpler and maximizes statis- When all of the niT and niC are equal to n,  n and
tical power. Although the eventual data collected may not equation 18.9 reduces to equation 18.8.
have equal sample sizes, sizes are often rather close to When the analysis properly accounts for clustering by
being so. As a practical matter, exact cluster sizes are fre- treating the clusters as having random effects, and an esti-
quently not reported, so the research synthesist often has mate b of the treatment effect, its variance V(b), and an
access only to the average sample sizes and thus may have estimate  W2 of the variance component W2 is available (for
to proceed as if the sizes are equal. Unless they are pro- example, from an HLM analysis), the estimate of W is
foundly unequal, proceeding as if they are equal to some
b .
reasonable compromise value (such as the harmonic mean) dW  ___
 (18.11)
W
will yield reasonably accurate results. In the sections that
follow, I present results first when clusters are assumed to The approximate variance of the estimate given in
have equal sample size n, that is, when niT  n, i  1, . . . , equation 18.11 is
mT and niC  n, i  1, . . . , mC. In this case, N T  nmT, V(b)
vW  ____ . (18.12)
N C  nm C, and N  N T  N C  n(mT  m C)  nM.  2 W
342 SPECIAL STATISTICAL ISSUES AND PROBLEMS

18.2.3.2 Estimating B Estimating the other B and and


T is less intuitive than estimating W. For example, one mC

C (1/ni )
might expect that 1
n IC  ___ C .

__ __
m i1
 Y
T C
Y_______
SB
, When cluster sample sizes are unequal, the variance of
dB is approximately
would be that natural estimator of B, but this is not the
case. The reason is that SB is not a pure estimate of B,  (B 1) nBCdB2
because it is inflated by the within-cluster variability. In vB  _______
mTmC )(
mT  mC 1___________
( B  )
 _____________________
2(M  2)2[1  (nB  1)]
,
particular, the expected value of SB2 is (18.16)
2
B2  ___
n .
W
where
Whenever B  0 (so that   0 and B is defined), the m nI  m nI T C 1

intraclass correlation may be used to obtain an estimate ( C T


nB  ___________
M ) ,
of B using SB. When all the cluster sample sizes are equal
to n, a consistent estimate of B is C  (M  2)2  2[(mT 1) nIT  (mC 1) nIC ](1)
__ __ _________  [(mT 2)nIT2  (mC 2)nIC2  (
nIT )2  (nIC)2](1)2,
 (n 1)
Y  Y 1__________

T C
dB  _______
S n , (18.13) mT

T (1/ni ) ,
B
1
nIT2  ___ T 2
which is normally distributed in large samples with (esti- m i1
mated) variance
and
 (n 1) [1  (n 1)]dB2
mT  mC
(
vB  _______
mTmC )( 1__________
n  _____________
2n(M 2)
.) (18.14)
mC
1
nIC2  ___
mC
(1/niC ) 2.
i1
Note that the presence of  in the denominators of the
variance terms is possible since B is only defined if Note that when the n and niC are all equal to n,
T
i

  0.
The variance in equation 18.14 is [1  (n 1)]/n] nB  n, B  n,
as large as the variance of the standardized mean differ- n  n  1/n, nIT2  nIC2  1/n2,
T
I
C
I
ence computed from an analysis using cluster means as
the unit of analysis, that is, applying the usual formula and
for the variance of the standardized difference between
the means of the cluster means in the treatment and con- (M  2)[1  (n 1)]2
C  __________________
trol group. Thus [1  (n 1)]/n is a kind of variance 2 n
inflation factor for the variance of effect-size estimates
like dB. so that 18.15 reduces to 18.13 and 18.16 reduces to 18.14.
__
When the cluster sizes are not equal, an estimator of B When cluster sample sizes are unequal, substituting nB for
that is a generalization of equation 18.13 is given by n in equations 18.13 and 18.14 would give results that are
__ __ _________ quite close to the exact values.
 (nB 1)
Y T  Y C
(
dB  _______
SB ) 1__________
nB , (18.15)
When the analysis properly accounts for clustering by
treating the clusters as having random effects, and an
estimate b of the treatment effect, its variance V(b), and
where
an estimate  B2 of the variance component B2 is avail-
(m  1)nI  (m 1) C 1 able (for example, from an HLM analysis), the estimate
( M2
T nI
nB  ____________________
T C
) , of B is
mT

T (1/ni )
1
n IT  ___ b .
m
T ,
dB  ___

(18.17)
i1 B
EFFECT SIZES IN NESTED DESIGNS 343

The approximate variance of the estimate given in When all of the niT and niC are equal to n, nUT  nUC  n
equation 18.17 is and 18.21 reduces to 18.19.
b2V(  2)
The variance of dT when cluster samples sizes are un-
V(b)
vB  ____
 2
 _______
4  6
B .
(18.18) equal is somewhat more complex. It is given by
B B

N T  NC
18.2.3.3 Estimating T An estimate of T can also be
obtained using the intraclass correlation. When all the
(
vT  _______
T C
N N )
(1  ( 1))

cluster sample sizes are equal to n, an estimator of T is [(N2)(1)2  A2  2B(1)]d2 T,


 ______________________________
2(N  2)[(N  2)  (N  2  B)]
(18.22)
__ __ __________
2(n 1)
Y Y
( ) 1
T C
dT  _______ ________ ,
S

N 2 (18.19) where the auxiliary constants A and B are defined by A 
AT  AC,
T

which is normally distributed in large samples with (an


( )
mT mT 2 mT
estimated) variance of (NT)2 ( nTi )2  ( nTi )2  2NT ( nTi )3
AT  _____________________________
i1 i1 i1
,
NT  NC
(
vT  _______
NTNC )
(1  (n 1))
(NT)2

( )
mC mC 2 mC

(N  2)(1 )2  n(N  2n)2  2(N  2n)(1 ) (NC)2 ( niC )2  ( niC )2 2NC ( niC )3
(
 dT2 _________________________________________
2(N  2)[(N  2)  2(n  1) ]
. ) i1 i1
AC  _______________________________
(N C )2
i 1 ,
(18.20)
and
Note that if   0 and there is no clustering, dT reduces
to the conventional standardized mean difference and B  nUT (mT 1)  nCU (mC 1).
equation 18.20 reduces to the usual expression for the
variance of the standardized mean difference (see chapter When niT and niC are equal to n, A  n(N  2n), nUT 
12, this volume). n  n and B  (N  2n) so that equation 18.22 reduces to
C
U
The leading term of the variance in equation 18.20 18.20. These expressions suggest that if cluster sample
arises from uncertainty in the mean difference. Note that sizes are unequal, substituting the average of nUT and nUC
this leading term is [1  (n  1)] as large as would be into equations 18.19 and 18.20 would give results that are
expected if there were no clustering in the sample (that is quite close to the exact values.
if   0). The expression [1  (n  1)] is the variance When the analysis has been carried out using an ana-
inflation factor mentioned by Allan Donner and col- lytic method that properly accounts for clustering by
leagues (1981) and the design effect mentioned by Leslie treating clusters as random effects, the estimate of T is
Kish (1965) for the variance of means in clustered sam- b
dT  ________
_______ . (18.23)
ples and also corresponds to a variance inflation factor for  B2  W2
the effect-size estimates like dT.
When cluster sample sizes are unequal, the estimator Assuming that V(  W2) is negligible and that  B2 is inde-
of T becomes pendent of the estimated treatment effect b, the approxi-
__ __ ____________________________ mate variance of the estimate 18.23 is
Y  Y
( ) 1 ( nTU mT  nCU mC)  nUT  nCU  2
)
T C
(N
dT  _______
S
___________________________
N2
, V(b) b2V(  2 )
T
vT  _______
 2  2
 ___________
B
4  2  2 3
. (18.24)
(18.21) B W ( B W )

where mT
18.2.3.4 Confidence Intervals for W , B , and T
(N )  ( n
T 2
)
T 2
i
The results here can also be used to compute confidence
nUT  ______________
i1 , intervals for effect sizes. If  is any one of the effect sizes
T T N (m 1) mentioned, d is a corresponding estimate, and vd is the
and estimated variance of d, then a 100(1  ) percent confi-
mC
dence interval for  based on d and vd is given by
(N C )2  ( niC )2
nU  ______________
C i1 .
d  c/2vd    d  c/2vd,
N C(mC 1) (18.25)
344 SPECIAL STATISTICAL ISSUES AND PROBLEMS

where c / 2 is the 100(1 /2) percent point of the stan- Example. An evaluation of the connected mathematics
dard normal distribution (for example, 1.96 for /2  curriculum reported by Ridgway, and colleagues com-
0.05/2  0.025). pared the achievement of mT 18 classrooms of sixth
grade students who used connected mathematics with that
18.2.4 Applications in Two-Level Designs of mC  9 classrooms in a comparison group that did not
use connected mathematics (2002). In this quasi-experi-
This chapter is intended to be useful in deciding what ef- mental design the clusters were classrooms. The cluster
fect sizes might be desirable to use in studies with nested sizes were not identical but the average cluster size in the
designs. It has described ways to compute effect-size es- treatment groups was N T/mT  338/18 18.8 and N C/mC
timates and their variances from data that may be re- 162/18 18 in the control group. The exact sizes of all
ported. I illustrate applications in this section. the clusters were not reported, but here I treat the cluster
18.2.4.1 Effect sizes for individuals The results in sizes as if they were equal and choose n 18 as a slightly
this chapter can be used to produce effect-size estimates conservative sample size. The mean
__ difference between
__
and their variances from studies that report (incorrectly treatment and control groups is Y T  Y C 1.9, the pooled
analyzed) experiments as if there were no nested factors. within-groups standard deviation ST  12.37. This evalu-
The required means, standard deviations, and sample sizes ation involved sites in all regions of the country and it
can usually be extracted from what may be reported. was intended to be nationally representative. Ridgeway
Suppose it is decided that the effect size T is appropri- and colleagues did not give an estimate of the intraclass
ate because most other studies both assign and sample correlation based on their sample. Larry Hedges and Eric
individually from a clustered population. Suppose that Hedberg have provided an estimate of the grade 6 intra-
the data are analyzed by ignoring clustering, then the test class correlation in mathematics achievement for the na-
statistic is likely to be either tion as a whole, based on a national probability sample,
______ __ __ of 0.264 with a standard error of 0.019 (2007). For this
 Y
( )
T C


NTNC Y example, I assume that the intraclass correlation is identi-
t  _______
NT  NC
_______
ST cal to that estimate, namely   0.264.
Suppose that the analysis ignored clustering and com-
or F  t2. Either can be solved for
pared the mean of all of the students in the treatment with
__ __
 Y
the mean of all of the students in the control group. This
( )
T C
Y
_______ ,
ST leads to a value of the standardized mean difference of
__ __
 Y
T C
Y
which can then be inserted into equation 18.19 along with _______
ST
 0.1536,
 to obtain dT . This estimate (dT ) of T can then be in-
serted into equation 18.20 to obtain vT , an estimate of the which is not an estimate of any of the three effect sizes
variance of dT . considered here. If an estimate of the effect size T is
Alternatively, suppose it is decided that the effect size desired, and we had imputed an intraclass correlation of
W is appropriate because most other studies involve only   0.264, then we use equation 18.19 to obtain dT 
a single site. We may begin by computing dT and vT as (0.1536)(0.9907)  0.1522.
before. Because we want an estimate of W, not T, we use The effect-size estimate is very close to the original
the fact given in equation 18.5 that standardized mean difference even though the amount of
_____ clustering in this case is not small. However this amount
W  T / 1  of clustering has a substantial effect on the variance of the
effect-size estimate. The variance of the standardized
and therefore mean difference ignoring clustering is
_____
dT / 1  324  162 _______________
_________ 0.15312
324 * 162  2(324  162  2)  0.009259.
(18.26)

is an estimate of W with a variance of However, computing the variance of dT using equa-


tion 18.20 with   0.264, we obtain a variance estimate
vT / (1 ). (18.27) of 0.050865, which is 549 percent of the variance ignor-
EFFECT SIZES IN NESTED DESIGNS 345

ing clustering. A 95 percent confidence interval for T is is an estimate of T with a variance of


given by
________ vB. (18.29)
0.2899  0.1522 1.960.050865
________
 T  0.1522  1.960.050865  0.5942. Alternatively, suppose it is decided that the effect size
W is the desired effect size. We may begin by computing
If clustering had been ignored, the confidence interval dB and vB as before. Because we want an estimate of W,
for the population effect size would have been 0.0350 not B, we use the fact given in equation 18.5 that
to 0.3422. _______
If we wanted to estimate W, then an estimate of W W  B/(1)
given by expression 18.26 is
0.1522
_________
and therefore
________  0.1774,
1 0.264 _______
dW  dB/(1) (18.30)
with variance given by expression 18.27 as

0.050865/(1 0.264)  0.06911, is an estimate of W with a variance of

and a 95 percent confidence interval for W based on 18.25 vB /(1). (18.31)
would be
_______
0.3379  0.17741.960.06911 18.3 DESIGNS WITH THREE LEVELS
_______
 W  0.1774  1.960.06911  0.6926. The methods described in section 18.2 are appropriate for
computing effect-size estimates and their variances from
18.2.4.2 Effect when Clusters Are the Unit of Anal- designs that involve two stages of sampling (one nested
ysis The results in this chapter can also be used to obtain factor or so-called two-level models). Many social ex-
different effect size estimates when the data have been periments and quasi-experiments, however, assign treat-
analyzed with the cluster mean as the unit of analysis. In ments to high level clusters such as schools or communi-
such cases, the researcher might report a t-test or a one ties when lower levels of clustering (two nested factors or
way analysis of variance carried out on cluster means, but so called three-level models) are also available. For ex-
we wish to estimate T. In this case, the test statistic re- ample, schools are frequently units of assignment in ex-
ported will either be periments, but students are nested within classrooms that
______ __T __
Y Y
( ) are in turn nested within schools. Most studies in educa-
C

mTmC _______
t  _______
mT  mC S B tion that sample students by first sampling schools must
take into account this nesting. A similar situation arises
or F  t2. Either can be solved for when individuals are nested within institutions nested
__ __
 Y
( ) within communities, but communities are assigned to
T C
Y
_______
SB , public health interventions.
Although I use the terms schools and classrooms to
which can then be inserted into equation 18.13 along with characterize the stages of clustering, this is merely a mat-
 to obtain dB. This estimate of dB can then be inserted ter of convenient and readily understandable terminol-
into equation 18.14 to obtain vB, an estimate of the vari- ogy. These results apply equally well to any situation in
ance of dB. Because we want an estimate of T, not B, we which there is a three-stage sampling design and treat-
use the fact given in equation 18.6 that ments are assigned to clusters (for example, schools).
__
T  B
18.3.1 Model and Notation
and therefore
__ As in section 18.2.1 to adequately define effect sizes and
dB (18.28) their variances, I need to first describe the notation. Let
346 SPECIAL STATISTICAL ISSUES AND PROBLEMS

YijkT (i 1, . . . , mT; j  1, . . . , piT; k  1, . . . , nijT) and Yijk


C and the (pooled) between classrooms but within treat-
(i  1, . . . , m ; j  1, . . . , pi ; k  1, . . . , nij ) be the kth
C C C 2 via
ment groups and schools variance SBC
observation in the jth classroom in the ith school in the piT piC
mT __ __ mC __ __
treatment and control groups respectively. Thus, in the
treatment group, there are mT schools, the ith school
(Y
i1 j1
 Y iT )2  ( Y ijC  Y i
T
ij
i1 j1
C 2
)
2  ________________________________ .
SBC PM
has piT classrooms, and the jth classroom in the ith school
has nijT observations. Similarly, in the control group, there
are mC schools, the ith school has piC classrooms, and the Suppose that observations within the jth classroom in
jth classroom in the ith school has nijC observations. Thus the ith school within the treatment and control group
there is a total of M  mT  mC schools, a total of groups are normally distributed about classroom means
ijT and ijC with a common within-classroom variance WC
2 .
mT mC
That is
P piT  piC
i1 i1
T ~ N  T, 2 i  1, . . . , mT; j  1, . . . , piT;
classrooms, and a total of
Yijk ( ij WC ),
mT piT mC piC k  1, . . . , nijT
N  NT  NC  n  nT
ij
C
ij
i1 j1 i1 j1 and
observations
__ overall. __
C ~ N  C, 2
Let Y iT (i 1, . . . , mT ) and Y iC (i 1, . . . , mC ) be the Yijk ( ij WC ), i  1, . . . , mC; j  1, . . . , piC;
means of the ith school
__ in the treatment and control groups, __ k  1, . . . , nijC.
respectively, let Y ijT (i 1, . . . , mT; j 1, . . . , pTi ) and Y ijC
(i 1, . . . , mC; j  1, . . . , pCi ) be the means of the jth class
Suppose further that the classroom means are random
in the ith school in__the treatment__ and control groups, re-
T C effects (a sample from a population of means, for exam-
spectively, and let Y and Y be the overall means in the
ple) so that the class means themselves have a normal
treatment and control groups, respectively. Define the
2 via
distribution about the school means iT and iC and com-
(pooled) within treatment groups variance SWT
mon variance BC2 . That is,

piT nijT piC nijC


mT __ mC __
( YijkT  YT )2 
i1 j1 k1
( YijkC YC )2
i1 j1 k1
ijT ~ N( iT,BC
2 ,
) i  1, . . . , mT; j  1, . . . , piT
2  ______________________________________ ,
SWT N2
and
2 via
the pooled within-schools variance SWS
mT piT nijT
mC piC nijC ijC ~ N( iC,BC
2 ,
) i  1, . . . , mC; j  1, . . . , piC.
__ __
(Y T
ijk Y T 2
i )  (Y C
Y
ijk )
C 2
i
i1 j1 k1 i1 j1 k1
2  ______________________________________
SWS , Finally, suppose that the school means iT and iC are
NM also normally distributed about the treatment and con-
and the (pooled) within-classroom sample variance trol group means T and C with common variance BS 2 .

2 via
SWC That is,
nijT nijC
iT ~ N( T,BS i  1, . . . , mT
piT piC
mT __ mC __ 2 ,
)
(Y  Y )  ( Yijk
i1 j1 k1
T
ijk
i1 j1 k1
C
YCij )2
T 2
ij

SWC 
2 ______________________________________ .
NP and
Define the between schools but within treatment groups iC ~ N( C,BS i  1, . . . , mC.
2 via
2
),
variance SBS
mT __ __ mC __ __ Note that in this formulation, BC
2 represents true varia-
(Y
 Y T )2  ( Y i
T
i
C
 Y
C 2
)
tion of the population means of classrooms over and
j1 j1
___________________________
2  ,
SBS M2 above the variation in sample means that would be ex-
EFFECT SIZES IN NESTED DESIGNS 347

pected from variation in the sampling of observations correlations describe the degree of clustering, just as the
into classroom. Similarly, BS 2 represents the true varia- single intraclass correlation describes the degree of clus-
tion in school means, over and above the variation that tering in designs with one level of nesting. The variance
would be expected from that attributable to the sampling of effect-size estimates depends on the intraclass correla-
of observations into schools. tions in designs with two levels of nesting, just as the vari-
These assumptions correspond to the usual assump- ance of effect-size estimates depends on the (single) intra-
tions that would be made in the analysis of a multisite class correlation in designs with one level of nesting.
trial by a three-level hierarchical linear models analysis, 18.3.1.2 Primary Analyses Many analyses reported
an analysis of variance (with treatment as a fixed effect in the literature fail to take the effects of clustering into
and schools and classrooms as nested random effects), or account or take clustering into account by analyzing clus-
a t-test using the school means in treatment and control ter means as the raw data. In either case, the treatment
group as the unit of analysis. effect (the mean difference between treatment and con-
18.3.1.1 Intraclass Correlations In principle, there trol groups), the standard deviations SBS , SBS , SWC , and
are several within-treatment group variances in a three- SWT , and the sample sizes may be reported (or deduced
level design. I have already defined the within-classroom, from information that is reported). In other cases, the
between-classroom, and between-school, variances WC 2 ,
analysis reported may have taken clustering into account
2 , and  2 . There is also the total variance within treat-
BC BS by treating clusters as having random effects. Such an
ment groups WT 2 defined by
analysisfor example, using the program HLM, SAS
Proc Mixed, or the STATA routine XTMixedusually
2  2  2  2 .
WT BS BC WC yields direct estimates of the estimated treatment effect b
and the variance components BS 2 ,  2 and  2 and their
BC WC
In most educational achievement data when clusters standard errors. This information can be used to calculate
are schools, BS2 and  2 are considerably smaller than
BC both the effect size and its approximate standard error.
WC
2 . Obviously, if the between school and classroom vari- 2 , 
Let  BS 2 , and 
BC 2 be the estimates of  2 ,  2 and
WC BS BC
ances BS2 and  2 are small, then  2 will be very similar
BC WT
2 and let V(b), V( 
WC 2 ), V( 
BS 2 ), and V( 
BC 2 ) be their
WC
to WC
2 .
variances (the square of their standard errors). Generally,
In two-level models, such as those with schools and V(  WC
2 ) depends primarily on the number of individuals,

students as levels, we saw in section 18.2.1.1 that the re- whereas V(  BS 2 ) and V(  2 ) depend on the number of
BC
lation between variances associated with the two levels schools and classrooms, respectively. Because the num-
was characterized by an index called the intraclass cor- ber of individuals typically greatly exceeds the number of
relation. In three-level models, two indices are needed to schools and classrooms, V(  WC 2 ) will be so much smaller

characterize the relationship between these variances. 2 ) and V( 


than either V(  BS 2 ) that V( 
BC 2 ) can be consid-
WC
Define the school-level intraclass correlation S by ered negligible in comparison.
2 2
S  _____________
BS
2  2  2
 ____
2
BS
. (18.32) 18.3.2 Effect Sizes
BS BC WC WT

Similarly, define the classroom level intraclass correla- In three-level designs, several standardized mean differ-
tion C by ences are possible because at least five standard devia-
2 2
tions can be used. The choice should be determined on
C  _____________
BC
2  2  2
 ____
2
BC .
(18.33) the basis of the inference of interest to the researcher. If
BS BC WC WT
the effect-size measures are to be used in meta-analysis,
These intraclass correlations can be used to obtain an important inference goal may be to estimate parame-
one of these variances from any of the others, given that ters comparable with those that can be estimated in other
BS
2
 S WT
2
, BC
2
 C WT
2
, and WC
2
 (1 S  C) WT2
. studies. In such cases, the standard deviation chosen
Note that if BS  0, then S  0 and C reduces to  given might be the same kind used in those to which the study
in (18.1) and that if BC  0, C  0 and S reduces to  will be compared. Three effect sizes seem likely to be the
given in (18.1). most widely used, but two others might be useful in some
In designs with two levels of nesting, the two intraclass circumstances.
348 SPECIAL STATISTICAL ISSUES AND PROBLEMS

One effect size standardizes the mean difference by the be of general interest, but might be of interest when the
total standard deviation within-treatment groups. It is in outcome variable on which the treatment effect is defined
the form is conceptually at the level of schools and the individual
T  C
observations are of interest only because the average de-
T  C
WT  _______
WT  _____________
_______________
. (18.34) fines an aggregate property.
BS2  BC
2  2
WC
If BC  0 (and hence C  0), a fifth possible effect
The effect size WT might be of interest in a meta-anal- size would be
ysis where the other studies have three-level designs or T  C 
sample from a broader population but do not include BC  _______

BC  ___
____
WT
 . (18.38)
C
schools or classes in the sampling design. This would
typically imply that they used an individual, rather than a The effect size BC may also be of interest in a meta-
cluster, assignment strategy. In such cases, WT might be analysis when the studies being compared are typically
the most comparable effect size. multilevel studies analyzed using classroom means as the
2   2  0 (and hence   1), a second effect
If BC WC S
unit, or when the outcome is conceptually defined as a
size can be defined that standardizes the mean difference classroom-level property. It might be of interest when the
by the standard deviation within-schools. It is in the form outcome variable and the individual observations are of
interest only because the classroom average defines an
T  C 
WS  __________

_________  _______
WT
_____ . (18.35) aggregate property.
BC
2  2
WC 1 S Because S and C will often be rather small, WS and
The effect size WS might be of interest in a situation WC will often be similar in size to WT. For the same rea-
where most of the studies of interest had two level de- son, BS and BC will typically be much larger in the same
signs involving multiple schools. In such studies WS may population than WS, WC, or WT. Note also that if all of
(implicitly or explicitly) be the effect size estimated and the effect sizes are defined, that is, if 0  S  1, 0  C  1,
hence be most comparable with the effect-size estimates and S  C  1, any one of these effect sizes may be ob-
of the other studies. tained from any of the others. In particular, equations
If WC  0 (and hence S  C  1), a third effect size 18.34 through 18.38 can be solved for WT in terms of any
can be defined that standardizes the mean difference by of the other effect sizes, S, and C. These same equations,
the within-classroom standard deviation. It has the form in turn, can be used to define any of the five effect sizes in
_____ terms of WT.
WT
T  C __________ WS1S .
WC  _______
WC  ________  __________
________ (18.36)
1  1 
S C
S C
18.3.3 Estimates of Effect Sizes
The effect size WC might be of interest, for example, in in Three-Level Designs
a meta-analysis where the other studies to which the cur-
rent study is compared are typically studies with a single As in two-level designs, formulas for estimates and their
school. In such studies, WC may (implicitly) be the effect variances are much simpler for balanced designs when all
size estimated and thus WC might be the most compara- cluster sample sizes are equal. Here I present estimates of
ble effect size. the effect sizes and their approximate sampling distribu-
Two other effect sizes that standardize the mean differ- tions, first for when all the schools have the same number
ence by between classroom and between school standard ( p) of classroomsthat is, when piT  p, i 1, . . . , mT
deviations may be of interest, though less frequently. If and piC  p, i 1, . . . , mCand all the classroom sample
BS  0 (and hence S  0), a fourth possible effect size sizes are equal to nthat is, when nijT  n, i 1, . . . , mT;
would be j 1, . . . , p and nijC  n, i 1, . . . , mC; j 1, . . . , p.
In this case, N T  npmT, NC  npmC, and N  N T 
T  C 
WT . NC  np(mT  mC)  nM.
BS  _______

BS  __
____
 (18.37)
S 18.3.3.1 Estimating WC We start with estimation of
The effect size BS may also be of interest in a meta- WC, which is the most straightforward. If S  C  1 (so
analysis where the studies being compared are typically that WC  0), then WC is defined then the estimate
__ __
multilevel studies that have been analyzed by using Y T Y C
school means as the unit of analysis. This is less likely to dWC  ________

S

(18.39)
WC
EFFECT SIZES IN NESTED DESIGNS 349

is a consistent estimator of WC that is approximately nor- is the total number of classrooms, and  NTNC/(NT  NC).
mally distributed with mean WC. If cluster sizes are equal Note that if all the niT and niC are equal to n, then nU  n, and
the estimated variance of dWC is that if all the piT and piC are equal to p, then pU  pn, and
1 ( pn 1)  (n 1) d2 P  Mp, so that equation 18.41 reduces to equation 18.40.
vWC  _____________________
mpn(1

S
  )
C
 _________
WC
2(N  Mp)
, (18.40) When the analysis properly accounts for clustering by
S C
treating the clusters as having random effects, and an es-
where timate b of the treatment effect and  WC 2
of the variance
T C
component WC 2
is available (for example, from an HLM
mm .
 _______
m mT  mC
analysis), the estimate of WC is
b .
The leading term of the variance in equation 18.40 dWC  ____
 (18.42)
WC
arises from uncertainty in the mean difference. It is [1
( pn 1)S  (n  1)C]/(1 S  C)] as large as would be The approximate variance of the estimate given in
expected if there were no clustering in the sample, that is, 18.42 is
if S  C  0. Thus [1  ( pn 1)S  (n 1)C]/(1 S 
V(b)
C)] is a kind of variance inflation factor for the variance vWC  ____
 2
. (18.43)
WC
of the effect-size estimate dWC.
Note that if either S  0 or C  0, there is no clustering 18.3.3.2 Estimating WS If S  1, then WS is defined
at one level and that the three-level design logically re- an estimate of WS can also be obtained from the mean
duces to a two-level design. If S  0, the expression 18.39 difference and SWS if the intraclass correlations S and C
for dWC and the expression 18.40 for its variance reduce to are known or can be imputed. When all the cluster sample
equations 18.7 and 18.8 in section 18.2.3 for both the cor- sizes are equal, a consistent estimator of WS is
responding effect-size estimate dW and its variance in a __ __ ______________
(n 1)C
Y  Y
( ) 1
T C
two-level design, with clusters of size n and intraclass cor- dWS  ________ _____________
S (1 S)(np 1) , (18.44)
relation   C. Similarly, if C  0, expressions 18.39 and WS

18.40 also reduce to equations 18.7 and 18.8 of section and the estimate dWS is normally distributed in large sam-
18.2.3, but with cluster size np and intraclass correlation ples with mean WS variance
  S. If S  C  0 so that there is no clustering by class-
1  (pn 1)  (n 1)
room or by school, the design is logically equivalent to a vWS  _____________________
S
mpn(1
)
C

one-level design. In this case, the expression 18.39 for dWT S

__ 2 __
and expression 18.40 for its variance reduce to the usual ( pn 1)  2n(p1)C  n (p1)C
( )
2 2
 dWS
2 _______________________________________ ,
expressions for the standardized mean difference and its 2M[(pn 1) (1 S)  ( pn 1)(n 1)C (1 S)]
2 2

variance (see, for example, chapter 12, this volume). (18.45)


When cluster sample sizes are unequal, the estimate __
dWC of WC is identical to that in equation 18.39, but the where   (1 S  C).
estimated variance is given by The leading term of the variance in equation 18.45
arises from uncertainty in the mean difference. Note that it
1  ( p 1)  (n  1) d2
vWC  ______________________
U S U C
 ________
WC
2(N  P)
, (18.41) is [1( pn1)S  (n 1)C]/(1 S)] as large as would be
(1 S  C)
expected if there were no clustering in the sample, that is,
where if S  C  0. Thus the term [1 ( pn 1)S  (n 1)C]/
(1 S )] is a kind of variance inflation factor for the vari-
( ) ( )
mT piT 2 mC piC 2

NC nijT NT nijC ance of the effect-size estimate dWS.


i1 j1 i1 j1
pU  ____________  _____________ , If C  0 so that there is no clustering effect by class-
NNT NNC
mT piT mC piC
rooms, the three-level design is logically equivalent to a
NC ( nijT )2 NT ( nCij )2 two-level design. In this case, the expressions 18.44 for
i1 j1
____________ i1 j1
, dWS and 18.45 for its variance reduce to equations 18.7
nU  NNT  ____________
NNC and 18.8 in section 18.2.3 for the corresponding effect-size
mT mC
estimate (called there dW ) and its variance in a two-level
P  piT  piC
i1 i1
design, with clusters of size pn and intraclass correlation
350 SPECIAL STATISTICAL ISSUES AND PROBLEMS

  S. If S  C  0 so that there is no clustering by class correlations S and C are known or can be imputed.
classroom or by school, the design is logically equivalent When cluster sample sizes are equal, a consistent estima-
to a one-level design. In this case, the expression 18.44 tor of WT is
for dWS and expression 18.45 for its variance reduce to __ __ ____________________
2( pn 1)S  2(n  1)C
Y Y
( ) 1
T C
those for the standardized mean difference and its vari- dWT  ________ ____________________ , (18.51)
ance in a one-level design (see chapter 12, this volume). S WT N2
When cluster sample sizes are unequal, an estimate dWS
of WS is which is normally distributed in large samples with mean
__ __ ______________ WT and variance
( 1)C
Y Y
( ) 1
T C
dWS  ________ _____________ , (18.46)
S WS (1 S)(NS 1) vWT
where 1  (pn 1)  (n 1)
 _____________________
S C

( / ) ( / )
MT piT MC piC mpn

(n )
T 2
niT  ( nijC )2 niC __ __ __
)
ij


i1
j1 i1 j1
___________________________
MT  MC
(18.47)  dWT ( pn N  2  n N 2  (N  2) 2  2n N   2 N   2 N 
2 ________________________________________________
S C S C S
2(N  2)[(N  2)  2(pn 1)  2(n 1) ]
C
,
S C
piT (18.52)
niT  nijT,
j1 where N  (N  pn), N __
 (N  2n), and   1 S  C.
piC
The leading term of the variance in equation 18.52
niC  nijC, arises from uncertainty in the mean difference. It is [1
j1
(np 1)S  (n 1)C] as large as would be expected if
and NS  N/M is the average sample size per school. Note there were no clustering in the sample, that is, if S 
that if all of the nijT and nijC are equal to n and all the pjT and C  0. The term [1 (np 1)S  (n 1)C] is a general-
pjC are equal to p, then  n, NS  pn, and (18.46) reduces ization to two stages of clustering of the variance infla-
to 18.45. tion factor mentioned by Allan Donner and colleagues
The variance of dWS in the unequal sample size case is (1981) for the variance of means in clustered samples
rather complex. It is given exactly in Hedges (in press), (with one stage of clustering) and corresponds to a vari-
but a slight overestimatethat is, a slightly conservative ance inflation factor for the effect-size estimate dWT.
estimateof the variance of dWS is If C  0 so that there is no clustering by classrooms, or
1  (pU 1)S  (nU 1)C d2 if S  0 so that there is no clustering by schools, the
vWS  _____________________  ________
WS
2(P  M)
. (18.48) three-level design is logically equivalent to a two-level
(1 S)
design. If S  0, the expression 18.51 for dWT and expres-
When the analysis properly accounts for clustering by sion 18.52 for its variance reduce to equations 18.19 and
treating the clusters as having random effects, and an es- 18.20 in section 18.2.3.3 for the corresponding effect-
timate b of the treatment effect,  WC
2 of the variance com-
size estimate dT and its variance in a two-level design,
ponent WC2 , and  BC of the variance component BC
2 2 are
with clusters of size n and intraclass correlation   C.
availablefor example, from an HLM analysis, the esti- Similarly, if C  0, expressions 18.51 and 18.52 reduce
mate of WS is to equations 18.19 and 18.20 of section 18.2.3.3 for the
b . corresponding effect size in a two-level design, with clus-
dWS  __________
_________ (18.49) ter size np and intraclass correlation   S.
 BC
2
 WC
2

When school or classroom sample sizes are unequal, a


2 ) is negligible and that 
Assuming that V(  WC 2 is in-
BC consistent estimate dWT of WT is
dependent of the estimated treatment effect b, the ap-
__ __ _____________________
proximate variance of the estimate in (18.49) is 2( pU 1)S  2(nU 1)C
Y  Y
( )
T C
dWT  ________ 1 _____________________
N2
, (18.53)
V(b) b2V(  BC
2
) . S WT
vWS  _________
 BC  WC
 ____________
4(  BC  WC
(18.50)
2 2 2 2
)2 where pU and nU are given above. Note that if all of the nijT
18.3.3.3 Estimating WT An estimate of WT can also and nijC are equal to n and all the pjT and pjC are equal to p,
be obtained from the mean difference and SWT if the intra- equation 18.53 is reduced to 18.51.
EFFECT SIZES IN NESTED DESIGNS 351

The exact expression for the variance of dWT is complex and 18.14 in section 18.2.3.2 for the corresponding ef-
when sample sizes are unequal (see Hedges, in press). A fect-size estimate (called there dB ) and its variance in a
slight overestimate, that is, a slightly conservative esti- two-level design with clusters of size n and intraclass cor-
mate, of the variance of dWT is relation   C.
18.3.3.5 Estimating BS If S  0 (so that BS is de-
1  ( pU 1)S  (nU 1)C ________
d 2
vWT  _____________________  2(MWT
 2)
. (18.54) fined), SBS may be used, along with the intraclass correla-

tions S and C, to obtain an estimate of BS . When all the
When the analysis properly accounts for clustering by cluster sample sizes are equal,
treating the clusters as having random effects, and an esti- __ __ __________________
( pn 1)S  (n 1)C
mate b of the treatment effect,  BS

2 of the variance compo- Y  Y 1
T C
dBS  ________ _____________________
pnS (18.59)
2 , 
nent BS 2 of the variance component  2 , and 
BC BC
2 of
WC S BS

the variance component WC 2 are available (for example,

from an HLM analysis), the estimate of WS is is a consistent estimate of BS that is normally distributed
in large samples with mean BS and (estimated) variance
b
dWT  _______________
______________ . (18.55)
 BS2  BC
2  2
WC 1 ( pn 1)  (n 1) [1 ( pn 1)  (n 1) ]d2
vBS _____________________
S C
 _________________________
S C BS .
mpn
S 2(M  2)pnS
2 ) is negligible and that 
Assuming that V(  WC 2 and
BS
(18.60)
 BC
2 are independent of each other and the estimated treat-

ment effect b, The approximate variance of the estimate


The presence of S in the denominator of the estimator
given in 18.55 is
and its variance is possible since the effect-size parameter
V(b) b2[ V(  2 )  V(  2 ) ] exists only if S  0.
vWT  ______________  __________________
BS BC . (18.56)
 2  2  2
BS BC
2
WC
2 2
4( BS  BC  WC )
3
The variance in equation 18.60 is [1 ( pn 1)S 
(n 1)C]/pnS] as large as the variance of the standard-
18.3.3.4 Estimating BC If C  0 (so that BC is de- ized mean difference computed from an analysis using
fined), SBC may be used, along with the intraclass correla- school means as the unit of analysis when S  C  0.
tions S and C, to obtain an estimate of BC . When all of Thus [1 S  (n 1)C]/pnS] is a kind of variance in-
the cluster sample sizes are equal, flation factor for the variance of effect-size estimates
__ __ _____________ like dBS.
1 S  (n  1)C

Y  Y _______________
T C
dBC  ________
S nC (18.57) Note that if C  0 so that there is no clustering by
BC
classroom, expressions 18.59 and 18.60 reduce to the
is a consistent estimate of BC that is normally distributed corresponding estimate (called there dB) in equation 18.13
in large samples with mean BC and variance and its variance in equation 18.14 of section 18.2.3.2 in a
two-level design, with cluster size np and intraclass cor-
1   (n 1) [1   (n 1)  ]d 2
vBC  ______________
S C
___________________
S C BC
. (18.58) relation   S.
mpn
C 2M( p 1)n C
18.3.3.6 Unequal Sample Sizes when Using dBC and
The presence of C in the denominator of the estimator dBS There is more than one way to generalize the estima-
and its variance is possible because the effect-size param- tor dBC to unequal school and classroom sample sizes.
eter exists only if C  0. One is to use the means of the classroom means in the
The variance in equation 18.58 is [1 S  (n 1)C]/ treatment and control group in the numerator and stan-
nC] as large as that of the standardized mean difference dard deviation of the classroom means in the denomina-
computed from an analysis using class means as the unit tor. This corresponds to using the classroom means as the
of analysis when S  C  0, that is, computing the first unit of analysis. Another possibility is to use the grand
factor on the right hand side of equation 18.58. Thus [1 means in the treatment and control groups. Similarly, there
S  (n 1)C]/nC] is a kind of variance inflation factor are multiple possibilities for the denominator, such as the
for the variance of effect-size estimates like dBC. mean square between classrooms or the standard devia-
Note that if S  0 so that there is no clustering by tion of the classroom means. When school and classroom
school, the expression 18.57 for dBC and expression 18.58 sample sizes are identical, all these approaches are equiv-
for its variance reduce to those given in equations 18.13 alent in that the effect-size estimates are identical. When
352 SPECIAL STATISTICAL ISSUES AND PROBLEMS

the cluster sample sizes are not identical, the resulting studies that incorrectly analyze three level designs as if
estimators (and their sampling distributions) are also not there were no clustering in the sampling design. The re-
the same. quired means, standard deviations, and sample sizes can-
Similarly, there is more than one way to generalize the not always be extracted from what may be reported, but
estimator dBS to unequal school and classroom sample often it is possible to extract the information to compute
sizes. One possibility is to use the means of the school at least one of the effect sizes discussed here.
means in the treatment and control group in the numera- Suppose it is decided that the effect size WT is appro-
tor and their standard deviation in the denominator. This priate because most other studies both assign and sample
in effect uses the school means as the unit of analysis. individually from a clustered population. Suppose that
Another possibility for the numerator is to use the grand the data in a study are analyzed by ignoring clustering.
means in the treatment and control groups. Similarly, there The test statistic is then likely to be either
are multiple possibilities for the denominator, such as the ______ __ __
 Y
( )
T C


N TNC Y
mean square between schools or the standard deviation of t  _______
N T  NC
________
SWT
school means. When school and classroom sample sizes
are identical, all these approaches are equivalent in that or F  t2. Either can be solved for
the effect-size estimates are identical. When the school and __ __
 Y
( )
T C
classroom sample sizes are not identical, the resulting es- Y
________
SWT
,
timators and their sampling distributions are of course not
the same. which can then be inserted into equation 18.51 or 18.53
Because there are so many possibilities, because some along with S and C to obtain dWT. This estimate of dWT
of them are complex, and because the information neces- can then be inserted into equation 18.52 or 18.54 to ob-
sary to use themmeans, standard deviations, and sam- tain vWT, an estimate of the variance of dWT.
ple sizes for each classroom and schoolis frequently Alternatively, suppose it is decided that the effect size
not reported in research reports, I do not include the esti- WC is appropriate because most other studies involve only
mates or their sampling distributions here. A sensible ap- a single site. We may begin by computing dWT and vWT as
proach if these estimators are needed is to compute the before. Because we want an estimate of WC , not WT , we
variance estimates for the equal sample size cases, substi- use the fact given in equation (18.36) that
tuting the equal sample size formulas.
18.3.3.7 Confidence Intervals for WC , WS , WT , 
WC  ___________
WT
_________
BC , and BS Confidence intervals for any of the effect 1 S  C
sizes described in section 18.3 can be obtained from the and therefore
corresponding estimate using the method described in
dWT
section 18.2.3.4. That is, if  is any one of the effect ___________
_________ (18.61)
sizes mentioned, d is a corresponding estimate, and vd is 1 S  C
the estimated variance of d, then a 100(1 ) percent is an estimate of WC with a variance of
confidence interval for  based on d and vd is given by
vWT
___________
(18.25). _________ . (18.62)
1 S  C
18.3.4 Applications at Three Levels Similarly, we might decide that WS was a more appro-
priate effects size because most other studies involve
This chapter should be useful in deciding what effect
schools that have several classrooms, and nesting of stu-
sizes are desirable in a three-level experiment or quasi-
dents within classrooms is not taken into account in most
experiment. It should also be useful for finding ways to
of them. We may begin by computing dWT and vWT as be-
compute effect-size estimates and their variances from
fore. Because we want an estimate of WS, not WT, we use
data that may be reported.
the fact given in equation 18.35 that
18.3.4.1 Effect Sizes When Individuals Are the Unit
of Analysis The results in section 18.3 can be used to 
WS  _______
WT
_____
produce effect-size estimates and their variances from 1 S
EFFECT SIZES IN NESTED DESIGNS 353

and therefore 18.4 CAN A LEVEL OF CLUSTERING


dWT
_______ SAFELY BE IGNORED?
_____ (18.63)
S
1
One might be tempted to think that accounting for clus-
is an estimate of WS with a variance of tering at the highest level, schools, for example, would be
vWT enough to account for most of the effects of clustering at
_______
_____ . (18.64) both levels. Indeed, the belief among some researchers
1 S that only the highest level of clustering matters in analy-
If we decided that BC or BS was a more appropriate sis, and that lower levels can safely be omitted if they are
effect size, then we could still begin by computing dWT not explicitly part of the analysis, is widely held. Another
and vWT , but then solve 18.38 for BC , namely sometimes expressed is that any level of the design can be
___ ignored if it is not explicitly included in the analysis. It is
BC  WT / C clear from 18.39, 18.44, and 18.51 that the effect-size es-
timates dWC , dWS . and dWT do not depend strongly on the
to obtain intraclass correlation, at least within plausible ranges of
___ intraclass correlations likely to be encountered in social
dWT / C (18.65)
and educational research. The analytic results presented
as an estimate of BC with variance here, however, show that in situations where there is a
plausible amount of clustering at two levels, omitting ei-
vWT / C (18.66) ther level from variance computations can lead to sub-
stantial biases in the variance of effect-size estimators.
or solve equation 18.37 for BS, namely Table 18.1 shows several calculations of the variance of
the effect-size estimate dWT and its variance in a balanced
__
BS  WT / S design__ with __ m  10, n  20, S  0.15, C  0.10, and
d  ( T T  T
C
)/SWT  0.15 (fairly typical values). Omit-
to obtain ting the classroom (lowest level) of clustering leads to
__
underestimating the variance when the number of class-
dWT / S (18.67) rooms per school is small, underestimation that becomes
less serious as the number of classrooms per school in-
as an estimate of BS with variance vWT / S . creases. Omitting it from variance computations leads to

Table 18.1 Variance Effect of Omitting Levels on the Variance of Effect Size dW T

Variance Computed Estimated Variance / True Variance

Ignoring Ignoring Ignoring Ignoring


Ignoring Ignoring Classes Classes Ignoring Ignoring Classes Classes
p dWT vWT Classes Schools   S  C   S  C / p Classes Schools   S  C   S  C / p

1 0.1482 0.0576 0.0385 0.0290 0.0576 0.0576 0.670 0.504 1.000 1.000
2 0.1485 0.0438 0.0343 0.0145 0.0538 0.0440 0.783 0.332 1.229 1.006
3 0.1486 0.0392 0.0329 0.0097 0.0525 0.0394 0.838 0.247 1.341 1.006
4 0.1487 0.0369 0.0321 0.0073 0.0519 0.0371 0.871 0.197 1.407 1.005
5 0.1487 0.0355 0.0317 0.0058 0.0515 0.0357 0.893 0.163 1.451 1.004
8 0.1488 0.0335 0.0311 0.0036 0.0510 0.0336 0.929 0.108 1.524 1.003
16 0.1488 0.0317 0.0305 0.0018 0.0505 0.0318 0.963 0.057 1.591 1.002

source: Authors compilation.


note: These calculations assume n  20, S  0.15, C  0.10, and dWT  0.15
354 SPECIAL STATISTICAL ISSUES AND PROBLEMS

underestimating the variance of dWT by 22 percent when on surveys of health outcomes. Larry Hedges and Eric
p  2, but by only 11 percent when p  5. Because the Hedberg presented a compendium of intraclass correla-
number of classrooms per school is typically small in tions computed from surveys of academic achievement
educational experiments, we judge that variance is likely that used national probability samples at various grade
to be seriously underestimated when the classroom level levels (2007). This compendium provides national values
is ignored in computing variances of effect-size estimates for intraclass correlations as well as values for regions of
in three-level designs. the country and subsets of regions differing in level of
The table makes it clear that omitting the school level urbanicity.
of clustering leads to larger underestimates of variance Although most of these surveys have focused in intra-
when the number of classrooms per school is small. These class correlations in two-level designs, they do provide
misestimates become even more serious when the num- some empirical information and guidelines about patterns
ber of classrooms becomes larger. Omitting the school of intraclass correlations. The National Assessment of
level of clustering from variance computations leads to Educational Progress (NAEP) uses a sampling design
underestimating the variance of dWT by about 67 percent that makes possible the computation of both school and
when p  2, but by 84 percent when p  5. classroom level intraclass correlations (our S and C).
One possible substitute for computing variances using Spyros Konstantopoulos and Larry Hedges used NAEP
three levels is to compute the variances assuming two to estimate school and classroom level intraclass correla-
levels, but use a composite intraclass correlation that in- tions in reading and mathematics achievement (2007).
cludes both between-school and between-class compo- The values of school-level intraclass correlations (S val-
nents. Table 18.1 shows that using S  C in place of the ues) are consistent with those obtained with those Bar-
school-level intraclass correlation performs poorly, lead- bara Nye, Hedges, and Konstantopoulos obtained for a
ing to overestimates of the variance of from 23 percent large experiment involving seventy-one school districts
(for p  2) to 45 percent (for p  5). Using S  C /p in in Tennessee (2000). In both cases, the classroom level
place of the school-level intraclass correlation performs intraclass correlations (C values) are consistently about
quite well, however. two-thirds the size of the school-level intraclass correla-
tions (S values).
18.5 SOURCES OF INTRACLASS In most situations, exact values of intraclass correla-
CORRELATION DATA tion parameters will not be available, so effect sizes and
their variances computed from results in this paper will
Intraclass correlations are needed for the methods de- involve imputing values for intraclass correlations. In
scribed here, but are often not reported, particularly when such cases, it can be helpful to consider a range of values.
these effect sizes are calculated retrospectively in meta- Similarly, it may be useful to ask how large the intraclass
analyses. However, because plausible values of  are es- correlation would need to be to result in a nonsignificant
sential for power and sample size computations in plan- effect size. In some cases, studies may actually report
ning cluster randomized experiments, there have been enough information, between and within-cluster vari-
systematic efforts to obtain information about reasonable ance component estimates, to estimate the intraclass cor-
values of  in realistic situations. Some information comes relation. Even in cases where a study provides an esti-
from cluster randomized trials that have been conducted. mate of the intraclass correlations, such an estimate is
David Murray and Jonathan Blitstein, for example, re- likely to have large sampling uncertainty unless the study
ported a summary of intraclass correlations obtained from has hundreds of clusters. Thus, even in this case, it is
seventeen articles on cluster randomized trials in psychol- wise to consider a range of possible intraclass correla-
ogy and public health (2003). Murray, Sherri Varnell, and tion values. The exact range used should depend on the
Blitstein gave references to fourteen studies that provided amount of clustering plausible in the particular setting
data on intraclass correlations for health related outcomes and outcomes involved. For example, in educational set-
(2004). Other information on reasonable values of  comes tings where the clusters are schools and academic achieve-
from sample surveys that use clustered sampling designs. ment is the outcome, the data reported in Hedges and
For example, Martin Gulliford, Obioha Ukoumunne, and Hedberg suggested that it might be wise to consider a
Susan Chinn (1999) and Vijay Verma and Than Le (1996) range of school-level intraclass correlations from   0.10
have both presented values of intraclass correlations based to 0.30 (2007).
EFFECT SIZES IN NESTED DESIGNS 355

18.6 CONCLUSIONS relations for the Design of Community-Based Surveys and


Intervention Studies: Data from the Health Survey for En-
The clustered sampling structure of nested designs leads to gland 1994. American Journal of Epidemiology 149(9):
different possible effect-size parameters and has an impact 87683.
on both effect-size estimates and their variances. Methods Hedges, Larry V. 1981. Distribution Theory for Glasss Es-
have been presented for using intraclass correlations to timator of Effect Size and Related Estimators. Journal of
compute various effect-size estimates and their variances Educational Statistics 6(2): 10728.
in nested designs. It is important to put the impact of clus- . 2007. Effect Sizes in Cluster Randomized Desi-
tering on meta-analysis in perspective. The results of this gns. Journal of Educational and Behavioral Statistics
chapter show that, though the effects of clustering on the 32(4): 34170.
variance of effect-size estimates can be profound, the ef- . In press. Effect Sizes in Designs with Two Levels
fects of clustering on the value of the estimates them- of Nesting. Journal of Educational and Behavioral Statis-
selves is often modest. Failing to account for clustering in tics.
computing effect-size estimates will therefore typically Hedges, Larry V., and Eric C. Hedberg. 2007. Intraclass
introduce little bias. Failure to account for clustering in Correlations for Planning Group Randomized Experiments
the computation of variances, however, can lead to seri- in Education. Educational Evaluation and Policy Analysis
ous underestimation of these variances. The underestima- 29(1): 6087.
tion of variance can have two consequences. First, it will Kish, Leslie. 1965. Survey Sampling. New York: John Wiley &
lead to studies whose variance has been underestimated Sons.
receiving larger weight than they should. Second, it will Klar, Neil, and Allan Donner. 2001. Current and Future
lead to underestimating the variance of combined effects. Challenges in the Design and Analysis of Cluster Randomi-
However, if only a small proportion of the studies in a zation Trials. Statistics in Medicine 20(24): 372940.
meta-analysis involve nested designs, ignoring clustering Konstantopoulos, Spyros, and Larry V. Hedges. 2007. How
in these studies may have only a modest effect on the Large an Effect Can We Expect from School Reforms?
overall outcome of the meta-analysis. On the other hand, Teachers College Record 110(8): 161340.
if a large proportion of the studies involve clustering, or if Murray, David M., and Jonathan L. Blitstein. 2003. Methods
the studies that do involve clustering receive a large pro- to Reduce the Impact of Intraclass Correlation in Group-
portion of the total weight, then ignoring clustering could Randomized Trials. Evaluation Review 27(1): 79103.
have a substantial impact on the overall results, particu- Murray, David M., Sherri P. Varnell, and Jonathan L. Blits-
larly on the variance of the estimated overall effect size. tein. 2004. Design and Analysis of Group-Randomized
Trials: A Review of Recent Methodological Developments.
18.7 REFERENCES American Journal of Public Health 94(3): 42332.
Nye, Barbara, Larry V. Hedges, and Spyros Konstantopou-
Bryk, Anthony, and Barbara Schneider. 2000. Trust in Scho- los. 2000. The Effects of Small Classes on Achievement:
ols. New York: Russell Sage Foundation. The Results of the Tennessee Class Size Experiment. Ame-
Donner, Allan, Nicolas Birkett, and Carol Buck. 1981. Ran- rican Educational Research Journal 37(1): 12351.
domization by Cluster American Journal of Epidemiology Raudenbush, Stephen W., and Anthony S. Bryk. 2002. Hie-
114(6): 90614. rarchical Linear Models. Thousand Oaks, Calif.: Sage Pu-
Donner, Allan, and Neil Klar. 2000. Design and Analysis of blications.
Cluster Randomization Trials in Health Research. London: Rooney, Brenda L., and David M. Murray. 1996. A Meta-
Arnold. Analysis of Smoking Prevention Programs after Adjust-
. 2002. Issues in the Meta-Analysis of Meta-Analy- ment for Errors in the Unit of Analysis. Health Education
sis of Cluster Randomized Trials. Statistics in Medicine Quarterly 23(1): 4864.
21(19): 297180. Verma, Vijay, and Than Le. 1996. An Analysis of Sampling
Guilliford, Martin C., Obioha C. Ukoumunne, and Susan Errors for Demographic and Health Surveys. Internatio-
Chinn. 1999. Components of Variance and Intraclass Cor- nal Statistical Review 64(3): 26594.
19
STOCHASTICALLY DEPENDENT EFFECT SIZES

LEON J. GLESER INGRAM OLKIN


University of Pittsburgh Stanford University

CONTENTS
19.1 Introduction 358
19.2 Multiple-Treatment Studies: Proportions 358
19.2.1 An Example and a Regression Model 359
19.2.2 Alternative Effect Sizes 362
19.3 Multiple-Treatment Studies: Standardized Mean Differences 365
19.3.1 Treatment Effects, An Example 366
19.3.2 Test for Equality of Effect Sizes and Confidence Intervals for
Linear Combinations of Effect Sizes 367
19.4 Multiple-Endpoint Studies: Standardized Mean Differences 368
19.4.1 Missing Values, An Example 370
19.4.2 Regression Models and Confidence Intervals 371
19.4.3 When Univariate Approaches Suffice 373
19.5 Conclusion 374
19.6 Acknowledgment 376
19.7 References 376

357
358 SPECIAL STATISTICAL ISSUES AND PROBLEMS

19.1 INTRODUCTION most precise estimates when the effect sizes are corre-
lated.
Much of the literature on meta-analysis deals with ana- In each of these above situations, possible dependency
lyzing effect sizes obtained from k independent studies in among the estimated effect sizes needs to be accounted
each of which a single treatment is compared with a con- for in the analysis. To do so, additional information has to
trol (or with a standard treatment). Because the studies be obtained from the various studies. For example, in the
are statistically independent, so are the effect sizes. multiple-endpoint studies, dependence among the end-
Studies, however, are not always so simple. For exam- point measures leads to dependence between the corre-
ple, some may compare multiple variants of a type of sponding estimated effect sizes, and values for between-
treatment against a common control. Thus, in a study of measures correlations will thus be needed for any analysis.
the beneficial effects of exercise on blood pressure, inde- Fortunately, as will be seen, in most cases this is all the
pendent groups of subjects may each be assigned one extra information that will be needed. When the studies
of several types of exercise: running for twenty minutes themselves fail to provide this information, the correla-
daily, running for forty minutes daily, running every other tions can often be imputed from test manuals (when the
day, brisk walking, and so on. Each of these exercise measures are subscales of a test, for example) or from
groups is to be compared with a common sedentary con- published literature on the measures used.
trol group. In consequence, such a study will yield more When dealing with dependent estimated effect sizes,
than one exercise versus control effect size. Because the we need formulas for the covariances or correlations.
effect sizes share a common control group, the estimates Note that the dependency between estimated effect sizes
of these effect sizes will be correlated. Studies of this in multiple-endpoint studies is intrinsic to such studies,
kind are called multiple-treatment studies. arising from the relationships between the measures used,
In other studies, the single-treatment, single-control whereas the dependency between estimated effect sizes
paradigm may be followed, but multiple measures will be in multiple-treatment studies is an artifact of the design
used as endpoints for each subject. Thus, in comparing (the use of a common control). Consequently, formulas
exercise and lack of exercise on subjects health, mea- for the covariances between estimated effect sizes differ
surements of systolic blood pressure, diastolic blood between the two types of studies, necessitating separate
pressure, pulse rate, cholesterol concentration, and so on, treatment of each type. On the other hand, the variances
may be taken for each subject. Similarly, studies of the of the estimated effect sizes have the same form in both
use of carbon dioxide for storage of apples can include types of studynamely, that obtained from considering
measures of flavor, appearance, firmness, and resistance each effect size in isolation (see chapters 15 and 16, this
to disease. A treatment versus control effect-size estimate volume). Recall that such variances depend on the true
may be calculated for each endpoint measure. Because effect size, the sample sizes for treatment and control, and
measures on a common subject are likely to be correlated, (possibly) the treatment-to-control variance ratio (when
corresponding estimated effect sizes for these measures the variance of a given measurement is assumed to be af-
will be correlated within studies. Studies of this type are fected by the treatment). As is often the case in analyses
called multiple-endpoint studies (for further discussions in other chapters in this volume, the results obtained are
of multiple-endpoint studies, see Gleser and Olkin 1994; large sample (within studies) normality approximations
Raudenbush, Becker, and Kalaian 1988; Timm 1999). based on use of the central limit theorem.
A special, but common, kind of multiple-endpoint study
is that in which the measures (endpoints) used are sub- 19.2 MULTIPLE-TREATMENT STUDIES:
scales of a psychological test. For study-to-study com- PROPORTIONS
parisons, or to have a single effect size for treatment ver-
sus control, we may want to combine the effect sizes Meta-analytic methods have been developed for deter-
obtained from the subscales into an overall effect size. mining the effectiveness of a single treatment versus a
Because subscales have differing accuracies, it is well control or standard by combining such comparisons over
known that weighted averages of such effect sizes are a number of studies. When the endpoint is dichotomous
required. Weighting by inverses of the variances of the (such as death or survival, agree or disagree, succeed or
estimated subscale effect sizes is appropriate when these fail), effectiveness is typically measured by several alter-
effect sizes are independent, but may not produce the native metrics: namely, differences in risks (proportions),
STOCHASTICALLY DEPENDENT EFFECT SIZES 359

Table 19.1 Proportions of Subjects Exhibiting Heart Disease for a Control and
Three Therapies.

Treatment

Control T1 T2 T3

Study n p n p n p n p

1 1000 0.0400 4000 0.0250 4000 0.0375


2 200 0.0500 400 0.0375
3 2000 0.0750 1000 0.0400 1000 0.0800 1000 0.0500

source: Authors compilation.


note: Here, n denotes the sample size and p the proportion.

log risk ratios, log odds ratios, and arc sine square root occurrence of heart disease is high, it will not take much
differences. In large studies, more than one treatment for all therapies in that study to show a strong improve-
may be involved, each treatment being compared to the ment over the control, whereas when the control rate in a
control. This is particularly true of pharmaceutical stud- study is low, even the best therapies are likely to only show
ies, in which the effects of several drugs or drug doses are a weak effect. In an alternative context, in education, if
compared. Because investigators may have different goals students in a school are very good, it is hard to show
or different resources, different studies may involve dif- much improvement, whereas if the students in a school
ferent treatments, and not all treatments of interest may have poor skills, then substantial improvement is possi-
be used in any particular study. When a meta-analytic ble. Given that this is the case, information about the ef-
synthesis of all studies that involve the treatments of in- fect size for therapy T1 can be used to predict the effect
terest to the researcher is later attempted, the facts that size for any other therapy (for example, T2 ) in the same
effect sizes within studies are correlated, and that some study, and vice versa.
studies may be missing one or more treatments, need to To illustrate the model, suppose that a thorough search
be accounted for in the statistical analysis. of published and unpublished studies using various regis-
ters yields three studies. Although all of these studies use
19.2.1 An Example and a Regression Model a similar control, each study need not include all the treat-
ments. The hypothetical data in table 19.1in which
Suppose that one is interested in combining the results of study 1 used therapies T1 and T2, study 2 used only therapy
several studies of the effectiveness of one or more of three T1, and study 3 involved all three therapiesmake this
hypertension therapies in preventing heart disease in males clear.
or females considered at risk because of excessively high For our illustration, estimated effect sizes will be dif-
blood-pressure readings. For specificity, suppose that the ferences between the proportion of the control group who
therapies are T1  use of a new beta-blocker, T2  use of a experience an occurrence of heart disease and the propor-
drug that reduces low-density cholesterol, and T3  use of tion of those treated who have an occurrence. Later we
a calcium channel blocker. Suppose that the endpoint is discuss alternative metrics to define effect sizes. In this
the occurrence (or nonoccurrence) of coronary heart dis- case, the effect size for each study is:
ease. In every study, the effects of these therapies are
compared to that of using only a diuretic (the control). dT  pC  pT . (19.1)
Although the data for convenience of exposition are hy-
pothetical, they are illustrative of studies that have ap- (Because we expect the treatments to lower the rate of
peared in this area of medical research. heart disease, in defining the effect size, we have sub-
That all therapies are compared to the same control in tracted the treatment proportion from the control propor-
each study suggests that the effect sizes for these thera- tion in order to avoid a large number of minus signs in the
pies will be positively correlated. If the control rate of estimates.) The effect-size estimates for each study are
360 SPECIAL STATISTICAL ISSUES AND PROBLEMS

Table 19.2 Effect Sizes (Differences of Proportions) nore the covariance between dT and dT , we would obtain
the variance of the effect-size difference dT  dT by sum-
1 2
from Table 1
1 2
ming the variances of dT and dT obtaining
Treatment 1 2

Study T1 T2 T3 var (dT  dT )  vr(dT )  vr(dT )


1 2 1 2

pT (1  pT ) pT (1  pT ) pC(1  pC)
 _________  _________  2 _________ .
1 1 2 2

1 0.0150 0.0025 nT nT nC 2
1

2 0.0125
3 0.0350 0.0050 0.0250 That this variance would be too large can be seen by
noting that
source: Authors compilation.
dT  dT  ( pC  pT )  (pC  pT )  pT  pT ,
1 2 1 2 2 1

shown in table 19.2. The dashes in tables 19.1 and 19.2


indicate where particular therapies are not present in a so that actually
study.
The estimated variance of a difference of proportions vr(dT  dT )  vr( pT  pT )  vr( pT )  vr( pT )
1 2 1 2 1 2

pC  pT is pT (1  pT ) pT (1  pT )
 _________  _________ .
1 1 2 2

nT nT
p (1  p ) p (1  p )
1 2

vr ( pC  pT)  _________
C
nC
C
 _________
T
nT
T
, (19.2)
In our hypothetical data, the difference between these
two variance estimates is (relatively) quite large:
where nC and nT are the number of subjects in the control
and therapy groups, respectively. The estimated covari-
var(dT  dT ) 109(44,494  47,423) 109(91,917)
ance between any two such differences pC  pT and pC  1
1 2

pT is equal to the estimated variance of the common con-


2
trol proportion pC, namely, versus

pC(1  pC) (.0250)(.9750) (.0375)(.9625)


cv( pC  pT , pC  pT )  _________
nC . (19.3) vr(dT  dT )  ____________
1 4000 2
 ____________
4000
1 2

 109(15,117)
Notice that the variance and covariance formulas are the
same regardless of whether we define effect sizes as pC  pT
indicating that in such a comparison of effect sizes it
or pT  pC , as long as the same definition is used for all
would be a major error to ignore the covariance between
treatments and all studies.
the effect-size estimates. A large-sample confidence in-
Thus, the following are the estimated covariance ma-
terval for the difference of the two population effect sizes,
trices for the three studies. Note that a covariance matrix
for example, would be excessively conservative if the co-
for a study gives estimated variances and covariances
variance between the effect-size estimates dT and dT were
only for the effect sizes observed in that study. 1 2
ignored, leading to the use of an overestimate of the vari-
ance of dT  dT . On the other hand, if we were interested
 109 44, 494 38, 400 ,
1 2

Study 1: 1 38, 400 47, 423


in averaging the two effect sizes to create a summary ef-
fect size for drug therapy, we would underestimate the
 109 (327, 734), variance of (dT  dT )/2 if we failed to take account of the
Study 2: 2
1
covariance between dT and dT .
2

1 2

73, 088 34, 688 34, 688 The existence of covariance (dependence) between ef-
9 fect sizes within studies means that even when a given

Study 3: 3  10 34, 688 108, 288 34, 688 . (19.4)
study has failed to observe a given treatment, the effect
34, 688 34, 688 82,188 sizes for other treatments that were observed and com-
pared to the common control can provide information
In study 1, we might be interested in comparing the about any given effect size. To clarify this point, let us
effect sizes dT and dT of treatments T1 and T2. If we ig-
1 2
display the estimated effect sizes in a six-dimensional
STOCHASTICALLY DEPENDENT EFFECT SIZES 361

column vector d, where the first two coordinates of d are c11 c12 c13
the effect sizes from study 1, the next coordinate is the
C  c21 c22 c23  ( X 1 X )1
effect size from study 2 and the remaining two coordi-
nates are effect sizes from study 3. Thus, c c32 c33
31

d  (0.0150, 0.0025, 0.0125, 0.0350,  0.0050, 0.0250). 24, 612 19, 954 13, 323
 109 19, 954 28, 538 13, 255 . (19.7)

Let T denote the population effect size for treatment Tj ,
j
13, 323 13, 255 69, 807
j  1,2,3. Then we can write the following regression
model for the vector   (T , T , T ,) of population ef-
1 2 3
Three points are worth noting here. First, observe that
fect sizes: even though treatment T3 appeared only in study 3, the
estimator of the effect size for this treatment that we have
d  X  , (19.5) obtained by regression using equation 19.6 differs from
the effect size for treatment T3 given for study 3 in table
where  is a random vector of the same dimension as d 19.2 (0.0211 versus 0.0250). Note also that the estimated

having mean vector 0 and variance-covariance matrix  variance of the estimate of the effect size for treatment 3
of block diagonal form: that is given in equation 19.7, namely 109 (69,807), is
noticeably smaller than the estimated variance of the ef-
fect size for this treatment in study 3, namely 109
 0 0
1 (82,188). This indicates that information from the esti-

  0

 0,
2
mated effect sizes of other treatments has been used in
0 0
 3
estimating the effect size of treatment T3.
Second, suppose that we ignore the covariances among
estimated effect sizes within studies and use the usual
in which the matrices  , i 1,2,3 are those displayed in weighted combination of estimated effect sizes for a
i
equation 19.4. In the regression model 19.5, the design treatment over studies to estimate the population effect
matrix is size for that treatment. That this estimate will differ from
the one given by equation 19.6 has already been shown
for treatment T3. For another example, consider treatment
1 0 0 T1. The usual weighted estimate of the population effect
0 1 0 size for treatment T1, where the weights are inversely pro-

1 0 0 portional to the estimated variances, is:
X  .
1 0 0
 T
0 1 0
1

(44,494) (0.0150)  (327,734) (0.0125)  (73,088) (0.0350)


1 1 1
 ____________________________________________________
0 0 1 (44,494)  (327,734)  (73,088)
1 1 1

 .0218,
Note that the first two rows of X pertain to study 1, the
next row to study 2, and the remaining three rows to which has (estimated) variance
study 3. Using weighted least squares, we can form the
best linear unbiased estimator vr( )
 109[(44,494)1  (327,734)1  (73,088)1]1

1d  (0.0200, 0.0043, 0.0211)


1X)1X  109(25,505).
  (X
(19.6) Compare this to the estimate  T  0.0200 and estimated
1
variance
of   (T , T , T ). The estimated variance-covariance
1 2 3
matrix of the estimator  is c11  vr(  T )  109(24,612)
1
362 SPECIAL STATISTICAL ISSUES AND PROBLEMS

obtained from fitting the model 19.5, taking account of An estimate of the large-sample variance of the log odds
the covariances among estimated effect sizes in the stud- ratio dT is
ies. Again, the variance of  T is smaller than the variance
1 1
vr(dT)  ___________ ___________
n p (1 p )  n p (1 p )
,
1
of  T . (19.10)
1 C C C T T T
Last, note that the regression model 19.5 assumes that
the population effect sizes for a given treatment are the where nc and nT are the sample sizes of the observed con-
same for all studiesthat is, that effect sizes for each trol and treatment groups, respectively, and pc and pT are
treatment are homogeneous over studies. Of course, this the observed control and treatment proportions. An esti-
need not be the case. However, we can test this hypothe- mate of the large-sample covariance of estimated log odds
sis by computing the test statistic ratios dT and dT of treatment groups T1 and T2 versus a
1 2

common control group C is



Q  (d X) 1(dX) 1d  X
 d  1X
1
cv(dT , dT )  ___________ . (19.11)
 37.608  30.417  17.191. (19.8) 1 n p (1 p )
2
C C C

The statistic Q has a chi-squared distribution with de- A regression model of the form given in section 19.2.1
grees of freedom equal to can now be used to estimate the common effect sizes for
the treatments. For the data in table 19.1, the vector d of
(dimension of d)  (dimension of )  6  3  3 log odds ratio effect sizes is

under the null hypothesis of homogeneous (over studies) d  (0.4855, 0.0671, 0.3008, 0.6657,  0.0700, 0.4321),
effect sizes. The observed value of Q in equation 19.8 is
between the 90th and 95th percentiles of the chi-squared and the estimated large-sample covariance matrix of d is
distribution with three degrees of freedom, so that we can
reject the null hypothesis of homogeneous effect sizes at  0 0
1
the 0.10 level of significance, but not at the 0.05 level. This
  0

 0 ,
result casts some doubt on whether the population effect 2

sizes are homogenous over studies, and consequently we 0 0
 3
will delay illustrating how to construct confidence inter-
vals for, or make other inferences about, the elements of now denotes the matrix of estimated large-sam-
where 
the vector  in the model 19.5 until we consider the alter- i
ple variances and covariances among the estimated log
native effect-size metric of the log odds ratio.
odds ratios for study i, i  1,2,3; namely,

19.2.2 Alternative Effect Sizes


 0.03630 0.02604 ,
Study 1: 
Differences between proportions are not the only way
1 0.02604 0.03297
that investigators measure differences between rates or
proportions. Other metrics that are used to compare rates  0.17453,
Study 2: 2
or proportions are: log odds ratios, log ratios of propor-
tions, differences between the arc sine square root trans- 0.03325 0.00721 0.00721
forms of proportions (the variance-stabilizing metric).  0.00721 0.02079 0.00721 .
Study 3: 
Odds Ratio and Log Odds Ratio The log odds ratio
3

0.00721 0.00721 0.02826
between a treatment group T and a statistically indepen-
dent control group C is
Using the model

[
dT  log _________
(1 pC )pT]
pC (1 pT )
d  X  ,
 log[ pC /(1 pC)]  log[ pT /(1 pT)]. (19.9)
where X is the same design matrix used in section 19.2.1,
 is the vector of common population log odds ratios for
STOCHASTICALLY DEPENDENT EFFECT SIZES 363

the treatments versus the common control, and  is a ran- population parameters is 0.95, then conservative simulta-
dom vector having mean vector 0 and covariance ma- neous 95 percent confidence intervals for the population
the best linear unbiased estimator of  is
trix , effect sizes can be obtained using the Boole-Bonferroni
inequality (Hochberg and Tamhane 1987; Hsu 1996;
1X] 1X
  ( T ,  T ,  T )  [X 1d Miller 1981). In this case, z would be the 100(1/3) 5 
1 2 3
98.33rd percentile of the standard normal distribution,
 (0.5099, 0.0044, 0.4301)
that is, z  2.128. For our present example, a 95 percent
confidence interval for the population effect size for treat-
whose estimated covariance matrix is
ment T1 is
_______ _______
c11 c12 c13 (0.5099  1.960.01412 , 0.5099  1.96 0.01412 )
C  c21 c22 c23  [ X
1 X ]1
 (0.277, 0.743).

c c32
c33
31
Because this interval contains 0, we cannot reject the null
0.01412 0.00712 0.00425 hypothesis at a 5 percent level of significance that the log
 0.00712 0.01178 0.00455 . odds-ratio effect size for treatment T1 equals 0, or equiva-
lently that the odds ratio for treatment T1 versus the con-
0.00425 0.00455 0.02703 trol group equals 1.
Simultaneous 95 percent confidence intervals for the
As before, a test of the model d  X   of homoge- log odds-ratio effect sizes for treatments T1, T2, and T3 are
neous (over studies) log odds-ratio effect sizes yields a _______ _______
test statistic T1 : (0.5099  2.1280.01412 , 0.5099  2.1280.01412 )
 (0.257, 0.763),
1d   X
Q  d 1X  33.177  31.120  2.057
_______ _______
T2 : (0.0044  2.1280.01178 , 0.0044  2.1280.01178 )
that is a little smaller than the median of the chi-squared
distribution with 3 degrees of freedom (the distribution of  ( 0.227, 0.235),
Q under the null hypothesis); that is, the p-value for the _______ _______
null hypothesis is greater than 0.50. Consequently, for T3 : (0.4301 2.1280.02703 , 0.4301 2.1280.02703 )
this effect-size metric for comparing treatment and con-  (0.080, 0.780).
trol proportions, the model of homogeneous effect sizes
appears to fit the data. In such a case, we might be inter- Note that the interval for the log odds-ratio effect size for
ested in forming confidence intervals for the population treatment T3 does not contain 0. Thus we can reject the
effect sizes T , j 1, 2, 3. These intervals have the form
j
null hypothesis that this population effect size is zero.
_______ _______
__ __ Because we used simultaneous confidence intervals for
j j
(  Tj z vr(  T ) ,  T z vr(  T ))(  T zcjj ,  T zcjj ),
j j j
the log odds-ratio effect sizes, the fact that we looked at
(19.12) all three such effect sizes before finding one or more sig-
nificantly different from zero has been accounted for.
where z is a percentile of the standard normal distribu- That is, this method of testing the individual effect sizes
tion. If we are interested in only one of the population controls at 0.05 the probability that at least one effect size
effect sizes, and not in comparison or contrast to the other is declared to be different from zero by the data when, in
population effect sizes, and if we wanted a 95 percent fact, all three population effect sizes equal zero.
confidence interval for this population effect size, then z Confidence intervals for log odds-ratios of the form
needs to be the 97.5th percentile of the standard normal 19.12, whether individual confidence intervals or simul-
distribution, that is, z  1.96. On the other hand, if we are taneous confidence intervals, yield corresponding confi-
interested in joint or simultaneous confidence intervals dence intervals for the odds ratios
for the population effect sizes, such that the probability
that all three confidence intervals contain their respective T  pC (1 pT )/(1 pC)pT
i i i
i  1, 2, 3,
364 SPECIAL STATISTICAL ISSUES AND PROBLEMS

by simply exponentiating the endpoints of the confidence 0.03017 0.00617 0.00617


intervals. For example, the individual 95 percent confi- Study 3: 3  0.00617 0.01767 0.00617 .

dence interval (0.277, 0.743) for the log odds-ratio effect
size for treatment T1 yields the 95 percent confidence in- 0.00617 0.00617 0.02517
terval
Note that for the data of table 19.1, there is very little
(e0.277, e0.743)  (1.319, 2.102) difference between the effect sizes based on log odds ra-
tios and the effect sizes for log ratios, and also little dif-
for T . ference between their respective estimated covariance
i

Ratios and Log Ratios Rather than taking ratios of matrices. This is not surprising considering that all sam-
the odds ratios pC /(1 pC ) and pT /(1 pT ) for the control ple proportions p for the controls and treatments in the
and treatment groups, respectively, investigators often three studies are small, between 0.025 and 0.080, and
find the simple ratio pC /pT to be more meaningful. This thus p and p / (1 p) are nearly equal. The statistical anal-
ratio can be converted to a difference ysis of log ratio (of proportions) effect sizes for this ex-
ample follows the same steps as for log odds ratios and
dT  log(pC /pT)  log pC  log pT (19.13) thus yields similar estimates, test of fit of the model of
homogenous (over studies) population effect sizes, and
and can be treated by the methodology of this chapter. confidence intervals for population effect sizes as those
The estimated large-sample variance of a log ratio (of already given for log odds ratios.
proportions) dT is Differences of Arc Sine Square Root Transforma-
tions of Proportions Still another metric used to define
1 pC 1 pT effect sizes for proportions is based on the variance-sta-
vr(dT)  _____ _____
nC pC  nT pT , (19.14) __
bilizing transformation 2 arcsin(p ), measured in radi-
ans. For a sample proportion p based on a large sample
and the large-sample covariance between two log ratios __
size n, the variance of 2 arcsin(p ) is 1/n. Thus, if the
from the same study is
estimated effect size for comparing a treatment group T
1 pC and an independent control group C is defined to be
cv(dT , dT )  _____
1 2
nC pC . (19.15)
___ __
dT  2 arcsin pC  2 arcsin pT ,
The following are the vector d of effect sizes and the esti-
mated large-sample covariance matrix  of d: 1 1
then the large-sample variance of dT is var(dT)  ___ __
nC  nT ,
independent of the proportions pC and pT. (This indepen-
d  (0.4700, 0.0645, 0.2877, 0.6286, 0.0645, 0.4055) dence of the values of the proportions is particularly im-
portant when one or both of the proportions are equal to
and zero.) Similarly, the large-sample covariance between the
estimated effect sizes for two treatment groups T1 and T2

1
0 0 versus a common independent control group is

 0

 0, 1
2
cov(dT , dT )  ___
nC .
0 0
 3
1 2

The regression analysis now proceeds in the same steps


as with the log odds ratio effect sizes or the difference of
where proportion effect sizes. For the data in table 19.1,

 0.03375 0.02400 ,
Study 1: 
d  (0.0852, 0.0130, 0.0613, 0.1521,  0.0187, 0.1038),
1 0.02400 0.03042
0.00125 0.00100

  0.00750,
, 
 0.15917,
Study 2: 2
1
0.00100 0.00125 2
STOCHASTICALLY DEPENDENT EFFECT SIZES 365

0.00150 0.00050 0.00050 19.3 MULTIPLE TREATMENT STUDIES:


3  0.00050 0.00150 0.00050 ,
STANDARDIZED MEAN DIFFERENCES

0.00050 0.00050 0.00150 A common way of defining effect sizes when the response
variable is quantitative is by the difference between the
so that means of the response for the treatment and for the con-
trol, standardized by the standard deviation of the re-
sponse variable (assumed to be the same for both treat-
  (0.1010, 0.0102, 0.0982),
ment and control groups). Letting T and C be the
population means of the response variable for the treat-
and
ment and control groups, respectively, and letting  be
the common population standard deviation for both
c11 c12 c13 groups, the effect size for treatment T relative to the con-
C c
cv( ) c c23 trol C is defined as
21 22
c c32 c33
31  
T  _______
T

C
. (19.16)
0.00058 0.00040 0.00024
 0.00040 0.00061 0.00025 .
In a study in which nT observations are assigned to treat-
ment T and nC observations are assigned to the control C,
0.00024 0.00025 0.00137 and _in which the sample means for treatment and control
_
are yT and yC , respectively, and the pooled sample stan-
Further, Q  36.06931.805  4.264 is between the 75th dard deviation is s, the usual estimate of T is
_ _
and 80th percentiles of the chi-squared distribution with y y
three degrees of freedom, so that the p-value for the null dT  ______
T C
s . (19.17)
hypothesis of homogeneity of population effect sizes
over studies is between 0.20 and 0.25. (The precise p- If the study is a multiple treatment study, with treatments
value associated with the value of Q here is 0.234.) This Tj , j  1, 2, . . . , p, and a common control, the effect-size
is extremely weak evidence against the null hypothesis, estimates dT will be statistically dependent because they
j

so we might go on to construct individual or simultane- have yC and s in common. If the sample sizes nT , j 1, j

ous confidence intervals for the population effect sizes. 2, . . . , p, and nc are large enough for the central limit
For example, a 90___percent individual confidence interval theorem approximation to hold, then the effect-size esti-
___ mates will be approximately jointly normally distributed
for T  2 arcsin pC  2 arcsin pT is
i 1
with the mean of dT being T , j 1, . . . , p, the (esti-
_______ _______ j j

(0.1010 1.645 0.00058 , 0.1010 1.645 0.00058 ) mated) large-sample variance of dT being j

dT2
 (0.061, 0.141). 1
vr(dTj)  ___ 1
___ _____
nT  nC  2ntotal , j 1, . . . , p,
j
(19.18)
j

The results show that one can come to different con- and the (estimated) large-sample covariance between dT j
clusions about treatment effects relative to a control, de- and dT being
j*
pending on which metric is used to define an effect size
d T dT
for proportions. This is already known for a single treat- 1
cv(dT , dT )  ___ _____
nC  2n , j  j*,
j j*
(19.19)
ment versus a control, so it is not surprising that the j j* total

choice of the metric also affects inferences in multiple


where ntotal  nC  j 1nT is the total sample size of the
k
treatment studies. No matter the metric, it is important to j

take account of the fact that effect sizes will be statisti- study. In distributional approximations concerning the ef-
cally dependent in multiple treatment studies (for a dis- fect sizes, the estimated large-sample variances and co-
cussion of characteristics of these metrics, see chapter 13, variances are treated as if they were the true large-sample
this volume). variances and covariances of the estimated effect sizes.
366 SPECIAL STATISTICAL ISSUES AND PROBLEMS

Comment 1. The large-sample variances and covari- Table 19.3 Summary Data for Four Studies of Exercise
ances estimated in equations 19.18 and 19.19 are derived and Systolic Blood Pressure
under the assumption that the responses are normally dis-
tributed. If the distribution is not normal, the large-sample Intensive Light Pooled
variances and covariances of treatment effect estimates Control Exercise Exercise Standard
will involve the measures of skewness and kurtosis of the yc nc yT nT yT nT s
distribution, and these must be determined (or estimated) 1 1 1 1

if var(dT ) and cov(dT , dT ) are to be estimated.


j j j* Dev. Study
Comment 2. Because the treatments may affect the 1 1.36 25 7.87 25 4.35 22 4.2593
variances of the responses as well as their means, some 2 0.98 40 9.32 38 2.8831
meta-analysts define the population effect size T as the 3 1.17 50 8.08 50 3.1764
difference in population means between treatment and 4 0.45 30 7.44 30 5.34 30 2.9344
control divided by the standard deviation of the control
population. They will then estimate this effect size by source: Authors compilation.
dividing the difference in sample means between treat-
ment and control by the sample standard deviation sC of Table 19.4 Estimated Effect Sizes for the Data in
the control group. If one can nevertheless assume that the Table 19.3 (Computed Using
population response variances are the same for the con- Equation 19.17).
trol and all treatments, then to account for the use of sC in
place of s in the definition of dT one need only replace Intensive Exercise Light Exercise
ntotal by nC in the formulas 19.18 and 19.19. If one cannot
assume that the treatment and control population vari- Study
ances are the same, estimating the large-sample variances 1 2.1670 1.3406
and covariances will require either the sample treatment 2 2,8927
and control variances from the individual studies or infor- 3 2.1754
mation about the population variances of the treatment 4 2.3821 1.6664
and control groups (Gleser and Olkin 1994, 346; see also source: Authors compilation.
2000).
Comment 3. The definition of effect size in equation
19.16 is the usual one. If we expect control means to ex- medical, but many similar examples are available in the
ceed treatment means, as in section 19.2, then to avoid a social sciences.
large number of negative estimates we can instead define
the effect size of a treatment by subtracting the mean of 19.3.1 Treatment Effects, An Example
the treatment population from the mean of the control
populationthus, T  (C  T)/. As long as this is Suppose that a survey of all randomized studies of changes
done consistently over all studies and over all treatments, in systolic blood pressure due to exercise had found four
the formulas 19.18 and 19.19 for the estimated variances such studies and identified two types of treatment used in
and covariances of the effect sizes remain the same. those studies: intensive exercise, such as running for thirty
When several multiple treatment studies are the focus minutes; light exercise, such as walking for twenty min-
of a meta-analysis, and particularly when not all treat- utes. All four studies had considered intensive exercise in
ments appear in all studies, the regression model of sec- comparison to an independent nonexercise control group,
tion 19.2 can be used to estimate treatment effects, as- but only two had included a group who did light exercise.
suming that such effects are common to all studies (each The studies were not large, but the sample sizes were
treatment has homogeneous effect sizes across studies). large enough that the sample means could be regarded
Also, the statistic Q introduced in section 19.2 can be as having approximately normal distributions. Table 19.3
used to test the assumption of homogeneity of effect contains summary datasample sizes, sample means,
sizes. We illustrate such an analysis in the example that pooled sample standard deviationsfor the four studies.
follows. Again, the example is hypothetical, though it is The response variable was the reduction in systolic blood
inspired by meta-analyses in the literature. The context is pressure after a three-month treatment. Table 19.4 pres-
STOCHASTICALLY DEPENDENT EFFECT SIZES 367

ents the estimated effect sizes for each treatment in each of common, cross-study population effect sizes for treat-
study. ments T1 and T2 is
As in section 19.2, we create a vector d of estimated
effect sizes: 1X]1X
  ( T ,  T ) [X 1d  (2.374, 1.570)
1 2

d  (2.1670, 1.3406, 2.8927, 2.1754, 2.3821, 1.6664), and the estimated covariance matrix of  is

where the first two elements in d pertain to study 1, the c c 11 X ]11  0.02257 0.01244 .
C  11 12  [X
 0.01244 0.03554
next element pertains to study 2, the next to study 3, and 21 22
c c
the last two elements in d are from study 4. A model of
(19.21)
homogeneous effect sizes (over studies) yields the fol-
lowing regression model for d: Proceeding to test the goodness-of-fit of the model
19.20, that is, of homogeneity of population effect sizes
d  X  , (19.20) for treatments over the studies, we find that
where   (T , T ) is the vector of population effect sizes
1 2
1d X
Q  d  1 X  256.110  252.165  3.945.
for treatments T1 (intensive exercise) and T2 (light exer-
cise) , the design matrix for the regression is Under the null hypothesis 19.20, Q has a chi-squared
distribution with degrees of freedom  dimension of d 
1 0 dimension of   6  2  4.
0 1 Because the observed value of Q is the 59th percentile
of the chi-squared distribution with 4 degrees of freedom,
1 0 ,
X  the model 19.20 appears to be an adequate fit to the data.
1 0
1 0 19.3.2 Test for Equality of Effect Sizes and

0 1 Confidence Intervals for Linear
Combinations of Effect Sizes
and  has a multivariate normal distribution with vector A natural question is whether the population effect sizes
of means equal to 0 and (estimated) covariance matrix for the treatments are equal; if not, we might want to con-
struct a confidence interval for the difference T  T . It

 0
1 2

11 0 0 is better to construct the confidence interval first and then


use this to test whether T  T . Note that
 0 

 22 0 0 1 2


 .
0 0 33 0 vr(  T   T )  vr(  T )  vr(  T )  2cv(  T ,  T )
1 2 1 2 1 2

0 4


0 0 4  c11  c22  2c12 .

Here, making use of equations 19.18 and 19.19, Thus, a 95 percent approximate confidence interval for
T  T is
1 2

 0.11261 0.06017 ,
______________
  0.10496,
  T   T 1.96 (c11  c22  2c12)
1 0.06017 0.09794 2 1 2

or
 0.09819 0.05539 ,
 0.06366, 
 3 4 0.05539 0.08209 ____________________________
2.374 1.570 1.96 [0.02257  0.03554  2(0.01244)] ,

Fitting the regression model 19.20 by weighted least which yields the interval (0.447, 1.161). Because this in-
squares as in 19.6, we find that the estimate of the vector terval does not contain 0, we can reject the null hypothesis
368 SPECIAL STATISTICAL ISSUES AND PROBLEMS

that T  T at level of significance 0.05. Based on this


1 2
we must pay a price in terms of the lengths of the confi-
hypothetical analysis, intensive exercise is more success- dence intervals. Thus, individual 95 percent confidence
ful than light exercise at reducing systolic blood pressure intervals for T and T are
1 2

by between 0.447 and 1.161 standard deviations. _______


An alternative way of performing the test of H0: T  T : 2.374 1.96 0.02257
1
or 2.374  0.294,
1 _______
T would be to fit the regression model
2
T : 1.570 1.96 0.03554 or 1.570  0.369,
2

d  (1,1,1,1,1,1)  , and simultaneous 95 percent confidence intervals for T


(19.22) and T using the Boole-Bonferroni inequality are
1

  common value of T and T , 1 2


2

_______
using weighted least squares and test this model against T : 2.374  2.2414 0.02257
1
or 2.374  0.337,
_______
the alternative hypothesis 19.20. To do so, we compute T : 1.570  2.2414 0.03554 or 1.570  0.423.
the test statistic 2

In comparison, using the simultaneous confidence inter-


L  lack of fit of model (19.22)  lack of fit of val procedure 19.23 would yield
model (19.20) _______
 [d 1d   1(1,1, . . . ,1)]
2[(1,1, . . . ,1) T : 2.374  2.4477 0.02257
1
or 2.374  0.368,
_______
 [d 1d X
 1X] T : 1.570  2.4477 0.03554
2
or 1.570  0.461.

X 1X   1(1,1, . . . ,1)]
2[(1,1, . . . ,1) The difference of interval lengths is noticeable even here,
where the number of effect sizes being considered is
 252.165  232.705 19.460, small.
which, under the null hypothesis H0: T  T , has a chi- 1 2
squared distribution with degrees of freedom equal to di- 19.4 MULTIPLE-ENDPOINT STUDIES:
mension of   dimension of   2 1 1. The p-value STANDARDIZED MEAN DIFFERENCES
of L for the null hypothesis is essentially 0, so that we can
Multiple-endpoint studies involve a treatment group T and
reject the null hypothesis of equal effect sizes at virtually
an independent control group C. (In some studies, another
any level of significance.
treatment may be used in place of a control.) On every
As in section 19.2, we can construct 95 percent simul-
subject in a study, p variables (endpoints) Y1, Y2, . . . , Yp
taneous confidence intervals for T and T using the 1 2
are measured.
Boole-Bonferroni inequality. However, we might also be
For both the treatment (T) group and the control (C )
interested in confidence intervals for T  T and (T  1 2 1
group, it is assumed that the joint distribution of Y1,
T )/2. To obtain simultaneous 100(1 ) percent confi-
2
Y2, . . . , Yp is multivariate normal with a common matrix
dence intervals for these quantities, and for all other lin-
  (jj*) of population variances and covariances, but
ear combinations aT  bT of T and T , we form the
1 2 1 2
possibly different vectors C  (C1, C2, . . . , Cp), T 
intervals
(T 1, T 2, . . . , Tp), of population means.
_____ ___________________
The population effect size for the jth endpoint (vari-
a  T  b  T 2;1
2
(a2c11  2abc12  b2c22) , (19.23)
1 2
able) Yj for a multiple-endpoint study is defined as
where C  (cjj*)  cv() is the estimated covariance ma- Cj  Tj
j  ________
___ ,
 j  1, . . . , p, (19.24)

trix of   (T , T ) given by 19.21 and 22;1 is the
jj
1 2
100(1 ) percentile of the chi-squared distribution with where again we take differences as control minus treat-
2 degrees of freedom. The probability that all intervals of ment because we expect the means corresponding to the
the form 19.23, for all choices of the constants a and b, control to be larger than the means corresponding to the
will simultaneously cover the corresponding linear com- treatments, and we want to avoid too many minus signs in
binations aT  b T of the population effect sizes is
1 2
the estimated effect sizes. Let   (1, . . . , p) denote
1 . To achieve simultaneous coverage of the confi- the population effect-size vector for such a study. The
dence intervals for all linear combinations of T and T , 1 2
usual estimate of j would be
STOCHASTICALLY DEPENDENT EFFECT SIZES 369
__ __
YCj  YT j large kp-dimensional vector of estimated effect sizes, and
dj  _______
__ ,
s (19.25) the estimated covariance matrices will be diagonal blocks
jj
__ __ of a combined kp  kp estimated covariance matrix. This
where Y Cj, Y Tj are the sample means for the j-th endpoint can be done even if not all studies measure all p possible
from the control group and treatment group, respectively, endpoints. In this case, the dimension p(i) of the row vec-
and sjj is the pooled sample variance for the j-th endpoint. tor d(i ) of estimated effect sizes for the endpoints mea-
Because the measurements YC1,YC2 , . . . , YCp of the end- sured in the i-th study will depend on the study index i, as
points for any subject in the control group are dependent, will the dimension of the p(i)  p(i) estimated large-sam-
as are those of the endpoints for any subject__in the__ treat- ple covariance matrix (i) of d(i), for i 1, . . . , k. In
ment group, so will the differences of means YC j  YT j and general, then, the combined estimated effect-size column
the pooled sample variances sjj, j 1, . . . , p. The effect- vector d  (d(1), . . . , d(k)) will have dimension p* 1,
size estimates d1, d2 , . . . , dp will in turn also be depen- and the corresponding block-diagonal estimated large-
dent. The covariances or correlations among the djs will sample covariance matrix
similarly depend on the covariances or correlations
among the endpoint measurements for subjects in the  (i) 0
(1) 0
control or treatment groups.
 0
Unfortunately, not all studies will report sample co-  O 0 ,
variances or correlations. This forces us to impute values 0 (k)
0 (k)

for the population covariances or correlations from other
sources, either other studies or, when the endpoints are will have dimension p*  p*, respectively, where p* 
subscores of a psychological test, from published test k
i 1 p(i).
manuals. It is most common for sample or population We can now write the model of homogeneous (over
correlations to be reported, rather than covariances, so we studies) effect sizes as the regression model
will give formulas for the large-sample variances and co-
variances of the djs in terms of correlations among the d  X  , (19.28)
endpoint measures.
Under the assumption that the endpoint measures for where  has a p*-dimensional normal distribution with
the control and treatment groups have a common matrix mean vector 0 and covariance matrix , and where the
of variances and covariances, and that no observations design matrix X has elements xij which equal 1 if end-
are missing, the large-sample variance of dj can be esti- point j is measured in study i, and otherwise equal 0. As
mated by before, the vector  of common (over studies) population
1 ___
 vr(d )  __ 1 dj 2
effect sizes for the p endpoints is estimated by weighted
jj j nT  nC  2(n  n )
_________ (19.26) least squares
T C

as already noted in section 19.3, equation 19.18. The large- 1X )1X
  ( 1, . . . ,  p)  (X 1d, (19.29)
sample covariance between dj and dj* can be estimated by
and has estimated covariance matrix
 __
 jj* (
1 ___1
) (
dj dj*
nT  nC  jj*  2(nT  nC)  jj* ,
_________ 2
) j  j*, (19.27)
1X)1.
C  (cjj*)  (X (19.30)
where  jj* is an estimate of the correlation between Yj and
Yj* . When the sample covariance matrix of the endpoint A test of the model of homogeneous effect sizes 19.28
measures for the study is published,  jj* can be taken to be against general alternatives can be based on the test sta-
the sample correlation, rjj*. Otherwise,  jj* will have to be tistic
imputed from other available information, for example,
from the sample correlation for endpoints j and j* taken 
Q  (d  X) 1(d  X) 1d X
 d  1X
from another study. As before, each study will have its (19.31)
own vector d  (d1, . . . , dp) of effect-size estimates and
corresponding estimated large-sample covariance matrix. whose null distribution is chi-squared with degrees of free-
The vectors from the k studies can be combined into one dom equal to the dimension p* of d minus the dimension
370 SPECIAL STATISTICAL ISSUES AND PROBLEMS

Table 19.5 Effect of Drill Coaching on Differences Between Post-Test and Pre-Test
Achievement Scores

Means
Pooled
Control Treatment Standard Dev.

n Math Reading n Math Reading Math Reading Correlation 

School
1 22 2.3 2.5 24 10.3 6.6 8.2 7.3 0.55
2 21 2.4 1.3 21 9.7 3.1 8.3 8.9 0.43
3 26 2.5 2.4 23 8.7 3.7 8.5 8.3 0.57
4 18 3.3 1.7 18 7.5 8.5 7.7 9.8 0.66
5 38 1.1 2.0 36 2.2 2.1 9.1 10.4 0.51
6 42 2.8 2.1 42 3.8 1.4 9.6 7.9 0.59
7 39 1.7 0.6 38 1.8 3.9 9.2 10.2 0.49

source: Authors compilation.

p of . When all p end points are measured in each of the k the possible effects of coaching on SAT-Math and SAT-
studies, the degrees of freedom of Q under the null hypoth- Verbal scores (Becker 1990; Powers and Rock 1998, 1999).
esis 19.28 is equal to p(k1). An example illustrating these A group of seven community colleges have instituted sum-
calculations appears in the following section (19.4.1). mer programs for high school graduates planning to at-
Comment. Our discussion has dealt with situations tend college. One of the stated goals of the summer pro-
where certain endpoints are not measured in a study, so gram is to improve the students skills in mathematics
that all observations on this endpoint can be regarded as and reading comprehension. There is discussion as to
missing. Because such missingness is a part of the de- whether coaching by drill (the treatment) is effective for
sign of the study, with the decision as to what endpoints this purpose, as opposed to no coaching (the control).
to use presumably made before the data were obtained, Consequently, each student enrolled for the summer pro-
this type of situation is not the kind of missing data of gram is randomly assigned to either coaching or non-
concern in the literature (see, for example, Little and coaching sessions. At the beginning of the summer ses-
Rubin 1987; Little and Schenker 1994). It is, of course, sion, every student is tested in mathematics and reading
not unusual for observations not to be obtained on a sub- comprehension. Later, their scores in mathematics and
set of subjects in a study. If such observations are missing reading comprehension on a posttest are recorded. There
because measurements are missing, bias in the statistical are p  2 endpoints measured on each subjectnamely,
inference is possible. The meta-analyst can generally do posttest minus pretest scores in mathematics and in read-
little about such missing data because original data are ing comprehension. Sample sizes, sample means, and
seldom accessible. Whatever methodology used by a study pooled sample variances and correlations are given for
to account for its missing data will thus have to be ac- these measurements in table 19.5. Note that the sample
cepted without adjustment. This is a potential problem, sizes are not the same, even within a particular summer
but a few such missing values are not likely to seriously program at a particular community college, because of
affect the values of estimates or test statistics in the large- dropouts. Only those students who completed the sum-
sample situations where these methods can be applied. mer session and actually took the posttest are included
in the data. All students therefore had measurements on
19.4.1 Missing Values, An Example both endpoints.
Using equations 19.25 through 19.27, we obtain table
Consider the following hypothetical situation, which is 19.6, which for each school gives the estimated effect
based loosely on published meta-analyses that deal with sizes for mathematics and reading comprehension attrib-
STOCHASTICALLY DEPENDENT EFFECT SIZES 371

Table 19.6 Estimated Effect Sizes for Drill Coaching and Their Estimated Covariance
Matrices

Estimated Variance and Covariances


Effect Sizes of the Effect Size Estimates

dmath dreading vr(dmath) vr(dreading) cv(dmath, dreading)

School
1 0.976 0.562 0.0975 0.0906 0.0497
2 0.880 0.202 0.1045 0.0957 0.0413
3 0.729 0.157 0.0874 0.0822 0.0471
4 0.545 0.694 0.1152 0.1177 0.0756
5 0.121 0.010 0.0542 0.0541 0.0276
6 0.104 0.089 0.0477 0.0477 0.0281
7 0.011 0.324 0.0520 0.0526 0.0255

source: Authors compilation.

utable to drill coaching, and also the (estimated) large- the groups. This conjecture corresponds to a regression
sample matrix of variances and covariances for the esti- model of the form
mated effect sizes.
Note that there is considerable variation (over schools) d  U  , (19.32)
in the estimated effect sizes for both mathematics and
reading comprehension in table 19.6. Using the test sta- where
tistic Q for testing the null hypothesis 19.28 of homoge-
d  (0.976, 0.562, 0.880, 0.202, . . . . , 0.011, 0.324)
neous population effect sizes for both mathematics and
reading comprehension, we find that Q  19.62. This is the fourteen-dimensional column vector of mathematics
value is the 92.5th percentile of the chi-squared distribu- and reading comprehension effect sizes (in that order) for
tion with p(k1)  2(71)  12 degrees of freedom (the schools 1 through 7,   ( 1, math, 1, reading, 2, math, 2, reading),
null hypothesis distribution of Q). This result casts doubt and  has approximately a multivariate normal distribu-
on the model of homogeneity of effect sizes. One might tion with mean 0 and (estimated) block diagonal vari-
look for explanations of the differences in effect sizes. It ance-covariance matrix
is the conjectured explanations of heterogeneity of effect
sizes, supported by, or at least not contradicting, the data
and pointing to directions for future study, that are often 1 L 0
the greatest contribution of a meta-analysis to scientific   M O M Diag(
1 ,K,
7 ),

research. 0 M 7

19.4.2 Regression Models and whose diagonal 2  2 blocks are given in table 6. For ex-
Confidence Intervals ample,
Suppose it is known that schools 1 through 4 and schools

0.0874 0.0471
5 through 7 in the example of section 19.4.1 draw stu- 3 0.0471 0.0822
dents from different strata of the population, or that they
used different methods for applying the treatment (the
drill). It might then be conjectured that effect sizes for is the estimated variance-covariance matrix of the esti-
mathematics and reading comprehension are homoge- mated effect sizes from school 3 and can be found in the
neous within each group of schools, but differ between last three columns of the row for school 3 in table 19.6.
372 SPECIAL STATISTICAL ISSUES AND PROBLEMS

The design matrix U is Simultaneous 95 percent confidence intervals for all lin-
ear combinations
I 0
I 0 a  a1 1, math  a2 1, reading  a3 2, math  a4 2, reading

I 0
of the components of  are given by
U  I 0 , _____ _____
0 I a  4:0.95
2
aCa , for every vector a.

0 I
Here, 4:0.95
2 is the 95th percentile of the chi-squared
0 I distribution with 4 degrees of freedom and equals 9.488.
Another possible model to explain the heterogeneity of
where I is the two-dimensional identity matrix and 0 is a effect sizes observable in table 19.6 would relate the ef-
2  2 matrix of zeroes. Note that U is a 14  4 matrix. To fect sizes to a quantitative characteristic W of the schools
test the null hypothesis 19.32, we use the test statistic whose values for schools 1 through 7 are w1, . . . , w7 . For
example, the characteristic W might be the length of time
Q  min(d  U) 1(d  U)
spent in drill per session. A model that posits separate
1d  U
 d  1U,
(19.33) linear relationships between W and the population effect
sizes for mathematics and for reading would be
where  is the (optimal) weighted least squares estimator
d  W  , (19.36)
1U )1U
  (U 1d. (19.34)
where d and  are the same vectors appearing in equation
Under the null hypothesis 19.32, Q has an approxi- 19.32, but now the elements of the 4  1 vector  (
0, math,
mate chi-squared distribution with degrees of freedom 
1, math,
0, reading,
1, reading) are the intercepts (subscripted 0)
dimension of d  dimension of   144  10. The value and slopes (subscripted 1) of the linear relationship be-
of Q for the data in table 6 is Q  8.087, which is the tween the effect sizes and the predictor W. The 14  4
45th percentile of the chi-squared distribution with 10 de- design matrix W has the form
grees of freedom. There is thus not enough evidence in
the data to reject the model of two homogeneous groups 1 w1 0 0
of schools. 0 0 1 w1
If the individuals analyzing the effect-size data decide
1 w2 0 0
to use the model 19.32 of two homogeneous groups of
schools and then summarize their findings, they might W  0 0 1 w2 .
want to form confidence intervals for the components of M M M M
  ( 1, math, 1, reading, 2, math, 2, reading ) or for such linear com-
1 w7 0 0
binations of these components as 1, math  2, math. The 0 0
large-sample (estimated) covariance matrix of the esti- 1 w7
mator 19.34 is given by
We can fit and test the model 19.36 by the weighted least
1U)1.
C  (cij)  (U (19.35) squares methods already described for the model 19.32.
Thus, a test of the model 19.36 is based on the statistic
One can now proceed to construct confidence intervals
1(d  W )
Q  min (d  W )
using the methods described in section 19.2. For exam-
ple, an individual 95 percent confidence interval for the 1d  W
 d 1W ,
 (19.37)
common mathematics population effect size 1, math in
schools 1 through 4 is where
___
1, math 1.96 c11 ; 1W)1W
 (W 1d, (19.38)
STOCHASTICALLY DEPENDENT EFFECT SIZES 373

and the null distribution of Q is the chi-squared distribu- 1d  d 


Q  d1 1d  . . .  d  1d
1 1 2 2 2 k k k
tion with degrees of freedom equal to the difference be-  1   1  . . .  
1).

 ( (19.42)
tween the dimensions of d and of . Here, that difference 1 2 k

equals 14  4  10. Linear combinations of the elements


In the case of model 19.32, one can use formulas 19.40
of that we might want to estimate are
and 19.41 to estimate the vectors of population effect size
for each group based on the effect-size vector di and esti-
(1)
0, math  (w)
1, math, (1)
0, reading  (w)
1, reading, from that group.
mated large-sample covariance matrix  i
These estimates give the same result as 19.34. The esti-
which are the population effect sizes for drill for the math-
mated covariance matrix of such estimators is block di-
ematics and reading comprehension parts, respectively,
agonal, with the covariance matrix for the estimated ef-
of the achievement test when the characteristic W of a
fect-size vector of group 1 as the first diagonal block and
school has the value w. Simultaneous 95 percent confi-
the covariance matrix for the estimated effect-size vector
dence intervals (over all values of w) for these population
of group 2 as the other diagonal block.
effect sizes are
_____ ________________
(1)
0, math  (w)
1, math  4:0.95
2
c11  w2c22  2wc12 , 19.4.3 When Univariate Approaches Suffice
_____ ________________ Let ij be the effect size of the j-th endpoint in the i-th
(1)
0, reading  (w)
1, reading  4:0.95
2
c33  w2c44  2wc34 , study and let dij represent its estimate based on the data in
the i-th study, i  1, . . . , k; j  1, . . . , p. Further, let i;jj
where be its esti-
be the large-sample variance of dij and let  i;jj
1W)1
C  (cjj*)  (W (19.39) mate. These quantities are the same regardless of whether
j is the only treatment or endpoint measure considered in
is the (estimated) large-sample covariance matrix of and study i, or is part of a multiple-treatment or multiple-end-
we have already noted that the value of 4:0.95
2 is 9.488. point study. The temptation thus exists to ignore correla-
Comment. In the example of section 19.4.1, all end- tions between the effect-size estimates and to assume that
points are measured on all subjects. In such situations, they are independent. In this case, one can view the effect-
the general formulas in 19.29 through 19.31, and in 19.33 size estimates dij as arising from kp independent studies
through 19.35, can be greatly simplified. Let di be the col- of the treatment versus control type and use univariate
umn vector of estimated effect sizes from study i, i  1, methods to compare the corresponding effect sizes ij ei-
represents the estimated large-
2, . . . , k. Recall that  ther across studies (over i) or across treatments or end-
i
sample variance-covariance matrix of di. Then the for- points (over j).
mula 19.29 for the estimate  of the vector of common Using univariate methods to compare effect sizes ij
population effect sizes is across studies for a single treatment or single endpoint j
is a frequently used approach, and provides correct confi-
1 
  ( 1  . . .  
1)1 dence intervals and has correct stated probabilities of
1 2 k
error or non-coverage. Even so, such procedures may not
1
( d   d  . . .  
1 1
dk ), (19.40)
1 1 2 2 k be optimally accurate or efficient because they do not
make use of statistical information about the errors of es-
which can be viewed as the vector-matrix generalization timation dij  ij in the other effect sizes. As we showed in
of the optimal weighted linear combination of scalar ef- section 19.2, multivariate procedures take account of such
fect sizes (see chapter 14, this volume)that is, the ef- information and thus are generally more statistically ef-
fect-size vectors are weighted by the inverses of their vari- ficient.
ance-covariance estimates. Similarly, the formula 19.30 Examples of across-study inferences for a specified
for the estimated covariance matrix of  simplifies to treatment or endpoint j that can be validly performed
1 
1  . . . 
1)1, using univariate methods include
C  (cjj*)  ( 1 2 k (19.41)
1. tests of the null hypothesis Hoj:1j  2j  . . .  kj
and the goodness-of-fit test statistic Q for the model 19.28 of homogeneity of effect sizes over studies, and fol-
becomes low-up multiple comparisons;
374 SPECIAL STATISTICAL ISSUES AND PROBLEMS

2. tests of the null hypothesis that an assumed com- requirement is that the vector d of all estimated effect
mon effect size (over studies) is equal to 0, or con- sizes in all studies considered has an approximate multi-
fidence intervals for this common effect size; variate normal distribution with mean vector equal to the
3. if effect sizes over studies are not believed to be vector of corresponding population effect sizes  and a
homogeneous, goodness of-fit tests of alternative covariance matrix  that can be estimated accurately
enough that its estimator  can be treated as if it were
models relating such effect sizes (for the endpoint
or treatment j) to certain study characteristics (co- the known true value of . That this is possible in the
variates). special cases we considered follows from standard large-
sample statistical theory. It should be noted, however,
Even so, when more than one treatment or endpoint is that in this theory the number p* of estimated effect
being considered, the meta-analyst should be aware that sizesthat is, the dimension of dis assumed to be
any of the inferential procedures 1, 2 or 3 for one endpoint fixed and the sample sizes within studies are assumed to
or treatment j can be statistically related to the results of be large relative to p*. This type of approximation for
one of these procedures done for another endpoint or treat- meta-analysis was originally discussed in Hedges and
ment j*. Consequently, it should not be surprising to see Olkin (1985).
clusters of significant tests across treatments or endpoints. Although the difference in sample proportions dis-
In contrast, using univariate methods to compare effect cussed in section 19.2.1 is an unbiased estimator of its
sizes ij over endpoints or treatments within studies can population value, this is not the case for the other effect
lead to serious error. We saw this in section 19.2, where sizes discussed in section 19.2 or for those in sections
treating effect sizes as being independent (as is assumed 19.3 and 19.4. Similarly, again with the exception of the
in univariate methods) led to underestimates or overesti- difference in sample proportions, the estimates of vari-
mates of the variances of certain linear combinations of ances and covariances of effect sizes are not unbiased.
the effect sizes. Unbiased estimates can be found (Hedges and Olkin
1985). However, for the theory underlying the regres-
19.5 CONCLUSION sion methodology given in this chapter to hold, it would
be more important to modify effect-size estimates and
The main goal of this chapter is to introduce two common the estimates of their variances and covariances to im-
situations in meta-analysis in which several population ef- prove the normal distributional approximation. Some
fect sizes are estimated in each study, leading to statistical univariate studies on the goodness of the normality ap-
dependence among the effect size estimates. In one case, proximation for various transformations of standardized
several treatments may be compared to a common control differences of sample means are given in Hedges and
in each study (multiple treatment studies). The depen- Olkin (1985). These may give insights for the multivari-
dence arises because each estimated effect size in a given ate case.
study contains a contribution from the same control. In As illustrated in sections 19.2 and 19.3, one of the ad-
the other case (multiple endpoint studies), a treatment and vantages of the regression approach is that it enables the
a control may be compared on multiple endpoints. Here meta-analyst to treat the often-encountered situation in
the dependence arises because of the dependence of the which not all population effect sizes of interest are ob-
endpoints when measured on the same subject. In these served in all studies. This meant in sections 19.2 and 19.3
cases, and in general whenever there is statistical depen- that not all treatments were observed in all studies. In sec-
dency among estimated effect sizes, ignoring such depen- tion 19.4, it meant that not all endpoints were measured
dencies can lead to biases or lack of efficiency in statisti- in all studies.
cal inferences, as illustrated in section 19.2.1. The applicability of regression models of the form
One methodology that can be applied to account for d  X  , where  has an (approximate) normal distri-
statistical dependency in effect sizes was illustrated in bution with (approximated, estimated) covariance ma-
sections 19.2 through 19.4namely, the use of a regres- trix , extends beyond endpoints based on proportions
sion model and weighted least squares estimation of its or standardized mean differences as treated in sections
parameters, followed by tests of fit of the model (or of 19.2 through 19.4. For example, this approach can be ap-
submodels) and construction of individual or simultane- plied to correlations  between behavioral measures X
ous confidence intervals for these parameters. The main and Y compared under no treatment (control) and under a
STOCHASTICALLY DEPENDENT EFFECT SIZES 375

variety of interventions (multiple treatments) when the If we are interested in comparing correlations between
correlations are obtained from independent samples of a response Y and various covariates X1, . . . , Xp measured
subjects. In this case, one possible effect size is the differ- on the same individual under a treatment and a control (as
ence dT  rT  rC between the sample correlation rC from in multiple endpoint studies), we can determine large-
the control context and the sample correlation rT under a sample variances and covariances of the estimated effect
treatment context. Assume that X and Y are jointly normal sizes dT  rT  rC . However, constructing estimates of the
and that sample sizes are greater than fifteen, so that dT large-sample covariance matrix  of the vector d of esti-
has approximately a normal distribution with (estimated) mated effect sizes in this situation is more difficult. It is
variance especially so if studies neglect to report sample correla-
tions between the covariates, because nave plug-in esti-
(1 r 2)2 (1 r 2)2
vr(dT)  _______
nT  nC
T _______
C .
mates of this matrix need not be positive definite.
Still another example of the application of the regres-
Further, the estimated effect sizes dT and dT for treat- sion methodology involves multiple endpoint studies
1 2

ments T1 and T2 within the same study have (estimated) where the responses are categorical (qualitative). Here,
covariance two multinomial distributions are comparedone under
the control and the other under the treatment. If the effect
(1 r 2 )2 sizes (differences in proportions between control and
cv(dT1, dT 2)  _______
C
nC .
treatment for the various categories) equal 0, then this is
Putting all of the estimated effect sizes into a column equivalent to saying that the response is independent of
vector d and using the cited information (plus the inde- the treatment assignment. The original test for this hy-
pendence of the studies) to construct the (estimated) co- pothesis was the classic chi-squared test of independence.
variance matrix  of d, now puts the meta-analyst in a It can be applied to situations that are the combination of
position to estimate, compare, and test models for the multiple treatment and multiple endpoint studies, and si-
population effect sizes. As noted, this can be done even if multaneous confidence intervals can be given for the ef-
not all treatments are contained in every study. fect sizes that are compatible with the test of independence
An alternative metric for correlations is Fishers z- (Goodman 1964). The test in its classical form. however,
transformation, can be applied only to cases where every study involves
every treatment. When this is not the case, the methodol-
ogy given in this chapter is still applicable and provides
z  (1/2)log((1 r) / (1 r)),
similar inference (see Trikalinos and Olkin 2007).
In sections 19.2 and 19.4, we exclusively discussed
whose large sample distribution is normal with mean
situations where the considered studies are assumed to be
 (1/2)log((1 )/(1 )) and variance 1/n, where n is
statistically independent. However, more than one study
the sample size. If we are in the independent correlations
that a meta-analyst includes can involve the same group
context of the previous paragraph, but now define the
of subjects, or can come from the same laboratories (in-
estimated effect size for treatment context T as the dif-
vestigators). This creates the possibility of statistical de-
ference
pendencies between estimated effect sizes across studies.
If good estimates of the covariances can be obtained, the
dT  zT  zC  (1/2)log((1 rT )/(1 rT ))
regression methodology illustrated in sections 19.2
 (1/2)log((1 rC )/(1 rC )) through 19.4 can be applied.
Finally, a word should be said about computation. If
between the z-transformations of the correlations ob- values of the vector d of estimated effect sizes, the design
tained under context T and the control context C, then the matrix X of the regression model, and the covariance
large-sample distribution of dT is normal with mean T  matrix  are input by the meta-analyst, many computer
T  C and variance var(dT)  1/nT 1/nC , and the large- software programs for regression (for example, SAS,
sample covariance between two such estimated effect SPSS, Stata) can find weighted least squares estimates of
sizes dT and dT is cov(dT , dT )  1/nC. These results, as
1 2 1 2
the parameter vector  of the regression model d  X  
before, allow us to make inferences concerning the popu- and will provide the value of the goodness-of-fit statistic
lation effect sizes. Q  min(d  X)  1(d  X) for that model (usually
376 SPECIAL STATISTICAL ISSUES AND PROBLEMS

under the label sum of squares error). Some will also pro- Health Policy, edited by Donald Berry and Darlene Stangl.
vide individual or simultaneous confidence intervals for New York: Marcel Dekker.
(linear combinations of) the elements of . The calcula- Goodman, Leo A. 1964. Simultaneous Confidence Inter-
tions given in this chapter, however, were done using only vals for Contrasts Among Multinomial Populations. An-
the vector-matrix computation capabilities of the statisti- nals of Mathematical Statistics 35(2): 71625.
cal software Minitab. One could also use Matlab, its Hedges, Larry V., and Ingram Olkin. 1985. Statistical Mo-
open-source equivalent Octave, or any other software ca- dels for Meta-Analysis. New York: Academic Press.
pable of doing vector-matrix computations. Difficulties Hochberg, Yoav, and Ajit C. Tamhane. 1987. Multiple Com-
in using any such computer software may arise when the parison Procedures. New York: John Wiley & Sons.
total number of effect sizes under consideration is large Hsu, Jason C. 1996. Multiple Comparisons. Theory and
(over 200), primarily due to the need for inversion of the Methods. New York: Chapman and Hall.
covariance matrix . It is thus worth noting that when Little, Richard J.A., and Donald B. Rubin. 1987. Statistical
the studies under consideration are independent,  is Analysis With Missing Data. New York: John Wiley & Sons.
block diagonal, and the inverse of this matrix can be Little, Richard J.A., and Nathaniel Schenker. 1994. Missing
found using inverses of its diagonal blocks, thus greatly Data. In Handbook for Statistical Modeling in the Social
reducing the dimensionality of the computation. and Behavioral Sciences, edited by Gerhard Arminger,
Clifford C. Clogg, and Mike E. Sobel. New York: Plenum.
19.6 ACKNOWLEDGMENT Miller, Rupert G. Jr. 1981. Simultaneous Statistical Infer-
ence, 2d ed. New York: Springer-Verlag.
The second author was supported in part by the National Powers, Donald E., and Donald A. Rock. 1998. Effects of
Science Foundation and the Evidence-Based Medicine Coaching on SAT I: Reasoning Scores. College Board Re-
Center of Excellence, Pfizer, Inc. search Report 986. New York: College Board.
. 1999. Effects of Coaching on SAT I: Reasoning
19.7 REFERENCES Test Scores. Journal of Educational Measurement 36(2):
93111.
Becker, Betsy J. 1990. Coaching for the Scholastic Apti- Raudenbush, Stephen W., Betsy J. Becker, and Hripsime A.
tude Test: Further synthesis and appraisal. Review of Edu- Kalaian. 1988. Modeling multivariate effect sizes. Psy-
cational Research 60(3): 373417. chological Bulletin 103(11): 11120.
Gleser, Leon J., and Ingram Olkin. 1994. Stochastically Timm, Neil H. 1999. Testing Multivariate Effect Sizes in
Dependent Effect Sizes. In The Handbook of Research Multiple-Endpoint Studies. Multivariate Behavioral Re-
Synthesis, edited by Harris Cooper and Larrry V. Hedges. search 34(4): 45765.
New York: Russell Sage Foundation. Trikalinos, Thomas A., and Ingram Olkin. 2008. A Method
. 2000. Meta-Analysis for 22 Tables with Multi- for the Meta-Analysis of Mutually Exclusive Outcomes.
ple Treatment Groups. In Meta-Analysis in Medicine and Statistics in Medicine 27(21): 42794300.
20
MODEL-BASED META-ANALYSIS

BETSY JANE BECKER


Florida State University

CONTENTS
20.1 What Are Model-Based Research Syntheses? 378
20.1.1 Model-Driven Meta-Analysis and Linked Meta-Analysis 378
20.1.2 Average Correlation Matrices as a Basis for Inference 379
20.1.3 The Standardized Multiple-Regression Model 379
20.1.4 Slope Coefficients, Standard Errors, and Tests 380
20.2 What We Can Learn from Models, But Not from Bivariate Meta-Analyses 380
20.2.1 Partial Effects 380
20.2.2 Indirect Effects 381
20.2.2.1 Mediator Effects 382
20.2.3 Model Comparisons 383
20.2.4 We Can Learn What Has Not Been Studied 383
20.2.5 Limitations of Linked Meta-Analysis 384
20.3 How to Conduct a Model-Based Synthesis 384
20.3.1 Problem Formulation 384
20.3.2 Data Collection 384
20.3.3 Data Management 385
20.3.4 Analysis and Interpretation 387
20.3.4.1 Distribution of the Correlation Vector r 387
20.3.4.2 Mean Correlation Matrix Under Fixed Effects 388
20.3.4.3 Test of Homogeneity, with H01: 1  . . .  k or 1  . . .  k 389
20.3.4.4 Estimating Between-Studies Variation 390
20.3.4.5 Random-Effects Mean Correlation 390
20.3.4.6 Test of No Association, with H02:  0 391
20.3.4.7 Estimating Linear Models 391
20.3.4.8 Moderator Analyses 392
20.4 Summary and Future Possibilities 393
20.5 Acknowledgments 394
20.6 References 394
377
378 SPECIAL STATISTICAL ISSUES AND PROBLEMS

20.1 WHAT ARE MODEL-BASED RESEARCH X1 X2 X3


SYNTHESES?
In this chapter I introduce model-based meta-analysis
and update what is known about research syntheses aimed
at examining models and questions more complex than
those addressed in typical bivariate meta-analyses. I begin Y
with the idea of model-based (or model-driven) meta- Sport performance
analysis and the related concept of linked meta-analysis,
and describe the benefits (and limitations) of model-based Figure 20.1 Prediction of Sport Performance
meta-analysis. In discussing how to do a model-based source: Authors compilation.
meta-analysis, I pay particular attention to the practical
concerns that differ from those relevant to a typical bi-
variate meta-analysis. To close, I provide a brief summary
X1 X2
of research that has been done on methods for model-
driven meta-analysis since the first edition of this volume
was published.

20.1.1 Model Driven Meta-Analysis


and Linked Meta-Analysis
X3
The term meta-analysis was coined by Gene Glass to
mean the analysis of analyses (1976). Originally, Glass
applied the idea to sets of results arising from series of
independent studies that had examined the same (or very
Y
similar) research questions. Many meta-analyses have
looked at relatively straightforward questions, such as Sport performance
whether a particular treatment or intervention has posi-
tive effects, or whether a particular predictor relates to an Figure 20.2 X3 Plays a Mediating Role in Sport Performance
outcome. However, linked or model-driven meta-analysis source: Authors compilation.
techniques can be used to analyze more complex chains
of events such as the prediction of behaviors based on
sets of precursor variables. well as directly. We can also examine a model like the one
Suppose a synthesist wanted to draw on the accumu- shown in figure 20.2 using model-based meta-analysis.
lated literature to understand the prediction of sport per- The ideas I cover in this chapter have been called by
formance from three aspects of anxiety. For now, let us two different nameslinked meta-analysis (Lipsey 1997)
refer to the anxiety measures as X1 through X3 and the and model-driven meta-analysis (Becker 2001). The esti-
outcome as Y. The synthesist might want to use meta- mation and data-analysis methods used have been vari-
analysis to estimate a model such as the one shown in ously referred to as estimating linear models (Becker
figure 20.1. I will use the term model to mean a set of 1992b, 1995), meta-analytic structural equation model-
postulated interrelationships among constructs or vari- ing, called MASEM by Mike Cheung and Wai Chan
ables (Becker and Schram 1994, 358). This model might (2005) and MA-SEM by Carolyn Furlow and Natasha
be represented in a single study as Y j  b0  b1 X1j  b2 Beretvas (2005), and two-stage structural equation mod-
X2j  b3 X3j for person j. Model-based meta-analysis pro- eling (TSSEM) (Cheung and Chan 2005).
vides one way of summarizing studies that could inform The term model-driven meta-analysis has been used to
us about this model. refer to meta-analyses designed to examine a particular
Suppose now that the synthesist thinks that the Xs may theoretical model. Although linked meta-analysis can
also influence each other, and wants to examine whether also be used to address models, in its simplest form the
X1 and X2 affect Y by way of the mediating effect of X3, as term simply means to link a set of meta-analyses on a
MODEL-BASED META-ANALYSIS 379

sequence of relationships. Linked or model-driven meta- tion a test of the appropriateness of the fixed-effects model
analyses go further than traditional meta-analyses by ask- for the entire correlation matrix can be computed (see, for
ing about chains of connections among predictors and example, Becker 2000).
outcomes (does A lead to B and B lead to C?). Linked Let us again consider the example above with three
meta-analyses have been used to look at relationships predictors of sport performance. The first step in the
over time (for example, Mark Lipsey and James Derzons model-based meta-analysis would be to gather estimates
1998 work on the prediction of serious delinquency), and of the correlations in the matrix R,
to ask about relationships of a set of predictors to an out-
come or set of outcomes (do A and B lead to C?). Early 1 r12 r13 r1Y
applications of the latter sort include my examination of r 1 r23 r2Y
science achievement outcomes and Sharon Brown and R  12 ,
r13 r23 1 r3Y
Larry Hedges analysis of metabolic control in diabetics
(Becker 1992a; Brown and Hedges 1994). Linked meta- r1Y r2Y r3Y 1
analysis or model-driven meta-analysis is therefore more
extensive and more complicated than a typical meta-anal- where rab is the correlation between Xa and Xb for a and
ysis, but can yield several benefits that go beyond those of b  1 to 3, and raY is the correlation of Xa with Y, for a  1
a more ordinary synthesis. Of course, with the added com- to 3. I use the subscript i to represent the ith study, so Ri
plexity come additional caveats; they will be addressed represents the correlation matrix from study i. The index
below. p represents the total number of variables in the matrix
(here p  4). Therefore any study may have up to p* 
p(p 1)/2 unique correlations. For this example p*  6.
20.1.2 Average Correlation Matrices as a Basis
Thus each studys matrix Ri can also contain up to mi  p*
for Inference
unique elements. Ri would contain mi  p* elements if
When multiple studies have contributed information about one or more variables are not examined in study i or if
a relationship between variables, the statistical tech- some correlations are not reported. Also, the rows and
niques of meta-analysis allow estimation of the strength columns of Ri can be arranged in any order, though it is
of the relationship between variables (on average) and the often convenient to place the most important or ultimate
amount of variation across studies in relationship strength, outcome in either the first or last position. In the example
often associated with the random-effects model (Hedges below, the correlations involving the sport performance
and Vevea 1998). outcome appear in the first row and column. From the
For a linked or model-driven meta-analysis, an average collection of Ri matrices the synthesist __ would then esti-
correlation matrix among all relevant variables is esti- mate an average correlation matrix, say R.
mated under either a fixed-effects model where all studies I will give more detail on how this average can be ob-
are presumed to arise from one population, or a random- tained, but for now I introduce one more itemthe cor-
effects model, which presumes the existence of a popula- relation vector. Suppose we list the unique elements of
tion of study effects. Ideally, of course, one would like to the matrix R in a vector, ri. Then ri  (ri12, ri13, ri1Y, ri23, ri2Y,
compute this average matrix from many large, represen- ri3Y) contains the same information as is shown in the
tative studies that include all the variables of interest, display of the square matrix Ri above. On occasion it may
measured with a high degree of specificity. In reality, the be useful to refer to the entries (ri12, ri13, ri1Y, ri23, ri2Y, ri3Y)
synthesist will likely be faced with a group of diverse as ri1 through rip*, or more generally as ri, for  1 to p*,
studies examining different subsets of variables relating in which case the elements of ri are indexed by just two
proposed predictors to the outcome or outcomes of inter- subscripts, and in our example ri  (ri1, ri2, ri3, ri4, ri5, ri6).
est. For this reason it is likely that the random-effects
model will apply. In the random-effects case, estimates of 20.1.3 The Standardized Multiple-Regression
variation across studies are incorporated into the uncer- Model
tainty of the mean correlations, allowing the estimates of
the average correlations to be generalized to a broader For
__ now let us assume the synthesist has an estimate of
range of situations, but also adding to their uncertainty R. The next step in model-driven meta-analysis is to use
(and widening associated confidence intervals). In addi- the matrix of average correlations to estimate an equation
380 SPECIAL STATISTICAL ISSUES AND PROBLEMS

or a series of regression equations relating the predictors covariation between slopes in making comparisons of the
X1 through X3 to Y. These equations provide a picture of relative importance of slopes.
the relationships among the variables that takes their co- Finally, comparisons of slopes from models computed
variation into account (Becker 2000; Becker and Schram for different samples (from different sets of studies) are
1994) and this is where the possibility of examining par- available. Thus it is possible to compare, say, models for
tial relations, indirect effects, and mediating effects is males and females or for other subgroups of participants
manifest. for example, old versus young participants, or experts ver-
If the synthesist wanted to examine the model shown sus novices.
in figure 20.2, two regression equations would be ob-
tained. One relates X1 and X2 to X3, and the other relates 20.2 WHAT WE CAN LEARN FROM MODELS,
X1 through X3 to Y. Because the primary studies may use BUT NOT FROM BIVARIATE META-ANALYSES
different measures of these four variables, and those mea-
sures may be on different scales, it may not be possible Unlike typical bivariate meta-analyses, linked or model-
to estimate raw regression equations or to directly com- driven meta-analyses address how sets of predictors re-
bine regressions of X3 or Y on the relevant predictors. late to an outcome (or outcomes) of interest. In a linked
(Later I briefly discuss issues related to the synthesis of meta-analysis we may examine partial relations, because
regression equations.) However, the synthesist can esti- we will model or statistically control for additional vari-
mate a standardized regression __ model or path model ables, and it is also possible to examine indirect relations
using the estimated mean matrix R (and its variance co- and mediator effects. I provide examples of how each of
variance matrix). these sorts of effects has been examined in existing linked
Recent promising work by Soyeon Ahn has also begun or model-driven meta-analyses.
to examine the possibility of using meta-analytic correla-
tion data to estimate latent variable models (Ahn 2007; 20.2.1 Partial Effects
Thum and Ahn 2007). Adam Hafdahl has also examined
factor analytic models based on mean correlations (2001). Most meta-analyses of associations take one predictor at
In this chapter, however, I focus only on path models a time, and ask if the predictor X relates to the outcome Y.
among observed variables. In the real world, however, things are usually not so sim-
ple. Consider the issue of the effectiveness of alternate
20.1.4 Slope Coefficients, Standard Errors, routes to teacher certification. In many studies, this is
and Tests addressed by a simple comparison: teachers trained in
traditional teacher-education programs are compared on
As with any standardized regression equation in primary some outcome to teachers who have gone through an al-
research or in the context of structural equation model- ternate certification program. An example of such a de-
ing, the model-based estimation procedure produces a sign is found in a study by John Beery (1962). However,
set of standardized slopes and standard errors for those prospective teachers who seek alternate certification are
slopes. Individual slopes can be examined to determine often older than those going though the traditional four-
whether each differs from zero (that is, the synthesist can year college programs in teacher education. To make a
test whether each *j  0), which indicates a statistically fair comparison of the effectiveness of alternate-routes
significant contribution of the tested predictor to the rel- programs, one should control for these anticipated differ-
evant outcome. An overall, or omnibus, test of whether all ences in teacher age. This control can be exerted through
standardized slopes equal zero is also available, thus the design variations, such as by selecting teachers of the
synthesist can test the hypothesis that *j  0 for all j. same age, or by statistical means, such as by adding a
Because standardized regression models express pre- measure of teacher age to a model predicting the outcome
dictoroutcome relationships in terms of standard-devia- of interest. Studies of alternate routes have used various
tion units of change in the variables, comparisons of the design controls and also included different statistical con-
relative importance of different predictors are also possi- trol variables to partial out unwanted factors (see, for ex-
ble. As is true for ordinary least squares (OLS) regres- ample, Goldhaber and Brewer 1999).
sion slopes in primary studies, the slopes themselves Model-driven meta-analysis can incorporate such ad-
can be interrelated, and appropriate tests account for the ditional variables that need to be controlled into a more
MODEL-BASED META-ANALYSIS 381

controlling for other CSAI components, was weak, and


Cognitive Somatic just barely significant at the traditional .05 level, with
(X1) (X2) z  1.961, p  .05. Thus when one controls for the role of
.25* .36* self confidence, other aspects of anxiety appear to play a
role in performance.

Self
confidence 20.2.2 Indirect Effects
(X3)
A second major benefit of the use of model-driven or
.13* .33* .06* linked meta-analyses is the ability to examine indirect ef-
fects. In some situations, predictors may have an impact
on an outcome through the force of an intervening or me-
diating variable. Such a predictor might have no direct
(Y ) effect on the outcome, but have influence only through the
third mediating factor. The analysis of indirect relation-
Sport performance ships is a key aspect of primary-study analyses using struc-
tural equation models (see, for example, Kaplan 2000). In
Figure 20.3 Prediction of Sport Performance from Compo- many areas of psychology and education, complex mod-
nents of Anxiety els have been posited for a variety of outcomes. It makes
source: Authors compilation. sense that synthesist might want to consider such models
in the context of meta-analysis.
Model-driven meta-analyses can provide this same
complex model than is possible in a traditional bivariate benefit by examining indirect relationships among the
meta-analysis. A specific example from an application in predictors outlined in theoretical models of the outcome
sport psychology shows this benefit. Lynette Craft and of interest. An example comes from a model-driven syn-
her colleagues examined the role of a three-part measure thesis of the predictors of child outcomes in divorcing
of anxiety in the prediction of sport performance (2003). families (Whiteside and Becker 2000). An important con-
The Competitive Sport Anxiety Index (CSAI) includes sideration in child custody decisions when parents divorce
measures of cognitive aspects of anxiety, denoted as X1, is the extent of parental visitation for the noncustodial
and somatic anxiety, or physical symptoms, denoted as parent. Often when the mother is granted custody, a deci-
X2, as well as a measure of self confidence, denoted as X3 sion must be made about the extent of father visitation
(Martens, Vealey, and Burton 1990). An important ques- and contact with the child. Curiously and counterintui-
tion concerns whether these three factors contribute sig- tively, narrative syntheses of the influence of the extent of
nificantly to sport performance, as shown in figure 20.3. father visitation on child adjustment outcomes showed
Estimates of the bivariate correlations of the three only a weak relationship to child outcomes such as inter-
aspects of anxiety to performance show mixed results. nalizing and externalizing behaviors (see, for example,
Although self confidence correlates
_
significantly with Hodges 1991).
performance across studies (r.3Y  .21, p  .001), the other Indeed when Mary Whiteside and I examined the bi-
anxiety subscores showed virtually
_
no zero-order
_
relation variate associations of measures of father-child contact
to performance, on average (r.1Y  .01, r.2Y  .03, both with adjustment outcomes in their meta-analysis, only
not significant). By modeling all three components to- the quality of the father-child relationship showed a sig-
gether, as shown in figure 20.3, Craft and her colleagues nificant association with child outcome variables. Spe-
found that though the bivariate relationships of cognitive cifically, a good father-child relationship was related to
and somatic anxiety to performance were not significant, higher levels of cognitive skills for the child (a good out-
when the three components were considered together in a come), and to lower scores on externalizing symptoms
regression model, all the subscales showed overall rela- and internalizing symptoms (also sensible, because high
tions to sport performance, even under a random-effects internalizing and externalizing symptoms reflect prob-
model. The effect of somatic anxiety on performance, lems for the child). Measures of father visitation, and
382 SPECIAL STATISTICAL ISSUES AND PROBLEMS

Preseparation Father Involvement

b  .22 p  .001 (k  2)

b
.33
b  .15 p  .05 (k 5)

p
Father-Child Relationship Quality Internalizing

.000
Symptoms
3) 1 (k b .29 p .0001 (k  4) Externalizing
 2

b
(k Symptoms
.06
)
Frequency of Father Visitation

.09
p 2
)
.18 (k Social Skills

p
.01

b  .17 p  .01 (k 1)


 p

b

.15
b 4
.2

.15
b

(k 
p
b .46 p
Hostility  .0001 (k

4)
.03
 2)

(k
b


.

3)
42
b

p 2) Cooperation
(k 
.13

.00
1( .13
p
p

k 1
1) .2
.4

b  .05 (k  2)
(k

b .28 p Total Behavior



2)

Maternal Warmth Problems


2)
(k 
b  .16 p.08 (k  1) 01
p .
.32
b
Maternal Depressive Symptoms

Figure 20.4 Model of Child Outcomes in Divorcing Families


source: Whiteside and Becker 2000.

pre-separation and current levels of father involvement, and who had good relationships with their fathers, also
showed nonsignificant relationships with child outcomes. showed more positive outcomes. These results would not
However, when the father-child contact variables were have been found had the indirect relationships specified
examined in the context of a more realistic multivariate in the model-driven synthesis not been examined.
model for child outcomes, Whiteside and I found that ex- 20.2.2.1 Mediator Effects Because it has the poten-
tent of father contact showed a significant indirect rela- tial to model indirect effects, model-driven meta-analysis
tionship, through the variable of father-child relationship also allows for mediating effects to be tested. One exam-
quality. Figure 20.4 shows the empirical model. Both pre- ple of an application where the roles of mediators were
separation father involvement and extent of father visitation showed positive indirect impacts on child outcomes. Whiteside and I interpreted this finding to mean that when a child has a good relationship with their father, more father contact has a good influence on psychological outcomes. In addition, children with higher levels of involvement with their father before parental separation,

critical concerns the effects of risk and protective factors for substance abuse (Collins, Johnson, and Becker 2007). In this synthesis of community-based interventions for substance-abuse prevention, David Collins, Knowlton Johnson, and I were interested in the question of whether risk and protective factors, such as friends' use of drugs and parental views of drug use, mediated the impact of preventive interventions on a variety of substance-abuse outcomes. This was accomplished by using a variation of the approach proposed by Charles Judd and David Kenny (1981). One could also test for mediation in other ways, using variations on the approaches described by David MacKinnon and his colleagues (2002).

Collins and his colleagues examined three models for each of six prevalence outcomes. The first were intervention effect models that had prevalence of substance abuse as the outcome, and used as predictors three intervention variables (intervention, year, and their interaction) plus covariates such as gender and ethnicity. In these models, a negative slope for the intervention-by-year interaction indicated that the treatment was associated with a reduction in the prevalence of substance abuse between the baseline year and the post-intervention year. If a prevalence outcome showed a significant negative slope for the intervention-by-year interaction in the first model, the outcome had the potential to show a mediating effect of the risk and protective factors.

The second set of models had as outcomes each of the twelve risk and protective factors, with the same predictors as previously mentioned. The authors identified the specific risk and protective factors that could be mediating the intervention effects by finding those with significant treatment effects for the intervention.

Finally, a third set of mediation models again had the prevalence behaviors as outcomes, but included not only the intervention variables and covariates, but also all the risk and protective factors viewed as outcomes in the second set of models. The mediating effect of a risk or protective factor was apparent if a significant intervention-by-year interaction (from the first set of models) became nonsignificant when the risk and protective factors were added as predictors in the mediation models. Mediating effects were found for three outcomes: sustained cigarette use, alcohol use, and binge drinking. The two risk factors identified as potential mediators were friends' drug use and perceived availability of drugs. Again, in this example, mediating effects could not have been analyzed in a typical meta-analytic examination of the pairwise relationships among risk and protective factors and the outcomes.
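
To make the logic of this three-model mediation check concrete, the following sketch shows one hedged way the comparison could be coded. It is a minimal numpy illustration of the pattern, not the Collins et al. analysis itself: the simulated data, variable names, and the use of ordinary least squares are all assumptions introduced for the example.

```python
import numpy as np

def ols(X, y):
    """Least-squares fit; returns coefficients and their standard errors."""
    X = np.column_stack([np.ones(len(y)), X])        # add an intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    return beta, se

# Hypothetical data: treatment, year, one risk factor, and prevalence.
rng = np.random.default_rng(0)
n = 400
treat = rng.integers(0, 2, n)
year = rng.integers(0, 2, n)                # 0 = baseline, 1 = post
risk = 0.5 - 0.3 * treat * year + rng.normal(0, 1, n)
prev = 0.4 * risk + rng.normal(0, 1, n)
inter = treat * year                        # intervention-by-year term

# Model 1: intervention effect on prevalence.
b1, se1 = ols(np.column_stack([treat, year, inter]), prev)
# Model 2: intervention effect on the candidate mediator.
b2, se2 = ols(np.column_stack([treat, year, inter]), risk)
# Model 3: prevalence on the intervention terms plus the mediator.
b3, se3 = ols(np.column_stack([treat, year, inter, risk]), prev)

print("interaction slope, model 1:", b1[3], "z =", b1[3] / se1[3])
print("interaction slope, model 3:", b3[3], "z =", b3[3] / se3[3])
# Mediation is suggested when a significant interaction in model 1
# becomes nonsignificant once the risk factor enters (model 3).
```
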
20.2.3 Model Comparisons

Other analyses that are possible in model-based meta-analyses are comparisons of models based on moderator variables. A synthesist may want to ask, for instance, whether the same predictive relationships hold for two or more different groups. Whiteside and I wanted to know whether the same predictors were important for child outcomes for young boys and young girls in divorcing families (2000). Craft and her colleagues coded a variety of aspects of the athletes and sports examined in the studies of anxiety (2003). Team and individual sports were compared, as were elite athletes, European club athletes, college athletes, and college physical education students. Two other moderators were type of skill (open versus closed) and time of administration of the anxiety scale (in terms of minutes prior to performance).

Complications can arise when moderator variables are examined. Most common is that estimating the same models examined for the whole collection of studies may not always be possible for subsets of studies. This occurs when some variables have not been examined in the subsets of studies. When Whiteside and I selected samples where results had been reported separately by gender, we found that several variables had not been studied for the separate-gender samples. Four correlations that had been examined for mixed-gender samples had not been studied in the separate-gender groups. This meant some of the models that were estimated for the full set of seventeen studies could not be examined for the five samples of boys and five samples of girls.

20.2.4 We Can Learn What Has Not Been Studied

A final benefit of linked meta-analysis based on a specific model is that it can enable the synthesist to identify relationships that have not been studied, or have not been studied extensively. Absence of data means that a typical bivariate synthesis must be abandoned, because connections between the pair of variables are necessary to support any analysis. In the context of more-complex linked models, however, the absence of data on a particular relationship might restrict the complexity of some analyses, but does not imply that no analyses can be done. In addition, such missing data can point out possible areas for future research.

Two examples of how model-driven linked meta-analyses help to identify gaps in the literature can be drawn from previous syntheses. In a synthesis to examine the prediction of science achievement for males and females, I wanted to examine the extent to which an elaborate model for school achievement behaviors proposed by Jacquelynne Eccles and her colleagues had been examined by others in the context of science achievement (Becker 1992a; model in Eccles-Parsons 1984). Admittedly, the literature I gathered had not aimed to examine this model, but it still seemed reasonable to ask whether evidence was available in the literature to support or refute the model. I found that most of the connections between the variables specified in Eccles' model had not been investigated at all, yet other connections, such as the relationship between aptitudes and science achievement, were well studied. In the synthesis of prediction of child outcomes in divorcing families, Whiteside and I found no correlations for maternal restrictiveness with two other predictors: restrictiveness had not been studied with coparent hostility or with maternal depressive symptoms (2000). For this reason, the two pairs of predictors (restrictiveness and hostility, and restrictiveness and maternal depression) could not be used together to predict any outcome.

Even if all relationships have been studied, a model-driven meta-analysis can identify when some relationships are better understood than others. In the analysis of Craft and colleagues, each relationship between an aspect of anxiety and sport performance was represented by at least fifty-five correlation values. However, the intercorrelations between the subscales of the CSAI were not reported in all of the studies examined. These relationships were observed between twenty-two and twenty-five times, depending on the pair of subscales examined.

20.2.5 Limitations of Linked Meta-Analysis

As does any statistical analysis technique, model-driven or linked meta-analysis has its limitations. Of particular concern is the issue of missing data. When a particular relationship has not been studied, one cannot even estimate the degree of relatedness between the variables of interest. The positive aspect is that this information may provide future directions for research. Missing data can also, however, lead to analytical problems. Fortunately, some advances have been made in dealing with missing data in the context of correlation matrices (Furlow and Beretvas 2005); these will be discussed briefly in the next section of the chapter.

A second concern is that the results of a model-driven meta-analysis may be fractionated and not apply well to any particular population. In theory, the correlations relevant to one relationship can arise from a completely different set of studies than those that provide results for other relationships, though this is rare. Thus the connection between variables A and C can be based on different samples (or populations) from the samples examining variables B and C. This is a particular concern for relations based on only a few studies. Using random-effects models and testing for the consistency of results across studies can help address this concern by providing estimates that apply broadly and by incorporating uncertainty to account for between-studies variation in the strength of relationships under study.

20.3 HOW TO CONDUCT A MODEL-BASED SYNTHESIS

Many of the tasks required in conducting a model-driven meta-analysis are essentially the same as those required in doing a typical meta-analysis. Because Schram and I went into great detail about some of these tasks, I do not examine all of those ideas here (Becker and Schram 1994).

20.3.1 Problem Formulation

Problem formulation begins the process, and here the synthesist often begins with a model like the ones just shown, with key model components specified. These components can be used in computerized database searches. In a study of sport performance, for example, a particular instrument, the Competitive State Anxiety Index, had been used to study anxiety, so searches used that specific name to locate studies. Most studies presented correlations of each subscore with performance, which was the main focus of the simple model in figure 20.3. In more complex problems, like the study of impacts on children of divorce, it is unlikely that all studies will report correlations among all variables. In fact, none of the studies Whiteside and I gathered reported on all of the correlations among the fourteen variables studied (eleven of which are shown in figure 20.4). Whiteside and I did set inclusion rules (2000); one of these was that every study in the synthesis must have reported a correlation involving some outcome with either father-child contact or the coparenting relationship (measured as hostility or cooperation). These two variables were chosen because they can be influenced by the courts in divorce custody decisions.

20.3.2 Data Collection

For a linked or model-driven meta-analysis, an average correlation matrix is estimated; eventually a regression model or path model of relationships among the variables is estimated as well. What this means for model-driven meta-analysis is that the synthesist searches for studies presenting correlations for sets of variables, covering as many of the variables of interest as can be found. Searching strategies often involve systematic pairings of relevant terms. For example, in the meta-analysis of outcomes for young children in divorcing families, Whiteside and I required that each study include either a measure of fathers' contact with the child or a measure of the quality of interactions between the parents. Hostility or cooperation were often measured. Clearly, a measure of some child outcome also had to be available, and zero-order correlations among these and other variables had to be either reported or available from the primary study author or authors. Such search requirements often lead to intersection searches. For instance, the search parameters (hostility or cooperation) AND (anxiety or internalizing behaviors or depression or externalizing behaviors) should identify studies that would meet one of the conditions required of all studies in the Whiteside and Becker synthesis.

This does not mean, however, that only studies with correlations among all variables would be included; if this requirement were set, no one would ever be able to complete a model-driven meta-analysis. Instead, each study likely will contribute a part of the information in the correlation matrix. Some studies will examine more of the correlations than others, and some will report all desired correlations. This also leads to an increased importance of testing for consistency in the full correlation matrix across studies, and of examining the influence of studies that contribute many correlations to the analysis, for instance, by way of sensitivity analysis.

20.3.3 Data Management

Because one goal of a model-driven meta-analysis is to compute a correlation matrix, it is best to lay out the contents of that matrix early in the data-collection process so that when correlations are found in studies they can be recorded in a systematic way. When the data are analyzed, they will be read in the format of a typical data record; that is, they will appear in a line or row of a data matrix, not in the form of a square correlation matrix. It is sometimes easier, however, to record the data on a form that looks like the real matrix, and then to enter the data from that form into a file. Thus, for the study of anxiety, one might make a form such as that shown in figure 20.5, which shows below the diagonal a variable name for each correlation (these are ordered as the correlations appear in the matrix) and shows above the diagonal the values of the correlations for our study 17 (Edwards and Hardy 1996). When a correlation does not appear, the entry above the diagonal may be left blank, but it may be preferable to note its absence with a mark, such as NR for "not reported." When primary study authors have omitted values that were not significant, one could enter NS so that, later on, a count could be obtained of the omitted nonsignificant results as a crude index of the possibility of publication bias. Often, however, it is not possible to determine why a result has not been reported.

Figure 20.5  Form for Recording Values of Correlations

                     Performance (Y)  Cognitive anxiety  Somatic anxiety  Self confidence
Performance (Y)           1.0               .10               .31              .17
Cognitive anxiety         C1                1.0               NR               NR
Somatic anxiety           C2                C4                1.0              NR
Self confidence           C3                C5                C6               1.0

source: Author's compilation.

In the case shown in figure 20.5, only the correlations of the anxiety scales with the outcome were reported. This is a fairly common pattern for unreported data. Also, for practical reasons, this form is arranged so the matrix has the correlations involving the performance outcome in the first row. These are the most frequently reported correlations and, because they are also the first three values, can easily be typed first when creating the data file, to be followed by three missing-data codes if the predictor correlations are not reported.

The one other variable the synthesist must record is the sample size. When primary studies have used pairwise deletion, the rs for different relations in a matrix may be based on different sample sizes. In such cases, I generally record the smallest sample size shown, as a conservative choice of the single value to use for all relationships in that study. In most meta-analyses, the synthesist will also record values of other coded variables that are of interest as moderators or simply as descriptors of the primary studies. There are no particular requirements for those additional variables, and each relationship in a matrix or each variable may have its own moderator variables. For example, type of performance outcome might be relevant mainly to the correlations involving the X-Y relations in our example.

Thus for each case the data will include, at a minimum, a study identification code, the sample size, and up to p* (here, six) correlations. For the example, I have also included a variable representing whether the sport behavior examined was a team (Type = 1) or individual (Type = 2) sport. Table 20.1 shows the data for ten studies drawn from the Craft et al. meta-analysis (2003). The record for a study shows a value of 9.00 when a correlation is missing. The fifth line of table 20.1 shows the data from study 17, also displayed in figure 20.5.

Table 20.1  Relationships Between CSAI Subscales and Sport Behavior

                       Variable Names and Corresponding Correlations
            Cognitive/   Somatic/     Self Confidence/  Cognitive/  Cognitive/Self  Somatic/Self
            Performance  Performance  Performance       Somatic     Confidence      Confidence
    Type of    C1           C2           C3                C4          C5              C6
ID   ni sport  ri1Y         ri2Y         ri3Y              ri12        ri13            ri23

 1  142   2    .55          .48          .66               .47         .38             .46
 3   37   2    .53          .12          .03               .52         .48             .40
 6   16   1    .44          .46         9.00               .67        9.00            9.00
10   14   2    .39          .17          .19               .21         .54             .43
17   45   2    .10          .31          .17              9.00        9.00            9.00
22  100   2    .23          .08          .51               .45         .29             .44
26   51   1    .52          .43          .16               .57         .18             .26
28  128   1    .14          .02          .13               .56         .53             .27
36   70   1    .01          .16          .42               .62         .46             .54
38   30   2    .27          .13          .15               .63         .68             .71

source: Craft et al. 2003.

Finally, one additional type of information will eventually be needed in the multivariate approach to synthesizing correlation matrices. Specifically, a set of indicator variables is needed to identify which correlations are reported in each study. Because the analysis methods rely on representations of the study data in vectors of correlations, it is perhaps simplest to describe the indicators in terms of matrices. Matrices are also used in the example. If data are stored in the form of one record per correlation, then a set of p* variables can be created to serve as indicators for the relationships under study.

Let us consider studies with IDs 3 and 6 in table 20.1. Study 3 has provided estimates of all six correlations among the CSAI subscales and a performance outcome. We can represent the results of study 3 with a 6 × 1 vector r3 and the 6 × 6 indicator matrix X3, such that r3 = X3 ρ + e3, with

r3 = (.53, .12, .03, .52, .48, .40)′,   X3 = I6 (the 6 × 6 identity matrix),   and   ρ = (ρ1, ρ2, ρ3, ρ4, ρ5, ρ6)′,

where e3 represents the deviation of the observed values in r3 from the unknown population values in ρ. Here X3 is an identity matrix because study 3 has six correlations, and the columns of X represent each of the six correlations being studied. Study 6 does not appear to have used a measure of self-confidence, however, and reports only three correlations: for the first, second, and fourth relationships in the correlation matrix. Thus the data from study 6 are represented by

r6 = (.44, .46, .67)′   and   X6 = [1 0 0 0 0 0
                                    0 1 0 0 0 0
                                    0 0 0 1 0 0].

The indicator matrix still contains six columns to allow for the six correlations we hope to summarize across all studies.
 isu itv  isv itu  (ist isu isv
20.3.4 Analysis and Interpretation  its itu itv  ius iut iuv
The synthesist typically begins a model-driven analysis  ivs ivt ivu)]/ni (20.2)
by estimating a correlation matrix across studies. The
task is to find patterns that are supported across studies for s, t, u, and v  1 to p. Typically i

and i
are esti-
and to discuss potential explanations for any important mated by substituting the corresponding sample estimates
variability in the correlations of interest across studies. for the parameters in equations 20.1 and 20.2. Also some-
Another goal is to explicitly identify missing (unstudied) times the denominators in equations 20.1 and 20.2 are
relationships or less-well-studied relationships to suggest computed as (ni  1) rather than ni.
fruitful areas for future research. Moderator analyses, In the data analysis using GLS methods, these vari-
such as comparing results for team versus individual ances and covariances must be computed. However re-
sports, may also be conducted. Before describing these search has indicated that substituting individual study
steps in the process of data analysis, I present a little ad- correlations into formulas 1 and 2 is not the best approach
ditional notation. (Becker and Fahrbach 1994; Furlow and Beretvas 2005).
388 SPECIAL STATISTICAL ISSUES AND PROBLEMS

In short, I recommend computing these variances and co- components when adopting a random-effects model, until
variances using values of the mean correlations across further research or improved software provide better esti-
studies. mators.
However, another issue is whether to use raw correla- Example. I consider the first study from the used by
tions for the data analysis or to transform them using Sir from Craft and her colleagues (2003). Study 1 has re-
Ronald Fishers z transformation (1921). Fishers trans- ported all six correlation values, shown here in square as
formation is well as in column form and transformed into the z metric,
specifically
z  0.5 log [(1  r)/(1  r)], (20.3)
1 .55 .48 .66
which is variance-stabilizing. Thus the variance of zi
is .55 1 . 47 .38
usually taken to be i

 ni1 or (ni  3)1, which does R1  or


.48 .47 1 .46
not involve the parameter i
or any function of i
. The
covariance between two z values, however, is even more .66 .38 .46 1
complicated than the covariance in 2, and is given by
.55 .618
i
 Cov(zi
, zi )  i
/[(1 i
2)(1 i 2)]. (20.4) .48 .523

.66 .793
These elements form the covariance matrix for the z val- r1  and z1 = .
ues from study i, i. Work by myself and Kyle Fahrbach .47 .510
(1994) and by Carolyn Furlow and Natasha Beretvas .38 .400
(2005) also supports the use of the Fisher transformation
.46 .497
both in terms of estimation of the mean correlation matrix,
and to improve the behavior of associated test statistics,
The covariance of r1, computed using sample-size
which are described shortly. For much of this presenta- _
weighted mean correlations (r  (.074 .127 .324
tion, I offer formulas for tests and estimates based on the
.523 .416 .414)), is
use of Fishers z transformation.
These formulas apply under the fixed-effects model, .0070 .0036 .0025 .0005 .0018 .0009
that is, when one can assume that all k correlation matri- .0036 .0068 .0025 .0002 .0008 .0017

ces arise from a single population. If that is unreason- .0025 .0025 .0056 .0001 .0000 .0003
able, either on theoretical grounds or because tests of Cov(r1 ) = ,
.0005 .0002 .0001 .0037 .0013 .0013
homogeneity are significant, then one should augment .0018 .0008 .0000 .0013 .0048 .0022
the within-study uncertainty quantified in i by the addi-
.0009 .0017 .0003 .0013 .0022 .0048
tion of between-studies variation. A number of approaches
exist for estimating between-studies variation in the uni- and the covariance of z1 computed using the same mean
variate context, but little work has been done in the multi- correlations is
variate context (see, for example, chapter 18, this volume;
Sidik and Jonkman 2005). My own practical experience, .0072 .0037 .0029 .0008 .0022 .0011
based on applications to a variety of data sets, is that esti- .0037 .0072 .0029 .0003 .0010 .0022

mating between-studies variances is relatively simple, .0029 .0029 .0072 .0001 .0001 .0004
Cov(z1 ) = .
but estimating between-studies covariance components .0008 .0003 .0001 .0072 .0022 .0023
can be problematic. For instance, particularly with small .0022 .0010 .0001 .0022 .0072 .0033
numbers of studies, one can obtain between-studies cor-
.0011 .0022 .0004 .0023 .0033 .0072
relations above 1. If one has a set of studies with com-
plete data and can use, for instance, a multivariate hierar- If a study reports fewer than p* correlations the Cov(ri)
chical modeling approach these problems may be resolved and Cov(zi) matrices will be reduced in size accordingly.
(Kalaian and Raudenbush 1996). It is rare, however, to 20.3.4.2 Mean Correlation Matrix Under Fixed Ef-
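
The following sketch implements equations 20.1 through 20.4 as I read them, evaluating the formulas at a matrix of mean correlations as recommended above. It is an illustrative numpy translation (the function and variable names are mine, not the chapter's), and it produces values close to the Cov(z1) printed above; the chapter's diagonal appears to use a slightly different denominator such as (n − 3)⁻¹.

```python
import numpy as np

def cov_r(rho, pairs, n):
    """Large-sample Cov(r_i) from Olkin and Siotani (1976), eqs. 20.1-20.2.

    rho   : p x p matrix of (mean) correlations
    pairs : list of (s, t) index pairs, one per reported correlation
    n     : within-study sample size"""
    m = len(pairs)
    C = np.empty((m, m))
    for a, (s, t) in enumerate(pairs):
        for b, (u, v) in enumerate(pairs):
            C[a, b] = (0.5 * rho[s, t] * rho[u, v]
                       * (rho[s, u]**2 + rho[s, v]**2
                          + rho[t, u]**2 + rho[t, v]**2)
                       + rho[s, u] * rho[t, v] + rho[s, v] * rho[t, u]
                       - (rho[s, t] * rho[s, u] * rho[s, v]
                          + rho[t, s] * rho[t, u] * rho[t, v]
                          + rho[u, s] * rho[u, t] * rho[u, v]
                          + rho[v, s] * rho[v, t] * rho[v, u])) / n
    return C

def cov_z(rho, pairs, n):
    """Covariance of the Fisher z values via equation 20.4."""
    C = cov_r(rho, pairs, n)
    d = np.array([1 - rho[s, t]**2 for s, t in pairs])
    return C / np.outer(d, d)
```

With (s, t) = (u, v), the general formula in cov_r reduces algebraically to (1 − ρ²)²/n, so equation 20.1 is recovered on the diagonal, and the corresponding diagonal of cov_z is 1/n, as the text indicates.
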
20.3.4.2 Mean Correlation Matrix Under Fixed Effects. If we denote the Fisher transformation of the population correlation ρℓ using the Greek letter zeta (ζℓ), we have ζℓ = 0.5 log[(1 + ρℓ)/(1 − ρℓ)]. Generalized least squares methods provide the estimator

ζ̂ = (X′Ψ⁻¹X)⁻¹ X′Ψ⁻¹ z,   (20.5)

where X is a stack of k p* × p* indicator matrices, z is the vector of z-transformed correlations, and Ψ is a blockwise diagonal matrix with the matrices Ψi = Cov(zi) on its diagonal. The variance of ζ̂ is

Var(ζ̂) = (X′Ψ⁻¹X)⁻¹.   (20.6)

The elements of ζ̂ are returned to the correlation metric via the inverse Fisher transformation to yield the estimator ρ̃ of ρ, given by ρ̃ = (ρ̃ℓ), where ρ̃ℓ = [exp(2ζ̂ℓ) − 1]/[exp(2ζ̂ℓ) + 1], for ℓ = 1 to p*. Also, because the transformation that puts the covariance matrix in equation 20.6 back on the correlation scale is approximate, it is typical to compute confidence intervals in the z metric, and to back-transform the endpoints of each confidence interval (CI) to the correlation scale. Any tests of differences among elements of the correlation matrix would be done in the z metric, though these are not common.

Parallel methods can be applied to untransformed correlations, with

ρ̂ = (X′Σ̂⁻¹X)⁻¹ X′Σ̂⁻¹ r,   (20.7)

where X is a stack of k p* × p* indicator matrices, r is the vector of correlations, and Σ̂ is a blockwise diagonal matrix with the matrices Σ̂i on its diagonal. The variance of ρ̂ is

Var(ρ̂) = (X′Σ̂⁻¹X)⁻¹.   (20.8)

These estimators have been extensively studied, and other estimation methods have been proposed. Ingram Olkin and I have shown that the generalized least squares estimators given here are also maximum likelihood estimators (Becker and Olkin 2008). We also provide an estimator that is the matrix closest to a set of correlation matrices, in terms of a general minimum distance function. Meng-Jia Wu has investigated a method based on factored likelihoods and the sweep operator that appears to perform well in many conditions (2006). However, the limitation of Wu's approach is that if any data are missing, the studies must have a nested structure, such that the data missing in any study must be a subset of what is missing in other studies. If a study had no values for r1 and r2, and another study were missing r1, r2, and r3, the data would be nested, but if the second study instead were missing r1 and r3, the data would not be nested and the method could not be applied.

Example. For the ten studies in our example, the mean correlation matrix obtained from analysis of the raw correlations is

ρ̂ = [1     .074  .127  .317
     .074  1     .523  .415
     .127  .523  1     .405
     .317  .415  .405  1]

and the mean based on the Fisher z transformed correlations is

ρ̃ = [1     .092  .141  .364
     .092  1     .586  .449
     .141  .586  1     .438
     .364  .449  .438  1]

The first rows of these matrices contain the correlations of the CSAI subscales with sport behavior. Clearly, the last entries in that row, which represent the average correlation of self confidence with sport behavior, are the largest. As we will see shortly, however, these values should not be interpreted, because the fixed-effects model is not appropriate for these data.
20.3.4.3 Test of Homogeneity, with H01: ρ1 = . . . = ρk or ζ1 = . . . = ζk. In a typical univariate meta-analysis, one of the first hypotheses to be tested is whether all studies appear to be drawn from a single population. This test can conceptually be generalized to the case of correlation matrices, and here the meta-analyst may want to test whether all k correlation matrices appear to be equal or homogeneous. A test of the hypothesis of homogeneity of the k population correlation matrices is a test of H01: ρ1 = . . . = ρk. Similar logic leads to a test of the hypothesis ζ1 = . . . = ζk, which is logically equivalent to the hypothesis H01 that all the ρi are equal. The recommended test, which Fahrbach and I found to have a distribution much closer to the theoretical chi-square distribution than a similar test based on raw correlations (Becker and Fahrbach 1994), uses the statistic

QEZ = z′[Ψ⁻¹ − Ψ⁻¹X(X′Ψ⁻¹X)⁻¹X′Ψ⁻¹]z.   (20.9)

When H01 is true, QEZ has approximately a chi-square distribution with (k − 1)p* degrees of freedom if all studies report all correlation values. If some data are missing, the degrees of freedom equal the total number of observed correlations across studies minus p*.

Mike Cheung and Wai Chan investigated the chi-square test of homogeneity based on raw correlations, and found, as Fahrbach and I did, that the analogue to QEZ computed using raw correlations had a rejection rate that was well above the nominal rate (Cheung and Chan 2005; Becker and Fahrbach 1994). Cheung and Chan proposed using a Bonferroni-adjusted test that draws on the individual tests of homogeneity (see chapter 14, this volume) for the p* relationships represented in the correlation matrix. We reject the hypothesis H01 if at least one of the p* individual homogeneity tests is significant at level α/p*. Cheung and Chan refer to this as the BA1 (Bonferroni At least 1) rule. Thus, for example, in the Craft et al. data with p* = 6 correlations, the adjusted significance level for an α = .05 test would be .05/6 = .0083. Cheung and Chan examined tests of the homogeneity of raw correlations, but this adjustment can also be applied to homogeneity tests for the Fisher z transformed rs.

Example. For the ten studies in our example there are fifty-four reported correlations (k × p* = 10 × 6 = 60 were possible, but six are missing), and the value of QEZ is 209.18 with df = 54 − 6 = 48. This is significant with p < .001. In addition, we can consider the six individual homogeneity tests for the six relationships in the matrix. According to the BA1 rule, at least one test must be significant at the .0083 level. In fact, all three of the homogeneity tests of the CSAI subscales with sport performance are highly significant (with p values well below .0001), and the three sets of correlations among the subscales are each homogeneous.
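
Equation 20.9 is a short quadratic form once the stacked pieces from the pooling sketch are available. The rendering below, including the degrees-of-freedom rule for missing data, is again my own illustrative code rather than anything published with the chapter.

```python
import numpy as np
from scipy.stats import chi2

def homogeneity_test(z, X, Psi_inv, p_star):
    """Q_EZ from equation 20.9, with df = (number of observed rs) - p*."""
    M = Psi_inv - Psi_inv @ X @ np.linalg.inv(X.T @ Psi_inv @ X) @ X.T @ Psi_inv
    q = float(z @ M @ z)
    df = len(z) - p_star
    return q, df, chi2.sf(q, df)   # statistic, df, p-value
```
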
20.3.4.4 Estimating Between-Studies Variation. If a random-effects model seems sensible, either on principle or because the hypothesis of homogeneity just described has been rejected, the meta-analyst will need to estimate the variance across studies in each set of correlations, say τℓ² for the ℓth correlation, with ℓ = 1 to p*. A variety of estimators are described elsewhere; the key issue is how to incorporate the estimates of variation into the analysis. A straightforward approach is to estimate the diagonal matrix T^Z = diag(τ̂1², τ̂2², . . . , τ̂p*²) and add T̂^Z to each Ψi, so that Ψi^RE = Ψi + T̂^Z. Then Ψ^RE (the estimate of the random-effects covariance matrix with the Ψi^RE matrices on its diagonal) would be used in place of Ψ in formulas 20.5 and 20.6, and in formulas 20.10 and 20.11. When analyses are conducted on raw correlations, variance components are estimated in the r metric (I will call the matrix T^R), and the estimate T̂^R is added to each Σ̂i to get Σ̂i^RE = Σ̂i + T̂^R; Σ̂^RE would then be used in place of Σ̂ in further estimates, as shown in equations 20.12 and 20.13 below.

Example. For our example a simple method-of-moments estimator is computed for each set of transformed correlations (see chapter 16, this volume). Based on the homogeneity tests, we expect the τ̂² estimates to be nonzero for the correlations of the CSAI scales with the outcome, but the three estimates for the CSAI intercorrelations to be very close to zero. The values in the z metric are shown in the diagonal matrix

T̂^Z = diag(.159, .075, .096, 0, .013, .012),

and indeed the values are as expected, with one value (the variance component for the correlations between cognitive and somatic anxiety) estimated at just below zero; thus this value was set to 0. Also, the first three entries are larger than the last two. To interpret the values, one can take the square root of each diagonal element to get the estimated standard deviation of the population of Fisher z values for the relationship in question. For instance, the ζ values for correlations of cognitive anxiety with sport behavior have a standard deviation of √.159 = 0.40. If the true average correlation (or ζ) were zero, then roughly 95 percent of the population z values would be in the interval -.781 to .781, and 95 percent of the population correlations would fall between -.65 and .65. This range is analogous to the plausible-values range computed for hierarchical linear model parameters at level 2 (for example, Raudenbush and Bryk 2002). This is quite a wide spread in population correlation values.

20.3.4.5 Random-Effects Mean Correlation. If the meta-analyst believes that a random-effects model is more appropriate for the data, then, as just noted, the value of Ψi^RE (the variance including between-studies differences) would be used in place of Ψi in equation 20.5. Then the random-effects mean ζ̂^RE would be computed as

ζ̂^RE = (X′[Ψ^RE]⁻¹X)⁻¹ X′[Ψ^RE]⁻¹ z,   (20.10)

where X is still the stack of k p* × p* indicator matrices, z is the vector of z-transformed correlations, and Ψ^RE is a blockwise diagonal matrix with the matrices Ψi^RE on its diagonal. As for the fixed-effects estimate, the values in ζ̂^RE would be transformed to the correlation metric for reporting. The variance of ζ̂^RE is

Var(ζ̂^RE) = (X′[Ψ^RE]⁻¹X)⁻¹.   (20.11)

Typically the variances obtained using formula 20.11 will be larger than those using 20.6.

Similarly, we can estimate the mean in the correlation metric, as

ρ̂^RE = (X′[Σ̂^RE]⁻¹X)⁻¹ X′[Σ̂^RE]⁻¹ r,   (20.12)

where X is the stack of k p* × p* indicator matrices, r is the vector of correlations, and Σ̂^RE is the variance of r under random effects. The variance of ρ̂^RE is

Var(ρ̂^RE) = (X′[Σ̂^RE]⁻¹X)⁻¹.   (20.13)

Example. In the data from our example, the first three correlations showed significant variation. Under the random-effects model they are slightly smaller than under the fixed-effects assumptions. The random-effects mean, computed using z values and then transformed, is

ρ̃^RE = [1     .040  .075  .252
        .040  1     .548  .454
        .075  .548  1     .409
        .252  .454  .409  1]

The mean in the correlation metric is quite similar:

ρ̂^RE = [1     .041  .082  .241
        .041  1     .523  .417
        .082  .523  1     .418
        .241  .417  .418  1]

The variance of ρ̂^RE (used below) is

Var(ρ̂^RE) = [.0339  .0033  .0017  0      .0002  .0001
             .0033  .0295  .0011  0      .0001  .0002
             .0017  .0011  .0087  0      .0001  .0001
             0      0      0      .0023  .0004  .0011
             .0002  .0001  .0001  .0004  .0303  .0022
             .0001  .0002  .0001  .0011  .0022  .0048]

Values shown as 0 were not exactly zero, but were smaller than 0.0001.
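
Under this approach the random-effects fit reuses the fixed-effects machinery with inflated covariance blocks. A hedged sketch follows; the τ̂ℓ² values are taken as given (estimated, for example, by the method of moments mentioned above), and the helper assumes the list-of-blocks layout of the earlier pooling code.

```python
import numpy as np

def inflate_blocks(X_rows, psi_blocks, tau2):
    """Form Psi_i^RE = Psi_i + T-hat(Z) for the correlations each study reports."""
    T = np.diag(tau2)                           # diag(tau_1^2, ..., tau_p*^2)
    out = []
    for X_i, Psi_i in zip(X_rows, psi_blocks):
        cols = np.flatnonzero(X_i.sum(axis=0))  # relationships study i reports
        out.append(Psi_i + T[np.ix_(cols, cols)])
    return out

# Re-running gls_pooled_z with the inflated blocks yields the
# random-effects mean and variance of equations 20.10 and 20.11.
```
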
20.3.4.6 Test of No Association, with H02: ρ = 0. Synthesists often want to test whether any of the correlations in the matrix ρ are nonzero, that is, to test H02: ρ = 0. GLS theory provides such a test for the correlation vector. That test, however, has been shown to reject up to twice as frequently as it should for the nominal alpha (Becker and Fahrbach 1994). If we are working with Fisher z transformed data, we can test the hypothesis that ζ = 0, which is logically equivalent to the hypothesis H02: ρ = 0 because ζ = 0 implies ρ = 0. The test uses the statistic

QBZ = ζ̂′[Var(ζ̂)]⁻¹ζ̂ = ζ̂′(X′Ψ⁻¹X)ζ̂.   (20.14)

When the null hypothesis is true, QBZ has approximately a chi-square distribution with p* degrees of freedom. Fahrbach and I found that the rejection rate of QBZ for α = .05 was never more than .07, but was usually within .01 of α = .05, and when the within-study sample size was 250, dropped below .05 in some cases (1994). Rejecting H02 implies that at least one element of the matrix is nonzero, but does not mean that all elements are nonzero.

A similar test for a subset of correlations can be obtained by selecting a submatrix of m values from ζ̂ and using the corresponding submatrices of X and Ψ to compute QBZ for the subset of correlations.

If one has adopted a random-effects model, on the basis of theory or because of rejecting either of the homogeneity test criteria, then one would substitute Ψ^RE for Ψ in 20.14 and use the random-effects mean ζ̂^RE in place of ζ̂. Typically, the test value computed under these conditions will be smaller than the value defined in 20.14; effects are thus less likely to be judged significant in the context of the random-effects model.

Example. Under both the fixed- and random-effects models, the value of QBZ is quite large. Under the more appropriate random-effects model, QBZ = 402.44 with df = 6, so at least one population correlation differs from 0. However, individual random-effects tests of the significance of the first two correlations, of cognitive and somatic anxiety with performance, are not significant: z = 0.29 for the cognitive anxiety-performance correlation and z = 0.76 for somatic anxiety with performance.
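
Given the pooled estimate and its variance, equation 20.14 reduces to a quadratic form. A two-line sketch (names mine):

```python
import numpy as np
from scipy.stats import chi2

def no_association_test(zeta_hat, V):
    """Q_BZ = zeta-hat' [Var(zeta-hat)]^{-1} zeta-hat, df = p* (eq. 20.14)."""
    q = float(zeta_hat @ np.linalg.solve(V, zeta_hat))
    return q, chi2.sf(q, len(zeta_hat))
```
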
20.3.4.7 Estimating Linear Models. In many model-driven meta-analyses, synthesists will also want to estimate path models such as those shown in figures 20.3 and 20.4. We draw on either ρ̂^RE and its variance or ρ̃^RE and its variance for our computations. Let us consider a model to be estimated under random-effects assumptions. Here I use ρ̂^RE, the mean based on correlations, to avoid using the approximate inverse variance transformation for Var(ρ̃^RE). We begin by arranging the mean correlations in ρ̂^RE into a square form, which we call R̄. The matrix R̄ is partitioned so that the mean correlations of the predictors with the outcome of interest are put into a submatrix we call R̄XY, and the intercorrelations among the predictors are put into R̄XX. Different subsets of R̄ are used depending on the model or models to be estimated.

Suppose that we are working with the 4 × 4 matrix of means from the Craft et al. data, and the matrix is arranged so that correlations involving sport performance are in the first row and column. We would partition the matrix to estimate a model with performance as the outcome as follows: R̄XY would be the vector of three values (.040, .075, .252)′, and R̄XX is the lower square submatrix in

R̄ = [1     .040  .075  .252
     .040  1     .548  .454    =  [ 1     R̄′XY
     .075  .548  1     .409        R̄YX   R̄XX ].
     .252  .454  .409  1]

The product b* = R̄XX⁻¹ R̄XY is then computed to obtain the standardized regression slopes in the structural model. Schram and I provided further details about the estimation of variances and tests associated with such models (Becker and Schram 1994). The variance of b* is obtained as a function of Var(R̄) = Var(ρ̂^RE) that depends on the specific model being estimated.

Cheung and Chan noted that multilevel structural equation modeling programs can be used if there is no interest in obtaining the mean correlation matrix as an intermediate step (2005). These analyses, however, assume that a fixed-effects model applies; if that is not likely to be true, using such a model may not be advisable. Also, their approach does not appear to incorporate the variances of the mean correlations, thus the standard errors from their approach may be underestimated.

Example. I have estimated the path diagram shown in figure 20.3 for the example data on anxiety and sport behavior, under random-effects assumptions; the results appear in figure 20.6. For this subset of the original Craft et al. data, only self confidence appears to be a significant direct predictor of sport behavior under the random-effects model. Dashed lines represent paths that were tested but found not significant. However, both cognitive and somatic aspects of anxiety appear to have indirect effects through self confidence: both are significant predictors of self confidence, even under the random-effects model.

Figure 20.6  Random-Effects Path Coefficients for Prediction of Sport Performance. [Path diagram: cognitive anxiety (X1) and somatic anxiety (X2) predict self confidence (X3), with coefficients .32* and .23*; the paths from X1, X3, and X2 to sport performance (Y) are .10, .29*, and .01, respectively, with nonsignificant paths dashed.] source: Author's compilation.

In general, when one presents more information than just the path coefficients and some indication of their significance on path diagrams, the images get very busy; thus it is not a good idea to add standard errors and the like. In the original Craft et al. meta-analysis, tables of slope coefficients and their standard errors, and z tests computed as the ratios of these quantities, were presented (2003). This information can also be presented in text, such as by discussion of the slopes. In describing this model, the meta-analyst might mention that a one-standard-deviation shift in self confidence has nearly a third of a standard deviation impact on sport performance (with a slope of .29, SE = .115, and z = 2.43, p = .011). Other slopes would also be discussed similarly.
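
The slope computation itself is a small piece of linear algebra. Below is a sketch using the random-effects mean correlations reported above (values transcribed from the example; the partitioning helper is my own), for the equation predicting performance from all three predictors.

```python
import numpy as np

# Random-effects mean correlations (performance in row/column 0).
R_bar = np.array([[1.000, .040, .075, .252],
                  [.040, 1.000, .548, .454],
                  [.075, .548, 1.000, .409],
                  [.252, .454, .409, 1.000]])

def standardized_slopes(R, outcome=0):
    """b* = inv(R_XX) @ R_XY for a model predicting one outcome."""
    pred = [i for i in range(R.shape[0]) if i != outcome]
    R_XX = R[np.ix_(pred, pred)]    # predictor intercorrelations
    R_XY = R[pred, outcome]         # predictor-outcome correlations
    return np.linalg.solve(R_XX, R_XY)

print(standardized_slopes(R_bar))  # slopes for cognitive, somatic, self confidence
```
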
20.3.4.8 Moderator Analyses. A last step one might take in analyzing a model in a meta-analysis is to examine categorical moderator variables. To do this, separate models are estimated for the groups of interest and compared. For the example data set, the moderator I examine is the type of sport activity: specifically, team sports (studies with Type = 1 in table 20.1) versus individual sports (Type = 2). Four team-sport studies and six individual-sport studies are in the example data. Figure 20.7 shows the slopes from the two samples. The arrow endpoints distinguish differences in significance of relations for the two groups, with curved arrows for team sports, square-point arrows for individual sports, and open arrow points when both groups show similar relationships. Again, the dashed lines represent paths with coefficients that do not differ from zero.

Figure 20.7  Random-Effects Path Coefficients by Type of Sport. [Path diagram: paths from cognitive anxiety (X1) and somatic anxiety (X2) to self confidence (X3) are .31* and .09 for team sports, and .24* and .36* for individual sports; paths from X1, X3, and X2 to sport performance (Y) are .02, .06, and .08 for team sports, and .12*, .39*, and .05 for individual sports.] source: Author's compilation.

Even though the data set is relatively small, some differences are apparent in figure 20.7. Overall, it seems that the anxiety measures do not predict sport performance well for team sports, whereas most aspects of anxiety are important for individual sports, with either direct or indirect effects or both. I discuss only one comparison. The contribution of self confidence to performance appears to be quite different for the two types of sports. Because the studies are independent, we can construct a test of differences in the slopes as

z = (b*T − b*I)/√[Var(b*T) + Var(b*I)] = (.06 − .39)/√(.0049 + .0023) = −.33/.085 = −3.89,

which in large samples is approximately normal, and thus can be compared to the critical values of the standard normal distribution. For a two-sided test at the .05 level, this difference would clearly be significant, leading us to reject the hypothesis that b*T = b*I for this relationship. Self-confidence appears to be more important to performance for individual sports than for team sports. The slope representing its contribution to performance also differs from zero for the individual-sport studies, but not for team sports.
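
The slope-difference test is simple enough to state in code. A sketch of the comparison just shown, with values transcribed from the text:

```python
import numpy as np
from scipy.stats import norm

def slope_difference_z(b_team, var_team, b_indiv, var_indiv):
    """z test for equality of slopes from independent subgroups."""
    z = (b_team - b_indiv) / np.sqrt(var_team + var_indiv)
    return z, 2 * norm.sf(abs(z))     # two-sided p-value

z, p = slope_difference_z(.06, .0049, .39, .0023)
print(round(z, 2), round(p, 4))       # approximately -3.89, p < .001
```
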

20.4 SUMMARY AND FUTURE POSSIBILITIES

Although the steps and data required to implement a model-driven meta-analysis are definitely more complex and involved than those for a typical univariate meta-analysis, if the questions of interest are suitable, the benefits of the model-driven analyses are well worth the effort. I have illustrated the strengths of model-based analyses for examining such complexities as partial relationships, mediating variables, and indirect effects. The small data set used for the examples did not allow for a good illustration of the ways model-driven meta-analysis can help to identify areas in need of further exploration, but some of the examples drawn from the literature make clear that even when data are sparse, much can be learned.

There is still work to be done in this area, in spite of the good number of dissertations that have addressed questions about estimation methods and the impact of missing data in this complicated realm. The development of tests of indirect effects as well as overall goodness-of-fit tests would be of value, though Cheung and Chan argued that one can get such tests by using multilevel structural equation modeling packages, at least when the constraints of those packages, such as belief in the fixed-effects model, are reasonable (2005).

Finally, the synthesist may ask whether the model-driven meta-analysis approach of estimating a regression model from a summary of correlation matrices is preferable to a direct synthesis of regression models. In a separate study, a colleague and I have outlined some of the complications that arise when summarizing sets of regression models (Becker and Wu 2007). Some of those complications are avoided with the model-driven meta-analysis approach. In particular, to summarize raw regression coefficients, one must be sure that the predictors and outcome are either measured using the same scale across studies, or that the slopes can be transformed to be on a common standardized scale. Because correlations are scale-free, this issue is skirted in model-driven meta-analysis.

More problematic is that technically, for raw slopes or even standardized slopes to be comparable, the same set of predictors should appear in all of the regression models to be combined. This is almost never the case, given that researchers typically build on and add to models already appearing in the literature, thus making exact replications very rare. Wu and I have shown the conditions under which a summary of slopes from identical models will be equivalent to the same regression model computed from the pooled primary data (Becker and Wu 2007). It is unclear, however, how close the summary of slopes and the pooled analysis results would be if the models were not identical across studies. One might conjecture that the summary result will be closer to the raw-data analysis the less correlated the predictors are, but this problem is still under study, and thus making strong statements is not yet advisable.

One last development related to the issue of combining regressions addresses the question of whether one could directly summarize an index of partial relationship based on a regression analysis. Ariel Aloe and I have proposed using the semi-partial correlation for this purpose (in press). This index can be computed from information typically presented in multiple regression studies (specifically, from a t test of the slope and the multiple R² for the equation), and does not require the full correlation matrix among predictors to be reported. Its one drawback is that when studies use different sets of predictors, as discussed earlier, the index estimates different partial relationships across studies, and its interpretation may depend on the particular set of predictors examined in each study. Although work on this index is in its early stages, it holds some promise for summaries of regression studies that do not report zero-order correlations.

20.5 ACKNOWLEDGMENTS

This work was supported by grant REC-0634013 from the National Science Foundation to Betsy Becker and Ingram Olkin. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

20.6 REFERENCES

Ahn, Soyeon. 2008. "Application of Model-Driven Meta-Analysis and Latent Variable Framework in Synthesizing Studies Using Diverse Measures." PhD diss., Michigan State University.
Aloe, Ariel M., and Betsy J. Becker. In press. "Advances in Combining Regression Results." In The SAGE Handbook of Methodological Innovation, edited by Malcolm Williams and Paul Vogt. London: Sage Publications.
Becker, Betsy J. 1992a. "Models of Science Achievement: Factors Affecting Male and Female Performance in School Science." In Meta-Analysis for Explanation: A Casebook, edited by Thomas D. Cook, Harris M. Cooper, David S. Cordray, Larry V. Hedges, Heidi Hartmann, Richard J. Light, Thomas A. Louis, and Frederick Mosteller. New York: Russell Sage Foundation.
Becker, Betsy J. 1992b. "Using Results from Replicated Studies to Estimate Linear Models." Journal of Educational Statistics 17(4): 341–62.
Becker, Betsy J. 1995. "Corrections to 'Using Results from Replicated Studies to Estimate Linear Models.'" Journal of Educational Statistics 20(1): 100–102.
Becker, Betsy J. 2000. "Multivariate Meta-analysis." In Handbook of Applied Multivariate Statistics and Mathematical Modeling, edited by Howard E. A. Tinsley and Steven D. Brown. San Diego, Calif.: Academic Press.
Becker, Betsy J. 2001. "Examining Theoretical Models Through Research Synthesis: The Benefits of Model-Driven Meta-Analysis." Evaluation and the Health Professions 24(2): 190–217.
Becker, Betsy J., and Kyle R. Fahrbach. 1994. "A Comparison of Approaches to the Synthesis of Correlation Matrices." Paper presented at the annual meeting of the American Educational Research Association, New Orleans, La. (1994).
Becker, Betsy J., and Ingram Olkin. 2008. "Combining Correlation Matrices from Multiple Studies." Manuscript under review for Journal of Educational and Behavioral Statistics.
Becker, Betsy J., and Christine M. Schram. 1994. "Examining Explanatory Models Through Research Synthesis." In The Handbook of Research Synthesis, edited by Harris M. Cooper and Larry V. Hedges. New York: Russell Sage Foundation.
Becker, Betsy J., and Meng-Jia Wu. 2007. "The Synthesis of Regression Slopes in Meta-Analysis." Statistical Science 22(3): 414–29.
Beery, John R. 1962. "Does Professional Preparation Make a Difference?" Journal of Teacher Education 13(4): 386–95.
Brown, Sharon A., and Larry V. Hedges. 1994. "Predicting Metabolic Control in Diabetes: A Pilot Study Using Meta-Analysis to Estimate a Linear Model." Nursing Research 43(6): 362–68.
Cheung, Mike W. L., and Wai Chan. 2005. "Meta-Analytic Structural Equation Modeling: A Two-Stage Approach." Psychological Methods 10(1): 40–64.
Collins, David, Knowlton W. Johnson, and Betsy J. Becker. 2007. "A Meta-Analysis of Effects of Community Coalitions Implementing Science-Based Substance Abuse Prevention Interventions." Substance Use and Misuse 42(6): 985–1007.
Craft, Lynette L., T. Michelle Magyar, Betsy J. Becker, and Debra L. Feltz. 2003. "The Relationship Between the Competitive State Anxiety Index-2 and Athletic Performance: A Meta-analysis." Journal of Sport and Exercise Psychology 25(1): 44–65.
Eccles-Parsons, Jacquelynne. 1984. "Sex Differences in Mathematics Participation." In Advances in Achievement and Motivation, edited by Marjorie W. Steinkamp and Martin L. Maehr. Greenwich, Conn.: JAI Press.
Edwards, Tara, and Lew Hardy. 1996. "The Interactive Effects of Intensity and Direction of Cognitive and Somatic Anxiety and Self-Confidence Upon Performance." Journal of Sport and Exercise Psychology 18(3): 296–312.
Fisher, Ronald A. 1921. "On the 'Probable Error' of a Coefficient of Correlation Deduced from a Small Sample." Metron 1: 1–32.
Furlow, Carolyn F., and S. Natasha Beretvas. 2005. "Meta-Analytic Methods of Pooling Correlation Matrices for Structural Equation Modeling Under Different Patterns of Missing Data." Psychological Methods 10(2): 227–54.
Glass, Gene V. 1976. "Primary, Secondary, and Meta-analysis of Research." Educational Researcher 5(10): 3–8.
Goldhaber, Dan D., and Dominic J. Brewer. 1999. "Teacher Licensing and Student Achievement." In Better Teachers, Better Schools, edited by Marci Kanstoroom and Chester F. J. Finn. Washington, D.C.: Thomas B. Fordham Foundation.
Hafdahl, Adam R. 2001. "Multivariate Meta-Analysis for Exploratory Factor Analytic Research." PhD diss., University of North Carolina at Chapel Hill.
Hedges, Larry V., and Jack L. Vevea. 1998. "Fixed- and Random-Effects Models in Meta-Analysis." Psychological Methods 3(4): 486–504.
Hodges, William F. 1991. Interventions for Children of Divorce, 3rd ed. New York: John Wiley & Sons.
Judd, Charles M., and David A. Kenny. 1981. "Process Evaluation: Estimating Mediation in Treatment Evaluations." Evaluation Review 5(5): 602–19.
Kalaian, Hripsime A., and Stephen W. Raudenbush. 1996. "A Multivariate Mixed Linear Model for Meta-Analysis." Psychological Methods 1(3): 227–35.
Kaplan, David. 2000. Structural Equation Modeling: Foundations and Extensions. Newbury Park, Calif.: Sage Publications.
Lipsey, Mark W. 1997. "Using Linked Meta-Analysis to Build Policy Models." In Meta-Analysis of Drug Abuse Prevention Programs, edited by William J. Bukoski. NIDA Research Monograph 170. Rockville, Md.: National Institute of Drug Abuse.
Lipsey, Mark W., and James H. Derzon. 1998. "Predictors of Violent or Serious Delinquency in Adolescence and Early Adulthood: A Synthesis of Longitudinal Research." In Serious and Violent Juvenile Offenders: Risk Factors and Successful Interventions, edited by Rolf Loeber and David P. Farrington. Thousand Oaks, Calif.: Sage Publications.
MacKinnon, David P., Chondra M. Lockwood, Jeanne M. Hoffman, Stephen G. West, and Virgil Sheets. 2002. "A Comparison of Methods to Test Mediation and Other Intervening Variable Effects." Psychological Methods 7(1): 83–104.
Martens, Rainer, Robin S. Vealey, and Damon Burton, eds. 1990. Competitive Anxiety in Sport. Champaign, Ill.: Human Kinetics.
Olkin, Ingram, and Minoru Siotani. 1976. "Asymptotic Distribution of Functions of a Correlation Matrix." In Essays in Probability and Statistics, edited by S. Ikeda et al. Tokyo: Shinko Tsusho Co., Ltd.
Raudenbush, Stephen W., and Anthony S. Bryk. 2002. Hierarchical Linear Models: Applications and Data Analysis Methods. Thousand Oaks, Calif.: Sage Publications.
Raudenbush, Stephen W., Betsy Jane Becker, and Hripsime Kalaian. 1988. "Modeling Multivariate Effect Sizes." Psychological Bulletin 103(1): 111–20.
Sidik, Kurex, and Jeffrey N. Jonkman. 2005. "Simple Heterogeneity Variance Estimation for Meta-Analysis." Applied Statistics 54(2): 367–84.
Thum, Yeow Meng, and Soyeon Ahn. 2007. "Challenges of Meta-Analysis from the Standpoint of Latent Variable Framework." Paper presented at the 7th Annual Campbell Collaboration Colloquium, London (May 14–17, 2007).
Whiteside, Mary F., and Betsy J. Becker. 2000. "The Young Child's Post-Divorce Adjustment: A Meta-Analysis with Implications for Parenting Arrangements." Journal of Family Psychology 14(1): 1–22.
Wu, Meng-Jia. 2006. "Methods of Meta-analyzing Regression Studies: Applications of Generalized Least Squares and Factored Likelihoods." PhD diss., Michigan State University.
21
HANDLING MISSING DATA

THERESE D. PIGOTT
Loyola University, Chicago

CONTENTS
21.1 Introduction 400
21.2 Types of Missing Data 400
21.2.1 Missing Studies 400
21.2.2 Missing Effect Sizes 400
21.2.3 Missing Descriptor Variables 401
21.3 Reasons for Missing Data in Meta-Analysis 402
21.3.1 Missing Completely at Random 402
21.3.2 Missing at Random 403
21.3.3 Not Missing at Random 403
21.4 Commonly Used Methods for Missing Data in Meta-Analysis 403
21.4.1 Complete Case Analysis 403
21.4.1.1 Complete Case Analysis Example 404
21.4.2 Available Case Analysis 404
21.4.2.1 Available Case Analysis Example 406
21.4.3 Single-Value Imputation with the Complete Case Mean 406
21.4.3.1 Imputing the Complete Case Mean 406
21.4.3.1.1 Complete Case Mean Imputation Example 407
21.4.3.2 Single-Value Imputation with Conditional Means 407
21.4.3.2.1 Regression Imputation Example 408
21.4.4 Missing Effect Sizes 408
21.4.5 Summary of Simple Methods for Missing Data 409
21.5 Model-Based Methods for Missing Data 409
21.5.1 Maximum Likelihood Methods Using the EM Algorithm 409
21.5.1.1 Maximum Likelihood Methods 410
21.5.1.2 Example of Maximum Likelihood Methods 411
21.5.1.3 Other Applications of Maximum Likelihood Methods 411


21.5.2 Multiple Imputation 412


21.5.2.1 Multiple Imputation and Missing Data 413
21.5.2.2 Example of Multiple Imputation 413
21.6 Recommendations 413
21.7 References 415

21.1 INTRODUCTION

This chapter discusses what researchers can do when studies are missing the information needed for meta-analysis. Despite careful evaluation of coding decisions, researchers will find that studies in a research synthesis invariably differ in the types and quality of the information reported. Here I examine the types of missing data that occur in a research synthesis, and discuss strategies synthesists can use when faced with missing data.

Problems caused by missing data can never be entirely alleviated. As such, the first strategy for addressing missing data should be to contact the study authors. Doing so remains a viable strategy in some cases. When it fails, the next strategy is to use statistical missing data methods to check the sensitivity of results to different assumptions about the distribution of the hypothetically complete data and the reasons for the missing data. We can think of these methods as a trade-off: if observations are missing in the data, then the cost is the need to make assumptions about the data and the mechanism that causes the missing data, assumptions that are difficult to verify in practice. Synthesists can feel more confident about results that are robust to different assumptions about the missing data. On the other hand, results that differ depending on the missing data assumptions may indicate that the sample of studies does not provide enough evidence for robust inference. Joseph Schafer and John Graham pointed out that the main goal of statistical methods for missing data is not to recover or estimate the missing values but to make valid inferences about a population of interest (2002). They thus noted that the missing data method is embedded in the particular model or testing procedure the analyst is using. My goal in this chapter is to introduce methods for estimating effect size models when either effect sizes or predictors in the model are missing from primary studies.

21.2 TYPES OF MISSING DATA

Synthesists encounter missing data in three major areas: missing studies, missing effect sizes, and missing study descriptor variables. Although the reasons for missing observations in any of these three areas vary, each type of missing data presents difficulties for the synthesist.

21.2.1 Missing Studies

A number of mechanisms lead to studies missing in a research synthesis. Researchers in both medicine and the social sciences have documented the bias in published literature toward statistically significant results (for example, Rosenthal 1979; Hemminki 1980; Smith 1980; Begg and Berlin 1988). In this case, studies that find nonsignificant statistical results are less likely to appear in the published literature (see chapters 6 and 23, this volume). Another reason studies may be missing in a synthesis is lack of accessibility; some studies are unpublished reports that are not identifiable through commonly used search engines or are not easily accessed by synthesists. For example, Matthias Egger and George Davey Smith (1997) and Peter Jüni et al. (2002) demonstrated that studies published in languages other than English may have different results than those published in English.

Researchers undertaking a comprehensive synthesis expend significant effort identifying and obtaining unpublished studies to maintain the representative nature of the sample for the synthesis. Strategies for preventing, assessing, and adjusting for publication bias are examined in chapter 23 of this volume, and by Hannah Rothstein, Alexander Sutton, and Michael Borenstein (2005). Because publication bias is treated more thoroughly elsewhere, I focus here on missing data within studies, and not on methods for adjusting analyses for publication bias.
21.2.2 Missing Effect Sizes

A common problem in research syntheses is missing information for computing an effect size. For example, An-Wen Chan and colleagues found that 50 to 65 percent of the outcomes in a set of 102 published clinical trials were incompletely reported (2004). Missing effect sizes occur when studies are missing the descriptive statistics needed to compute an effect size and the effect size's standard error. The typical information needed includes simple descriptive statistics such as the means, standard deviations, and sample sizes for standardized mean differences, the correlation and sample size for correlations, or the frequencies in a two by two table for odds ratios.

In some studies, the effect sizes are missing because the primary researcher did not report enough information to compute an effect size. Chan and colleagues referred to these studies as including incompletely reported outcomes (2004), making a distinction between various categories of incompletely reported outcomes. Studies with partially reported outcomes usually include the value of the effect size and may report the sample size or p-value, or both. Studies with qualitatively reported outcomes may include only the p-value, with or without the sample size. In these two cases of incompletely reported outcomes, an estimated effect size could be computed based on the p-value and sample sizes. William Shadish, Leslie Robinson, and Congxiao Lu's effect size calculator computes effect sizes from more than forty types of data, including p-values, sample sizes, and statistical tests (1997). As journals begin requiring the use of effect sizes in reporting primary study results, problems with computing effect sizes may decrease.

The more difficult missing outcome category is Chan et al.'s unreported outcomes, where no information is given, leading to a missing effect size. Missing outcomes in any statistical analysis, and missing effect sizes in a research synthesis, pose a difficult problem. When a measure of outcome is missing for a case in any statistical analysis, the case provides limited information for estimating statistical models and making statistical inferences.

One primary reason for missing effect sizes relates to the effect size itself. A primary research study may not provide descriptive statistics for an outcome that did not yield a statistically significant result. Chan and his colleagues found that the odds of an outcome being completely reported, with enough information to compute an effect size, were twice as high if that outcome was statistically significant (2004). As discussed later in the chapter, when outcomes are missing because of their actual values, the options for analysis require strong assumptions and more complex statistical procedures.

Incomplete reporting on outcomes can occur at the level of the primary outcome for a study, or in subgroup analyses within a study (Hahn et al. 2000). If a study does not include information on the primary outcome, the primary focus of the meta-analysis, then the study is missing an effect size. However, a synthesist may also be interested in subgroup analyses in a given study. For example, Hahn and colleagues found that in a set of studies on the use of routine malaria prophylaxis in pregnant women living in high-risk areas, only one study compared the benefits of treatment among first-time pregnant mothers versus mothers who had a prior pregnancy. They referred to this issue as selective reporting within a study. Though the researchers talked about this problem as missing effect sizes within studies, it can also be conceptualized as a problem of a missing descriptor variable: only one study in the Hahn et al. example reports on whether participants were pregnant for the first time. This chapter treats selective reporting of outcomes in a study as a missing descriptor variable problem instead of as a missing outcome. In some cases, the primary researcher may not have collected information on this predictor. In others, however, the researcher may have collected the information, but not reported the results. Paula Williamson and her colleagues found that one reason for not reporting subgroup analyses (or, alternatively, descriptor variables) is that the results for these analyses are not statistically significant (2005). As will be seen later, if descriptor variables are missing because they are not statistically related to the outcome within a study, the synthesist must use missing data methods that require stronger assumptions and more complex analyses.
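As an illustration of the kind of conversion performed by effect size calculators such as Shadish, Robinson, and Lu's, the sketch below recovers a standardized mean difference from a reported two-tailed p-value and the two group sizes. It is a minimal example, not any published tool: it assumes the p-value came from an independent-groups t-test, uses one common large-sample approximation for the variance of d, and the function name and sign argument are invented for the illustration.

```python
# Minimal sketch: invert a two-tailed p-value to |t|, then convert t to
# Cohen's d. Assumes the p-value came from an independent-groups t-test.
from scipy import stats

def d_from_p(p_two_tailed, n1, n2, sign=1.0):
    """Estimate d and its variance from a reported p-value and group sizes.

    `sign` must be supplied from the study's reported direction of effect,
    since the p-value alone does not carry it.
    """
    df = n1 + n2 - 2
    t = stats.t.isf(p_two_tailed / 2.0, df)        # |t| implied by the p-value
    d = sign * t * (1.0 / n1 + 1.0 / n2) ** 0.5    # standardized mean difference
    var_d = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))  # large-sample variance
    return d, var_d

# A study reporting only "p = .04, n1 = n2 = 25, treatment group higher":
print(d_from_p(0.04, 25, 25, sign=1.0))
```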
21.2.3 Missing Descriptor Variables

A major task in the data evaluation phase of a meta-analysis involves coding aspects of a study's design. Because most studies in a given area are not strict replications of earlier work, studies differ not only in their research design, but also in their data collection strategies and reporting practices. Studies may be missing particular descriptor variables because certain information was not collected or was not reported. Missing descriptor variables are an inherent problem in meta-analysis; not every study will collect the same data. In the Seokyung Hahn et al. paper, not all studies reported (or collected) information on whether the participants had had a prior pregnancy (2000). If the synthesist is interested in modeling how the effect of prophylactic malaria treatment differs as a function of whether a woman is pregnant for the first time, those studies not reporting the number of previous pregnancies are missing a key predictor.

As I discuss later, statistical methods for missing data require particular assumptions about why variables are missing. Thus it is important to explore possible reasons for missing descriptor variables in a primary study.

In one sense, missing data on study descriptors arise because of the different value synthesists and primary researchers place on particular information. Synthesists may take a wider perspective on an area of research, and primary authors may face many constraints in both designing and reporting study results.

For example, disciplinary practices in the field may dictate how primary authors collect, measure, and report certain information. Robert Orwin and David Cordray used the term macrolevel reporting to refer to practices in a given research area that influence how constructs are defined and reported (1985). For example, Selcuk Sirin found several measures of socioeconomic status reported in his meta-analysis of the relationship between academic achievement and socioeconomic status (2005). These measures included parent scores on Hollingshead's occupational status scale, parental income, parent education level, free lunch eligibility in school, and composites of these measures. Differences in the types of socioeconomic status (SES) measures reported could derive from traditional ways disciplines have reported SES. In this case, differences in reporting between primary studies may relate to the disciplinary practices in the primary authors' field.

Primary authors also differ from each other in writing style and thoroughness of reporting. Orwin and Cordray referred to individual differences of primary authors as microlevel reporting quality (1985). In this case, whether a given descriptor is reported across a set of studies may be random, given that it depends on individual writing preferences and practices.

Primary authors may also be constrained by a particular journal's publication practices, and thus do not report on information a synthesist considers important. Individual journal reporting constraints would likely result in descriptors missing randomly rather than missing because their values were not acceptable.

Another reason study descriptors may be missing from primary study reports relates to bias for reporting only statistically significant results. If, for example, gender does not have a significant relationship to a given outcome (such as SES), then a primary author may not report the gender distribution of the sample. Williamson and her colleagues provided a discussion of this problem, raising the possibility that the likelihood of a study reporting a given descriptor variable is related to the statistical significance of the relationship between the variable and the outcome (2005).

Researchers may also avoid reporting study descriptors when those values are not generally acceptable. For example, if a researcher obtains a racially homogeneous sample, the researcher may hesitate to report fully the actual distribution of ethnicity in the sample. This type of selective reporting may also occur when reporting in a study is influenced by a desire to clean up the results. For example, attrition information may be missing in a published study to present a more positive picture. In both of these cases, the values of the study descriptor influence whether the researcher reports those values.
21.3 REASONS FOR MISSING DATA IN META-ANALYSIS

Statisticians have been studying methods for handling missing data for many years, especially in the area of survey research. Since the publication of Roderick Little and Donald Rubin's seminal book (1987), statisticians have developed many more sophisticated frameworks and analysis strategies. These strategies all rely on specific assumptions the analyst is willing to make about the reasons for missing data, and about the joint distribution of all variables in the data. Rubin first introduced a framework for describing the response mechanism, the reasons why data may be missing in a given research study (1976). This framework describes data as missing completely at random, missing at random, and not missing at random.

21.3.1 Missing Completely at Random

We refer to data as missing completely at random (MCAR) when the missing data mechanism results in observations missing randomly across the data set. The classic example is the missing pattern created when ink is accidentally spilled on a paper copy of a spreadsheet. Although we are no longer plagued by ink spills, other mechanisms could be imagined to operate resulting in MCAR data. For example, microlevel reporting practices may cause primary authors to differ from each other in the information presented in a study, and the result of those differences could be values of a particular variable missing in a completely random way. When data are MCAR, we assume that the values of the missing variable are not related to the probability that they are missing, and that the values of other variables in the data set are also not related to the probability that the variable has missing values. More important, when data are MCAR, the studies with completely observed data are a random sample of the studies originally identified for the synthesis.

21.3.2 Missing at Random

We refer to data as missing at random (MAR) when the probability of missing a value on a variable is not related to the missing value itself, but may be related to other completely observed variables in the data set. For example, Sirin reported on a number of moderator analyses in his meta-analysis of the relationship between SES and academic achievement (2005). One moderator examined is the type of component used to measure SES. These components include parental education, parental occupation, parental income, and eligibility for free or reduced lunch. A second is the source of the information on SES, whether parent, student, or a secondary source. Imagine if all studies report the source of information on SES, but not all report the component used. It is likely that the source of information on SES is highly related to the component of SES measured in the study. Students are much less likely to report income, but may be more likely to report on parental education or occupation. Secondary sources of income are usually derived from eligibility for free or reduced lunch. In this case, the component of SES is MAR because we can assume that the likelihood of observing any given component of SES depends on the completely observed value, the source of SES information. Note that MAR refers to what is formally called the response mechanism. This is generally unobservable, unless a researcher can gather direct information from participants about why they did not respond to a given survey question, or from primary authors about why they did not collect a particular variable. The assumption of MAR does not depend on measuring variables without error; it specifies only the assumed relationship between the likelihood of observing a particular variable given another completely observed variable in the data set.

21.3.3 Not Missing at Random

We refer to data as not missing at random (NMAR) when the probability of observing a given value for a variable is related to the missing value itself. For example, effect sizes are not missing at random when they are not reported in a study because they are not statistically significant (see Chan et al. 2004; Williamson et al. 2005). In this example, data are not missing at random due to a censoring mechanism that results in certain values having a higher probability of being missing than other values. Meta-analysts have developed methods to handle censored effect size data in the special case of publication bias (for example, Hedges and Vevea 1996; Vevea and Woods 2005; chapter 23, this volume).

Missing study descriptors could also be NMAR. Primary authors whose research participants are racially homogeneous may not report fully on the distribution of ethnicities among participants as a way to provide a more positive picture of their study. As with the effect size example, the actual ethnic distributions of the samples are related to their probability of being missing.

Synthesists rarely have direct evidence about the reasons for missing data and need to assume data are MCAR, MAR, or NMAR if they are to choose an appropriate analysis that will not lead to inferential errors, or else to conduct sensitivity analyses to different assumptions about the reasons for the missing data. In the rest of the chapter, I discuss the options available to meta-analysts when faced with various forms of missing data.
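The three mechanisms are easy to confuse in the abstract, so a small simulation may help fix the distinction. The sketch below is illustrative only: it generates a hypothetically complete set of effect sizes related to one observed descriptor, deletes values under each mechanism, and shows that the complete-case mean can be trusted only under MCAR. All variable names and parameter values are invented for the example.

```python
# Toy simulation of the three response mechanisms. The effect size (es)
# is related to a completely observed study descriptor (x).
import numpy as np

rng = np.random.default_rng(42)
k = 5000                                      # hypothetical studies
x = rng.normal(0.0, 1.0, k)                   # completely observed descriptor
es = 0.2 + 0.3 * x + rng.normal(0, 0.5, k)    # hypothetically complete effect sizes

# MCAR: missingness ignores every variable in the data set
mcar = rng.random(k) < 0.30
# MAR: missingness depends only on the observed descriptor x
mar = rng.random(k) < np.where(x > 0, 0.50, 0.10)
# NMAR: missingness depends on the unobserved effect size itself
nmar = rng.random(k) < np.where(es < 0.2, 0.50, 0.10)

print(f"true mean effect size: {es.mean():.3f}")
for label, miss in [("MCAR", mcar), ("MAR", mar), ("NMAR", nmar)]:
    print(f"{label}: complete-case mean = {es[~miss].mean():.3f}")
```

Under MAR the complete-case mean drifts even though the mechanism never looked at the effect sizes themselves, because the retained studies are no longer a random sample; this is the bias complete case analysis incurs in section 21.4.1 below.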
21.4 COMMONLY USED METHODS FOR MISSING DATA IN META-ANALYSIS

Before Little and Rubin's (1987) work, most researchers used one of three strategies to handle missing data: using only cases with all variables completely observed (listwise deletion), using available cases that have particular pairs of variables observed (pairwise deletion), or replacing missing values in a given variable with a single value such as the mean for the complete cases (single-value imputation). The performance of these methods depends on the assumptions about why the data are missing. In general, the methods produce unbiased estimates only when data are missing completely at random. Even then, the main problem is that the standard errors do not accurately reflect the fact that variables have missing observations.

21.4.1 Complete-Case Analysis

In complete-case analysis, the researcher uses only those cases with all variables fully observed. This procedure, also known as listwise deletion, is usually the default procedure in many statistical computer packages. When some cases are missing values of a particular variable, only cases observing all the variables in the analysis are used. When the missing data are missing completely at random, the complete cases can be considered a random sample from the originally identified set of cases. Thus, if a synthesist can make the assumption that values are in fact missing completely at random, using only complete cases will produce unbiased results.

In meta-analysis as in other studies, using only complete cases can seriously limit the number of observations available for the analysis. Losing cases decreases the power of the analysis and ignores the information contained in the incomplete cases (Kim and Curry 1977; Little and Rubin 1987).

When data are NMAR or MAR, complete case analysis yields biased estimates because the complete cases cannot be considered representative of the original sample. With NMAR data, the complete cases observe only part of the distribution of a particular variable. With MAR data, the complete cases are also not a random sample from a particular variable. With MAR data, though, the incomplete cases still provide information because the variables that are completely observed across the data set are related to the probability of missing a particular variable.

21.4.1.1 Complete Case Analysis Example

Table 21.1 presents a subset of studies from a meta-analysis examining the effects of oral anticoagulant therapy for patients with coronary artery disease (Anand and Yusuf 1999). To illustrate the use of missing data methods with MCAR and MAR data, I used two methods for deleting the age of the study in the total data set. For MCAR data, I randomly deleted twelve values (35 percent) of study age from the data. For MAR data, I randomly deleted studies depending on the value of the intensity of the dose of anticoagulant. Studies using a high dose of anticoagulant had a ten in twenty-two (45 percent) chance of missing a value. Studies with a moderate dose of anticoagulants had a one in nine (11 percent) chance of missing a value, and studies with a low dose had a one in three (33 percent) chance of missing a value of study age.

Table 21.1 Oral Anticoagulant Therapy Data

ID       Odds    Log-Odds   Variance of       Intensity   Year of       Study
         Ratio   Ratio      Log-Odds Ratio    of Dose     Publication   Age
 1.00    20.49    3.02      2.21              High        1960          39.00²
 2.00     0.16   -1.84      0.80              High        1960          39.00²
 3.00     1.27    0.24      0.16              High        1961          38.00¹
 4.00     0.86   -0.15      0.07              High        1961          38.00²
 5.00     0.62   -0.47      0.07              High        1964          35.00²
 6.00     0.68   -0.38      0.28              High        1964          35.00¹
 7.00     0.68   -0.38      0.18              High        1966          33.00²
 8.00     0.62   -0.47      0.22              High        1967          32.00¹
 9.00     0.78   -0.25      0.07              High        1967          32.00²
10.00     1.25    0.22      0.10              High        1969          30.00¹,²
11.00     0.43   -0.84      0.08              High        1969          30.00¹,²
12.00     0.71   -0.35      0.04              High        1980          19.00²
13.00     0.72   -0.33      0.02              High        1990           9.00
14.00     0.89   -0.12      0.01              High        1994           5.00
15.00     0.16   -1.81      0.82              High        1969          30.00¹
16.00     0.65   -0.43      0.02              High        1974          25.00
17.00     1.20    0.18      0.06              High        1980          19.00
18.00     1.48    0.39      0.07              High        1980          19.00
19.00     0.41   -0.90      0.41              High        1993           6.00
20.00     0.52   -0.65      0.29              High        1996           3.00²
21.00     4.15    1.42      2.74              High        1990           9.00
22.00     0.87   -0.14      4.06              High        1990           9.00¹
23.00     1.04    0.04      1.36              Moderate    1982          17.00¹
24.00     0.71   -0.35      0.35              Moderate    1981          18.00¹,²
25.00     0.92   -0.08      0.03              Moderate    1982          17.00
26.00     0.94   -0.06      0.03              Moderate    1969          30.00
27.00     0.65   -0.43      0.07              Moderate    1964          35.00
28.00     0.31   -1.16      0.93              Moderate    1986          13.00
29.00     0.47   -0.75      0.98              Moderate    1982          17.00¹
30.00     0.45   -0.81      0.60              Moderate    1998           1.00
31.00     1.04    0.04      0.82              Moderate    1994           5.00¹
32.00     0.70   -0.35      0.70              Low         1998           1.00
33.00     0.71   -0.34      0.06              Low         1997           2.00²
34.00     1.17    0.15      0.02              Low         1997           2.00¹

Source: Author's compilation.
¹ Value deleted for MCAR data example.
² Value deleted for MAR data example.

Table 21.2 compares the means and standard deviations of the log-odds ratio, the intensity of the dose, and the age of the study when using complete case analysis on MCAR or MAR data. The MCAR estimates for the two predictors, intensity of dose and age of study, are closer to the true mean than the MAR estimates. This finding is reversed for the estimates of the mean log-odds ratio, where the MAR estimates are closer to the true values. The standard deviation of the MCAR estimate of the log-odds ratio mean is larger than the MAR estimate, resulting in a more conservative estimate. Table 21.3 compares the correlations between these three variables using complete case analysis. The MAR estimates for the correlation of the log-odds ratio with the two predictor variables are closer to the true values than the MCAR estimates. For the correlation of the intensity of dose with study age, the MCAR and MAR estimates are similar to each other and smaller than the true value.

Table 21.2 Descriptive Statistics, Complete Case Analysis

Variable             N     Mean      SD
Log-odds ratio
  Full data         34   -0.239   0.826
  MCAR data         22   -0.185   0.938
  MAR data          22   -0.262   0.634
Intensity of dose
  Full data         34    2.559   0.660
  MCAR data         22    2.591   0.666
  MAR data          22    2.454   0.671
Study age
  Full data         34   20.353  13.112
  MCAR data         22   19.500  13.780
  MAR data          22   17.000  12.107

Source: Author's compilation.

Table 21.3 Correlations, Complete Case Analysis

Correlation   Intensity of Dose &   Study Age &       Study Age &
              Log-Odds Ratio        Log-Odds Ratio    Intensity of Dose
Full data         0.058                 0.049             0.498
MCAR data         0.172                 0.158             0.381
MAR data          0.052                 0.121             0.369

Source: Author's compilation.
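To make the example concrete in code, the sketch below reproduces the deletion scheme just described, on synthetic stand-ins for the three variables rather than the actual Table 21.1 values, and reports the complete-case estimates. The column names and distributions are invented for the illustration.

```python
# Sketch of the complete-case comparison above: delete study age under an
# MCAR and an MAR scheme, then estimate with listwise deletion.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
dose = rng.choice([1, 2, 3], size=34, p=[3/34, 9/34, 22/34])  # low, moderate, high
age = np.clip(rng.normal(8 * dose, 8, size=34), 1, None).round()
log_or = rng.normal(-0.24, 0.8, size=34)
full = pd.DataFrame({"log_or": log_or, "dose": dose, "age": age})

mcar = full.copy()           # 12 of 34 ages deleted completely at random
mcar.loc[rng.choice(34, size=12, replace=False), "age"] = np.nan

mar = full.copy()            # deletion probability depends on observed dose
p_delete = full["dose"].map({3: 10/22, 2: 1/9, 1: 1/3})
mar.loc[rng.random(34) < p_delete, "age"] = np.nan

for label, df in [("full", full), ("MCAR", mcar), ("MAR", mar)]:
    cc = df.dropna()         # listwise deletion keeps complete cases only
    print(f"{label:5s} n={len(cc):2d} mean age={cc['age'].mean():6.2f} "
          f"mean log-OR={cc['log_or'].mean():6.3f}")
```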
21.4.2 Available Case Analysis

An available case analysis, also called pairwise deletion, estimates parameters using as much data as possible. In other words, if three variables in a data set are missing 0 percent, 10 percent, and 40 percent of their values, respectively, then the correlation between the first and third variables would be estimated using the 60 percent of cases that observe both variables, and that between the first and second variables would use the 90 percent. An entirely different set of cases would provide the estimate of the correlation of the second and third variables, given the potential of having between 40 percent and 50 percent of these cases missing both variables.

This simple example illustrates the drawback of using available case analysis: each correlation in the variance-covariance matrix estimated using available cases could be based on a different subset of the original data set. If data are MCAR, these subsets are individually representative of the original data, and available case analysis provides unbiased estimates. If data are MAR, however, these subsets are not individually representative of the original data and will produce biased estimates.

Much of the early research on methods for missing data focuses on the performance of available case analysis versus complete case analysis (for example, Glasser 1964; Haitovsky 1968; Kim and Curry 1977). Kyle Fahrbach examined the research on available case analysis and concluded that such methods provide more efficient estimators than complete case analysis when correlations between two independent variables are moderate, that is, around 0.6 (2001). This view, however, is not shared by all who have examined this literature (for example, Allison 2002).

One statistical problem that could arise from the use of available cases under any form of missing data is a non-positive definite variance-covariance matrix, that is, a variance-covariance matrix that cannot be inverted to obtain the estimates of slopes for a regression model. One reason for this problem is that different subsets of studies are used to compute the elements of the variance-covariance matrix. Further, Paul Allison pointed out that a more difficult problem in the application of available case analysis concerns the computation of standard errors of available case estimates (2002). At issue is the correct sample size when computing standard errors, given that each parameter could be estimated with a different subset of data. Some of the standard errors could be based on the whole data set, and others on the subset of studies that observe a particular variable or pair of variables. Numerous statistical computing packages calculate available case analysis, but how standard errors are computed differs widely.

Although available case analysis is easy to understand and implement, consensus in the literature is scant about the conditions where available case analysis outperforms complete case analysis when data are MCAR. Previous research on available case analysis does point to the size of the correlations between variables in the data set as critical to the efficiency of this strategy, but this same research does not reach consensus about the optimal size of these correlations.
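The subset problem is easy to see in code. In the sketch below, pandas is used because its DataFrame.corr and DataFrame.cov compute each entry from the pairwise-complete observations by default, which is exactly available case analysis; the eigenvalue check at the end shows one way to detect a matrix that is not positive definite. The data are synthetic.

```python
# Available-case (pairwise) estimation: each correlation can rest on a
# different subset of cases, so the assembled matrix may not be invertible.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
data = rng.multivariate_normal(
    mean=[0, 0, 0],
    cov=[[1.0, 0.6, 0.5], [0.6, 1.0, 0.4], [0.5, 0.4, 1.0]],
    size=200)
df = pd.DataFrame(data, columns=["y", "x1", "x2"])
df.loc[rng.random(200) < 0.10, "x1"] = np.nan   # 10 percent missing
df.loc[rng.random(200) < 0.40, "x2"] = np.nan   # 40 percent missing

r_pairwise = df.corr()            # pairwise-complete observations (default)
r_listwise = df.dropna().corr()   # complete cases only, for comparison

print(r_pairwise.round(3))
eigenvalues = np.linalg.eigvalsh(r_pairwise.to_numpy())
print("smallest eigenvalue:", round(float(eigenvalues.min()), 4))
# A negative eigenvalue would flag a non-positive definite matrix, the
# failure mode described in the text above.
```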
21.4.2.1 Available Case Analysis Example

Table 21.4 provides the means and standard deviations using available case analysis; table 21.5 provides the correlations. The entries in the two are a mix of the complete case values for study age and the entire data set of values for the log-odds ratio and the intensity of dose. When the data are MAR, both the mean and the standard deviation of study age are smaller than the true values even though fewer cases are used for these estimates. The estimate of the mean of study age from the MCAR data, however, is close to the true value. The standard deviation of study age is slightly larger, reflecting the smaller sample size. The correlations in table 21.5 are a mixture of the true values and the complete case values from table 21.3. The correlations involving study age are the same as in table 21.3, and the correlation between the log-odds ratio and intensity of dose is the true value for all three analyses. If we added a second variable with missing observations, we could have correlations based on several subsets of the data set.

Table 21.4 Descriptive Statistics, Available Case Analysis

Variable             N     Mean      SD
Log-odds ratio
  Full data         34   -0.239   0.826
  MCAR data         34   -0.239   0.826
  MAR data          34   -0.239   0.826
Intensity of dose
  Full data         34    2.559   0.660
  MCAR data         34    2.559   0.660
  MAR data          34    2.559   0.660
Study age
  Full data         34   20.353  13.112
  MCAR data         22   19.500  13.780
  MAR data          22   17.000  12.107

Source: Author's compilation.

Table 21.5 Correlations, Available Case Analysis

Correlation   Intensity of Dose &   Study Age &        Study Age &
              Log-Odds Ratio        Log-Odds Ratio     Intensity of Dose
Full data     0.058 (N = 34)        0.049 (N = 34)     0.498 (N = 34)
MCAR data     0.058 (N = 34)        0.158 (N = 22)     0.381 (N = 22)
MAR data      0.058 (N = 34)        0.121 (N = 22)     0.369 (N = 22)

Source: Author's compilation.

21.4.3 Single-Value Imputation with the Complete Case Mean

When values are missing in a meta-analysis (or any statistical analysis), many researchers replace the missing value with a reasonable value, such as the mean for the cases that observed the variable. Little and Rubin referred to this strategy as single-value imputation (1987). Researchers commonly use two different strategies to fill in missing values. One method fills in the complete case mean, and the other uses regression with the complete cases to estimate predicted values for missing observations given the observed values in a particular case.
21.4.3.1 Imputing the Complete Case Mean

Replacing the missing values in a variable with the complete case mean of the variable is also referred to as unconditional mean imputation. When we substitute a single value for all the missing values, the estimate of the variance of that variable is decreased. The estimated variance thus does not reflect the true uncertainty in the variable. Instead, the smaller variance wrongly indicates more certainty about the value. These biases are compounded when the biased variances are used to estimate models of effect size. There are no circumstances under which imputation of the complete case mean leads to unbiased estimates of the variances of variables with missing data.

21.4.3.1.1 Complete Case Mean Imputation Example

Tables 21.6 and 21.7 present the results using single-value imputation with the complete case mean. The means for study age under MCAR and MAR data are the same as in the complete case analysis, but the standard deviations are much smaller than the true values. The estimated correlation between study age and intensity of dose is also more biased than the estimates from listwise or pairwise deletion. Because we are missing more of the values for high doses, the correlation between intensity of dose and study age is decreased when replacing missing values of study age with the mean study age.

Table 21.6 Descriptive Statistics, Mean Imputation

Variable             N     Mean      SD
Log-odds ratio
  Full data         34   -0.239   0.826
  MCAR data         34   -0.239   0.826
  MAR data          34   -0.239   0.826
Intensity of dose
  Full data         34    2.559   0.660
  MCAR data         34    2.559   0.660
  MAR data          34    2.559   0.660
Study age
  Full data         34   20.353  13.112
  MCAR data         34   19.500  10.992
  MAR data          34   17.000   9.658

Source: Author's compilation.

Table 21.7 Correlations, Mean Imputation

Correlation   Intensity of Dose and   Study Age and     Intensity of Dose
              Log-Odds Ratio          Log-Odds Ratio    and Study Age
Full data         0.058                   0.049             0.498
MCAR data         0.058                   0.144             0.307
MAR data          0.058                   0.074             0.299

Source: Author's compilation.
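The variance shrinkage is mechanical and easy to demonstrate. The sketch below, on synthetic study ages, fills twelve missing values with the complete-case mean and compares standard deviations; pandas' fillna is assumed, and the numbers are invented.

```python
# Unconditional mean imputation shrinks the variance: the filled-in points
# sit exactly at the mean and contribute no spread, as in table 21.6.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
age_full = pd.Series(rng.normal(20, 13, size=34).round())
age_missing = age_full.copy()
age_missing[rng.choice(34, size=12, replace=False)] = np.nan

age_imputed = age_missing.fillna(age_missing.mean())
print("SD, full data:      ", round(age_full.std(), 2))
print("SD, complete cases: ", round(age_missing.std(), 2))
print("SD, mean-imputed:   ", round(age_imputed.std(), 2))  # clearly too small
```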
21.4.3.2 Single-Value Imputation with Conditional Means

A single-value imputation method that provides less biased results with missing data was first suggested by S. F. Buck (1960). Instead of replacing each missing value with the complete case mean, each missing value is replaced with the predicted value from a regression model using the variables observed in that particular case as predictors and the missing variable as the outcome. This method is also referred to as conditional mean imputation or as regression imputation. For each pattern of missing data, the cases with complete data on the variables in the pattern are used to estimate regressions using the observed variables to predict the missing values. The result is that each missing value is replaced by a predicted value from a regression using the values of the observed variables in that case. When data are MCAR, each of the subsets used to estimate prediction equations is representative of the original sample. This method results in more variation than unconditional mean imputation because the missing values are replaced with values that depend on the regression equation. However, the standard errors using Buck's method are still too small because the missing values are replaced with predicted values that lie directly on the regression line used to impute the values. In other words, Buck's method yields values predicted exactly by the regression equation without error.

Little and Rubin presented the form of the bias for Buck's method and suggested corrections to the estimated variances to account for the bias (1987). If we have two variables, $Y_1$ and $Y_2$, and $Y_2$ has missing observations, then the form of the bias in the estimated variance when Buck's method is used to fill in values for $Y_2$ is given by

$(n - n^{(2)})(n - 1)^{-1}\,\sigma_{22 \cdot 1},$

where $n$ is the sample size, $n^{(2)}$ is the number of cases that observe $Y_2$, and $\sigma_{22 \cdot 1}$ is the residual variance from the regression of $Y_2$ on $Y_1$. Little and Rubin provide the more general form of the bias with more than two variables.

To date, there has not been extensive research comparing the performance of Buck's method with other more complex methods for missing data in meta-analysis. One advantage of the method with the corrections Little and Rubin suggested is that the standard errors of estimates reflect the uncertainty in the data, which simpler methods do not. Buck's method with corrections to the standard errors could thus lead to more conservative and less biased estimates than complete case and available case methods.

SPSS Missing Value Analysis offers regression imputation as an option for filling in missing observations (Hill 1997). The program does not implement Little and Rubin's 1987 corrections to obtain unbiased variance estimates. Instead, the researcher has the option to add a random variate to the imputed value, thus increasing the variance among imputed observations. An analysis adding a random variate to the imputed value may be an option to consider in a sensitivity analysis of the effects of missing data on a review's results.

21.4.3.2.1 Regression Imputation Example

Tables 21.8 and 21.9 present the results using single-value imputation with the conditional mean. In this analysis, estimates come directly from a regression imputation with no random variate added. Standard errors are thus not adjusted and in turn underestimate the variation. In both the MCAR and MAR cases, the correlation between the intensity of the dose and study age is closer than in other analyses to the true value. The estimate of the mean study age from MAR data is only slightly larger than in earlier analyses.

Table 21.8 Descriptive Statistics, Regression Imputation

Variable             N     Mean      SD
Log-odds ratio
  Full data         34   -0.239   0.826
  MCAR data         34   -0.239   0.826
  MAR data          34   -0.239   0.826
Intensity of dose
  Full data         34    2.559   0.660
  MCAR data         34    2.559   0.660
  MAR data          34    2.559   0.660
Study age
  Full data         34   20.353  13.112
  MCAR data         34   19.181  11.359
  MAR data          34   17.648  10.129

Source: Author's compilation.

Table 21.9 Correlations, Regression Imputation

Correlation   Intensity of Dose and   Study Age and     Intensity of Dose
              Log-Odds Ratio          Log-Odds Ratio    and Study Age
Full data         0.058                   0.049             0.498
MCAR data         0.058                   0.128             0.445
MAR data          0.058                   0.193             0.430

Source: Author's compilation.
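A bare-bones version of Buck-style imputation is sketched below, with the SPSS-style option of adding a random residual draw for sensitivity analysis. It assumes, for brevity, a single incomplete variable whose predictors are fully observed in the cases to be filled in; the function and its arguments are invented for this illustration and do not implement Little and Rubin's variance corrections.

```python
# Conditional mean (regression) imputation in the spirit of Buck (1960):
# fit a regression in the complete cases, then fill each missing value
# with its predicted value, optionally plus a random residual.
import numpy as np
import pandas as pd

def regression_impute(df, target, predictors, add_noise=False, seed=None):
    rng = np.random.default_rng(seed)
    cc = df.dropna(subset=[target] + predictors)            # complete cases
    X = np.column_stack([np.ones(len(cc)), cc[predictors].to_numpy()])
    beta, *_ = np.linalg.lstsq(X, cc[target].to_numpy(), rcond=None)
    resid_sd = np.std(cc[target].to_numpy() - X @ beta)

    out = df.copy()
    miss = out[target].isna()
    Xm = np.column_stack([np.ones(miss.sum()),
                          out.loc[miss, predictors].to_numpy()])
    filled = Xm @ beta
    if add_noise:                # the sensitivity-analysis variant noted above
        filled = filled + rng.normal(0.0, resid_sd, size=miss.sum())
    out.loc[miss, target] = filled
    return out
```

Without the residual draw, the filled-in points lie exactly on the regression line, which is why the variances in table 21.8 remain too small.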
21.4.4 Missing Effect Sizes

Missing effect sizes pose a more difficult problem in meta-analysis. When any outcome is missing for a case in a statistical analysis, that case can add little information to the estimation of the model. For example, in a regression analysis, a case missing the outcome variable does not provide information about how the predictor variables are related to the outcome. Even when the assumption of MCAR holds for effect sizes, the studies that do not report on the outcome contribute little to estimation procedures. With MCAR missing effect sizes, complete case analysis will provide the most unbiased results. When outcomes are missing, pairwise deletion does not provide any more information than listwise deletion because the relationship of interest, that between the outcome and the predictors, can only be estimated in the cases that observe an effect size.

In the case of MAR and NMAR missing effect sizes, the only other option available is single-value imputation. Many meta-analysts have suggested replacing missing effect sizes with zero. The rationale is that because missing effect sizes are likely to be statistically nonsignificant, using zero as a substitute provides only a conservative estimate of the overall mean effect size. As noted in the earlier discussion of single-value imputation, however, replacing missing effect sizes with zero results in estimates of variance that are too small and tests of homogeneity that also do not reflect the true variation among effect sizes.

Using regression imputation would seem to provide somewhat better estimates. However, because the missing effect sizes are replaced with predictions from a regression of effect sizes on observed variables, the imputed effect sizes will be consistent with the relationships between the effect sizes and predictors that exist in the complete cases. In other words, the imputed effect sizes will fit exactly the linear model of effect size that is estimated by the complete cases. Any analyses that subsequently use the regression-imputed effect sizes will not adequately reflect that missing data cause more uncertainty than the standard errors and fit of the effect size model indicate. Any single imputation method for missing effect sizes will thus make the models appear to have a better fit to the data than actually exists. When effect sizes are MAR or NMAR, none of the simple methods will provide adequate estimates of effect size models.

If a synthesist cannot derive estimates of effect size from other information reported in a study, even the methods described later in this chapter may not provide adequate estimates for MAR or NMAR effect sizes. Unfortunately, missing effect sizes are likely NMAR, missing because their values do not reach statistical significance. In this case, methods for censored data such as those James Heckman (1976) and Takeshi Amemiya (1984) described could be used to estimate effect size models. The details of these analyses are beyond the scope of this text, and they require specialized estimation techniques.

The reviewer faced with missing effect sizes has few options. Dropping the cases with missing effect sizes will likely bias the estimated mean effect size upward, given the likelihood that the missing effect sizes are nonsignificant. Replacing missing effect sizes with a value of zero may seem a sensible solution, but this strategy will tend to underestimate the variance among effect sizes. Given that missing effect sizes tend to be small, reviewers may wish to conduct sensitivity analyses similar to those described for publication bias (see Hedges and Vevea 1996; Vevea and Woods 2005; chapter 23, this volume).

21.4.5 Summary of Simple Methods for Missing Data

When missing predictors are MCAR, complete case analysis, available case analysis, and conditional mean imputation can produce unbiased results. The cost of using these methods lies in the estimation of standard errors. For complete case analysis, the standard errors will be larger than those from the hypothetically complete data. In available case analysis and conditional mean imputation, the standard errors will be too small, though those from conditional mean imputation can be adjusted. When data are MAR or NMAR, none of the simpler methods produce unbiased results.
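Before turning to the model-based methods, here is a sketch of the simple kind of sensitivity analysis suggested in section 21.4.4: compare the pooled estimate when studies with unreported effect sizes are dropped against the estimate when their effect sizes are set to zero. The numbers are invented, and a fixed-effect inverse-variance mean stands in for whatever model the synthesist actually plans to fit.

```python
# Sensitivity of the pooled mean to two crude treatments of missing
# effect sizes: drop the studies, or impute zero for them.
import numpy as np

es = np.array([-0.50, -0.30, -0.40, 0.10, np.nan, np.nan])  # two unreported
var = np.array([0.04, 0.02, 0.05, 0.03, 0.06, 0.06])        # known variances

def pooled_mean(effects, variances):
    """Fixed-effect (inverse-variance weighted) mean."""
    w = 1.0 / variances
    return float((w * effects).sum() / w.sum())

observed = ~np.isnan(es)
print("complete cases only:", round(pooled_mean(es[observed], var[observed]), 3))

es_zero = np.where(np.isnan(es), 0.0, es)                   # zero-imputation
print("zero-imputed:       ", round(pooled_mean(es_zero, var), 3))
```

If the two answers differ enough to change the review's conclusions, the missing outcomes matter, and the stronger methods below (or the publication bias adjustments of chapter 23) are worth their added assumptions.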
21.5 MODEL-BASED METHODS FOR MISSING DATA

When data are MCAR, we have a number of simple options for analysis. When data are MAR, however, these simple methods do not produce unbiased estimates. The simple methods share a characteristic: with the exception of regression imputation, they do not make any assumptions about the distribution of the data, and thus do not use the information available in the cases with missing values. For example, in complete case analysis, all of the cases with missing values are deleted. Available case analysis uses pairs of variables to compute correlations, but does not use any information about the multivariate relationships in the data. Little and Rubin described two methods based on a model for the data (1987). Each of these strategies assumes a form for the conditional distribution of the missing data given the observed data. These methods are maximum likelihood using the EM (expectation-maximization) algorithm and multiple imputation. These two methods are standard missing data methods in statistical analysis. The next section provides a brief overview of these two methods and describes how the methods can be applied to missing data in meta-analysis.

21.5.1 Maximum Likelihood Methods Using the EM Algorithm

A full explanation of maximum likelihood methods with missing data is not possible in this chapter (for more information, see Little and Rubin 1987; Schafer 1997). Simply put, maximum likelihood methods for missing data follow the general rules for computing maximum likelihood estimators: given the data likelihood, estimates for target parameters are found that maximize the value of that likelihood. Using maximum likelihood methods for missing data yields estimates for the means and variance-covariance matrix of all the variables in the data set. If we assume that the probability of missing observations does not depend on the missing values, as in MCAR or MAR data (also called an ignorable response mechanism), then the likelihood of the hypothetically complete data can be factored into two likelihoods: one for the observed data, and one for the missing data conditional on the observed data. Maximizing this latter likelihood is difficult. Unlike the mean of a normal distribution, the maximum likelihood estimates with missing data do not have a closed form solution. That is, we cannot specify an equation for the estimates of important parameters that maximize the likelihood. Instead, the solution must be found using numerical techniques, such as the Newton-Raphson method, to find the maximum of the likelihood, or using the EM algorithm (Dempster, Laird, and Rubin 1977).

The EM algorithm starts with a data likelihood for the hypothetically complete data. For many situations with missing data, we assume these data are distributed as a multivariate normal distribution. The normal distribution has many properties that simplify the estimation procedures. In a multivariate normal distribution, the maximum likelihood estimates are computed using the sums of squares and the sums of squared cross products of the variables, the sufficient statistics for the maximum likelihood estimators. The algorithm starts with guesses for the means and covariance matrix of the multivariate normal distribution. The E-step provides estimates of the values of the sufficient statistics given the current guesses for the means and covariance matrix. These estimates are derived from the fact that the marginal distributions of a multivariate normal are also normally distributed. For each pattern of missing data, we can find a conditional mean and variance matrix given the observed variables in that case. The conditional means and variances are computed from regressions of the variables with missing values on the completely observed variables. These conditional means and variances are then used to compute the sums and sums of cross products of the variables in the data. Once these sums are computed, they are used in the M-step as if they were the real values from the hypothetically complete data, and the algorithm computes new values of the means and covariance matrix. The algorithm goes back and forth between these two steps until the estimates do not change much from one iteration to the next. Once a researcher has maximum likelihood estimates, he or she can use the means and covariance matrix to fit models to the data such as a linear regression. When using the EM algorithm, the researcher does not obtain individual estimates of missing observations, but instead maximum likelihood estimates of important data parameters, the means and variance-covariance matrix. These can then be used to estimate the coefficients of a regression model.

Several commonly used computing packages implement EM. These include the SPSS Missing Value Analysis module (Hill 1997). One remaining drawback of maximum likelihood methods for missing data is the computation of standard errors. SPSS Missing Value Analysis does not provide standard errors as an option. Instead, the analyst needs to use numerical methods for finding the second derivative of a function, and compute these standard errors by hand.
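The two steps are concrete enough to sketch for the simplest case: a bivariate normal in which one variable is completely observed and the other has missing values under an ignorable mechanism. The implementation below is a minimal illustration of the E- and M-steps just described, not production code; with a single incomplete variable the same answer could also be obtained in closed form.

```python
# Minimal EM for a bivariate normal: x fully observed, y partly missing.
# E-step: expected sufficient statistics for the missing y's given x.
# M-step: re-estimate means, variances, and the covariance.
import numpy as np

def em_bivariate_normal(x, y, max_iter=500, tol=1e-10):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    miss = np.isnan(y)

    mu_x, s_xx = x.mean(), x.var()              # fixed: x is complete
    mu_y, s_yy = np.nanmean(y), np.nanvar(y)    # starting guesses for y
    s_xy = np.cov(x[~miss], y[~miss], bias=True)[0, 1]

    for _ in range(max_iter):
        # E-step: E[y_i | x_i] and E[y_i^2 | x_i] for the missing cases
        beta = s_xy / s_xx
        resid_var = s_yy - beta * s_xy          # Var(y | x)
        ey = np.where(miss, mu_y + beta * (x - mu_x), y)
        ey2 = np.where(miss, ey ** 2 + resid_var, y ** 2)

        # M-step: maximum likelihood updates from the completed sums
        mu_y_new = ey.mean()
        s_yy_new = ey2.mean() - mu_y_new ** 2
        s_xy_new = (x * ey).mean() - mu_x * mu_y_new

        converged = abs(mu_y_new - mu_y) < tol
        mu_y, s_yy, s_xy = mu_y_new, s_yy_new, s_xy_new
        if converged:
            break
    return mu_y, s_yy, s_xy
```

Note that the output is the parameter estimates themselves, not filled-in values, matching the point above that EM does not impute individual observations.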
21.5.1.1 Maximum Likelihood Methods

As described, obtaining maximum likelihood estimates with missing data requires two important assumptions: that the missing data are either MCAR or MAR (also referred to as ignorable response mechanisms), and that the hypothetically complete data have a particular distribution, such as the multivariate normal. Though we may not have direct evidence about the missing data mechanism, we can use the methods discussed here to assess the sensitivity of results to differing assumptions about the missing data mechanism. Roderick Little provided a test of MCAR data that is implemented in SPSS Missing Values Analysis (Little 1988; Hill 1997). This test may provide a synthesist with evidence that data are MCAR, but it does not provide evidence about any other form of missing data.

For any analysis with missing data, the decision to assume the data are MAR is a logical, not an empirical, one, and is based on the synthesist's understanding of and experience in the field of study (Schafer 1997). We can never have direct information about why data are missing in a meta-analysis. We can, however, increase our odds that the MAR assumption holds by gathering a rich data set with several completely observed study descriptor variables. The more information we have in a data set, the better the model of the conditional distribution of the missing observations, given the completely observed variables. Thorough and complete coding of a study may provide descriptor variables that are related to the missing values, and thus can improve the estimation of the means and variance-covariance matrix when using the EM algorithm. In the earlier Hahn et al. (2000) example, if the synthesist coded the age of the pregnant woman, this information, which may or may not be correlated with whether the woman had previous pregnancies, can improve the maximum likelihood estimates by providing more information about how variables relate in the data.
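Little's (1988) test compares observed means across missing data patterns in a single chi-square statistic. Short of implementing it, a synthesist can run a cruder diagnostic in the same spirit, sketched below: compare a completely observed variable between studies that do and do not report the incomplete one. A clear difference argues against MCAR, though a null result cannot confirm it. The data here are invented.

```python
# Crude MCAR diagnostic (not Little's test): do studies missing `age`
# differ on the completely observed `dose` from studies reporting it?
import numpy as np
from scipy import stats

dose = np.array([3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 1, 1], dtype=float)
age = np.array([39, 38, np.nan, 35, np.nan, 33, 32, np.nan,
                17, 18, np.nan, 30, 1, 2], dtype=float)

missing = np.isnan(age)
t_stat, p_value = stats.ttest_ind(dose[missing], dose[~missing])
print(f"mean dose where age missing:  {dose[missing].mean():.2f}")
print(f"mean dose where age observed: {dose[~missing].mean():.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```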
The second assumption requires the analyst to assume a distribution for the hypothetically complete data in order to specify the form of the likelihood. For meta-analysis applications, one choice is the multivariate normal distribution. In this case, the joint distribution of the effect sizes and the study descriptors is considered multivariate normal. The advantage is that the conditional distribution of any study descriptor is univariate normal, conditional on the other variables in the data.

The assumption of multivariate normality can seem problematic in meta-analysis given that many study descriptor variables are categorical, such as the primary author's discipline of study or the publication status of a study. Schafer pointed out, however, that even with deviations from multivariate normality, the normal model is useful (1997). If particular variables in the data deviate from normality, they can usually be transformed to normalize the distribution. In addition, if some of the variables are categorical but completely observed, the multivariate normal model can be used if the analyst can make the assumption that within each of the subsets of data defined by the categorical variable, the data are multivariate normal, and that the parameters we want to estimate are relevant to the conditional distribution within each of the defined subsets. For example, say that in the Hahn et al. (2000) data we are interested in examining how the effectiveness of prophylactic malaria treatment varies as a function of the country where the study took place and whether the woman was pregnant for the first time. If the synthesis has complete information on the country where the study took place, we need to assume only that the variables collected are distributed as a multivariate normal within each country. In other words, the multivariate normal model will apply to missing data situations in meta-analysis where the categorical variables in the model are completely observed. We can assume that the missing variables have a normal distribution conditional on these categorical ones.

One difficulty in meta-analysis that does cause a problem with the multivariate normal assumption is the distribution of effect sizes. The variance of an estimated effect size depends on the sample size of the study, and thus varies across studies. In Therese Pigott's application of maximum likelihood to missing data in meta-analysis, the likelihood weighted all variables in the data by the variance of the study's effect size (2001). This strategy leads to the assumption that studies with small sample sizes not only have effect sizes with larger standard errors, but also have measured study descriptor variables with a precision inversely proportional to the study's sample size. Although the distribution of effect sizes depends on the sample sizes in the studies, synthesists do not assume that a study with a small sample will, for example, estimate the age of participants with less precision than one with a larger sample. Kyle Fahrbach suggested instead a form of the likelihood that includes estimating the variance component for the effect size, treating the study descriptor variables as fixed, and thus not needing to weight all variables by the variance of the effect size (2001). Unfortunately, Fahrbach's application of maximum likelihood is not implemented in any current software package. If we ignore the weighting of effect sizes, the EM results can be obtained through a number of packaged computer programs, such as SPSS Missing Value Analysis (SPSS 2007) or NORM, freeware available from Joseph Schafer (1999). More research is needed on the size of the bias that may result when using the EM algorithm for missing data in meta-analysis without applying the weights for effect sizes.
21.5.1.2 Example of Maximum Likelihood Methods

Tables 21.10 and 21.11 provide a comparison of the means, standard deviations, and correlations from all of the data with the estimates from maximum likelihood using the EM algorithm. No weighting was used to generate estimates. In this simple example, the estimates are similar to those from regression imputation but closer to the true values than regression imputation. Because EM does not provide standard errors for the estimates, we cannot test these estimates for statistical significance.

Table 21.10 Descriptive Statistics, Maximum Likelihood Methods

Variable            Mean      SD
Log-odds ratio
  Full data       -0.239   0.826
  MAR data        -0.239   0.813
Intensity of dose
  Full data        2.559   0.660
  MAR data         2.559   0.650
Study age
  Full data       20.353  13.112
  MAR data        17.649  11.885

Source: Author's compilation.

Table 21.11 Correlations, Maximum Likelihood Methods

Correlation   Intensity of Dose and   Study Age and     Intensity of Dose
              Log-Odds Ratio          Log-Odds Ratio    and Study Age
Full data         0.058                   0.049             0.498
MAR data          0.058                   0.162             0.361

Source: Author's compilation.

21.5.1.3 Other Applications of Maximum Likelihood Methods

Other researchers than Fahrbach (2001) have applied maximum likelihood methods for missing data to particular issues in meta-analysis. Meng-Jia Wu, for example, uses maximum likelihood methods with a monotone pattern of missing data to obtain estimates for a regression model of effect size (2006). In Wu's example, a number of studies present the results of a regression analysis examining the relationship between teacher quality and many other variables. The missing data issue arises because the studies differ in the number of variables in their models. When the missing data pattern is monotone (that is, one block of studies observes variables 1 and 2, another block observes variables 1 through 4, another observes variables 1 through 6, and so on), the maximum likelihood estimates are more easily computed than in general patterns. Wu demonstrates how to obtain a set of synthesized regression coefficients for a set of studies when not all studies observe the same study descriptors, using maximum likelihood with the EM algorithm.

Betsy Becker also uses the EM algorithm to obtain a synthesized correlation matrix across a set of studies (1994). This correlation matrix can then be used to estimate explanatory models such as structural equation models. Maximum likelihood methods with the EM algorithm can be applied in many situations where missing data occur, including in meta-analysis. The difficulty with applying these methods to research synthesis centers on posing the appropriate distribution for the multivariate data set. More research in this area, as well as work on accessible computer programs to compute results, should provide more options for reviewers struggling with missing data issues.

21.5.2 Multiple Imputation

A second method for missing data, also based on the distribution of the data, is multiple imputation. As the name implies, the method imputes values for the missing data. There are two major differences, however, between single-value imputation methods and multiple imputation. With multiple imputation, several possible values are estimated for each missing observation, resulting in several complete data sets. The imputed values are derived by simulating random draws from the probability distribution of the variables with missing observations using Markov chain Monte Carlo techniques. These distributions are based on an assumption of multivariate normality. Once we obtain a number of random draws, the complete data sets created from these draws can then be analyzed using the methods the researcher intended to use before discovering missing data. Thus, unlike maximum likelihood methods using the EM algorithm, each missing observation is filled in with multiple values. The analyst then has a number of complete data sets analyzed separately using standard methods. The result is a series of estimated results for each completed data set. These multiple values allow standard errors to be estimated using the variation among the series of results.

Like maximum likelihood with the EM algorithm, multiple imputation consists of two steps. The first involves generating possible values for the missing observations, and the second involves inference with multiply imputed estimates. The more difficult step is generating possible values for the missing observations. The goal is to obtain a random sample from the distribution of missing values given the observed values. When the data are MAR or MCAR, the distribution of missing values given the observed values does not depend on the values of the missing observations, as discussed earlier. The idea is to draw a random sample from the distribution of missing observations given the observed variables, a complex distribution even when we assume multivariate normality.

One method is data augmentation, first discussed by Martin Tanner and Wing Wong (1987). Space considerations do not allow a full explanation (for example, Rubin 1987, 1996; Schafer 1997). In sum, though, the logic is similar to that of the EM algorithm. If we know the means and covariance matrix of the hypothetically complete data, we can use properties of the multivariate normal distribution and Bayes' theorem to define the distribution of missing values, given the observed variables. Tanner and Wong provide an iterative method using an augmented distribution of the missing variables. Usually, the imputed values are randomly drawn from the probability distribution at a particular iteration number (for example, at the 100th, 200th, and 300th iterations beyond convergence). Schafer provided a number of practical guidelines for generating multiple imputations (1997).

Researchers typically draw five different values from the distribution of missing values given the observed variables, resulting in five completed data sets (for a rationale for the number of imputations needed, see Schafer 1997). After obtaining the series of completed data sets, the synthesist analyzes the data as originally planned and obtains a set of estimates for target parameters within each completed data set. For example, each set provides a value for the mean of a variable. These estimates are then combined to obtain a multiply imputed estimate and its standard error.

As Paul Allison pointed out, one problem with multiple imputation is that meta-analysts using the same data set will arrive at different values of the multiply imputed estimates because each synthesist will obtain different random draws in creating their completed data sets (2002). This flexibility, however, could also be an asset. Sophisticated users could pose a number of distributions for the missing data apart from the multivariate normal to test the sensitivity of results to assumptions about the response mechanism.
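The pooling step, Rubin's rules, is simple enough to sketch. Given the point estimate and its squared standard error from each of the m completed data sets, the combined estimate is the average, and the total variance adds the within- and between-imputation components. The helper below is a generic sketch; the numbers fed to it are invented stand-ins for five estimates of a mean.

```python
# Rubin's rules: combine estimates from m completed data sets.
import numpy as np

def rubin_pool(estimates, variances):
    """Pool m estimates with their within-imputation variances."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()                      # combined point estimate
    within = u.mean()                     # average within-imputation variance
    between = q.var(ddof=1)               # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, float(np.sqrt(total))   # estimate and its standard error

# e.g., a mean estimated in each of five completed data sets:
means = [19.8, 20.6, 19.2, 21.0, 20.1]
variances = [(13.1 ** 2) / 34] * 5        # squared SE of a mean, per data set
estimate, se = rubin_pool(means, variances)
print(f"multiply imputed estimate = {estimate:.2f}, SE = {se:.2f}")
```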
21.5.2.1 Multiple Imputation and Missing Data

In some respects, multiple imputation methods require less specialized knowledge than maximum likelihood. One advantage of multiple imputation is the ability to use the originally planned analysis on the data. Once the synthesist obtains the completed data sets, the analysis can proceed with standard techniques. Multiple imputation also provides standard errors for estimation that are easily computed and that accurately reflect the uncertainty in the data when observations are missing.

As in maximum likelihood methods, the likelihood for meta-analysis data is complicated by effect sizes not being identically distributed, but instead having variances that depend on the sample size in each study. Fahrbach's suggested likelihood distribution should be examined for use in multiple imputation methods as well (2001).

Multiple imputation is now more widely available in statistical computing packages. SAS, to take one example, provides multiple imputation procedures for multivariate normal data (Yuan 2000). The examples in this chapter were computed with Schafer's NORM program, freeware for conducting multiple imputation with multivariate normal data (1999).

21.5.2.2 Example of Multiple Imputation

Table 21.12 provides the original data set along with the values for each completed data set after random draws from the distribution of missing values. The imputations for missing study age range from 0 to 42. Table 21.13 provides the estimates of the means, standard error of the mean, and correlations for each imputed data set, and then for the multiply imputed estimates. The estimate for the mean study age is the same as in maximum likelihood using the EM algorithm. As Schafer discussed, the multiply imputed estimates will approach the maximum likelihood estimates as the number of imputations increases (1997). The standard error of the mean of study age is 2.67 and is larger than the true value, 2.25, correctly reflecting the added uncertainty in the estimate due to missing values.

21.6 RECOMMENDATIONS

Missing data will occur in meta-analysis whenever a range of studies and researchers focus on a given area. Unlike primary researchers, research synthesists cannot avoid missing data by working hard to collect all data from all subjects; in this case, the subjects are studies inherently different in their methods and reporting quality. What recommendations can we make for practice?

As illustrated in this chapter, missing data methods have as a cost the necessity of making assumptions about the reasons for the missing data and the joint distribution of the complete data set. When a researcher has little missing data relative to the number of studies retrieved, the bias introduced by assuming MCAR and using complete cases will also be relatively small. With MCAR data, complete case analysis has the potential of providing unbiased results, and does not overestimate the precision of the estimates.

In meta-analysis, however, missing data across several variables could lead to fewer studies that observe all variables relative to the number of studies gathered for the synthesis. A large, rich meta-analysis data set with many study descriptors has more potential than a smaller one to support both the assumption of MAR data and that of multivariate normality. With MAR data, the options are using maximum likelihood methods with the EM algorithm or multiple imputation. The major benefit of these over more commonly used methods is that they are based on a model for the data and use as much information as possible from the data set.

In applying these methods to meta-analysis, each technique has its own set of difficulties. Maximum likelihood methods, for example, are now implemented in popular statistical software, but do not provide standard errors of the estimates and do not allow the weighting needed for effect size data.
414 DATA INTERPRETATION

Table 21.12 Oral Anticoagulant Therapy Data

Log-odds Intensity
ID ratio of dose Study age Imputations

1.00 3.02 High 39.001 21 17 2 21 6


2.00 1.84 High 39.001 10 20 25 44 24
3.00 0.24 High 38.00
4.00 0.15 High 38.001 3 48 20 37 19
5.00 0.47 High 35.001 12 6 30 19 17
6.00 0.38 High 35.00
7.00 0.38 High 33.001 25 56 16 12 1
8.00 0.47 High 32.00
9.00 0.25 High 32.001 25 34 34 26 20
10.00 0.22 High 30.001 19 28 16 34 17
11.00 0.84 High 30.001 16 17 9 4 3
12.00 0.35 High 19.001 9 0 23 0 11
13.00 0.33 High 9.00
14.00 0.12 High 5.00
15.00 1.81 High 30.00
16.00 0.43 High 25.00
17.00 0.18 High 19.00
18.00 0.39 High 19.00
19.00 0.90 High 6.00
20.00 0.65 High 3.001 26 42 14 35 29
21.00 1.42 High 9.00
22.00 0.14 High 9.00
23.00 0.04 Moderate 17.00
24.00 0.35 Moderate 18.001 18 9 20 24 12
25.00 0.08 Moderate 17.00
26.00 0.06 Moderate 30.00
27.00 0.43 Moderate 35.00
28.00 1.16 Moderate 13.00
29.00 0.75 Moderate 17.00
30.00 0.81 Moderate 1.00
31.00 0.04 Moderate 5.00
32.00 0.35 Low 1.00
33.00 0.34 Low 2.001 5 23 5 0 13
34.00 0.15 Low 2.00

source: Authors compilation.


1Value deleted for MAR data example

Multiple imputation is the more flexible of the two op- program several times using regression imputation with
tions, but is not widely implemented in statistical pack- estimates augmented by a random deviate. Research is
ages. The SAS program has a procedure for both multiple needed on the behavior of this procedure compared with
imputation and for combing the results across completed the data augmentation procedures Schafer described.
data sets. Although the SPSS Missing Values program After the researcher obtains multiple imputations, the
claims that multiple imputation is possible, the program procedures to compute multiply imputed estimates are
does not implement the multiple imputation procedures straightforward.
as outlined by Donald Rubin (1987) and Joseph Schafer When applying any statistical method, with or without
(1997). Instead, the manual recommends running the missing data, the synthesist is obligated to discuss the
HANDLING MISSING DATA 415

Table 21.13 Correlations, Maximum Likelihood Computer. Journal of the Royal Statistical Society Series B
Methods 22(2): 3023.
Chan, An-Wen, Asbjrn Hrbjartsson, Mette T. Haahr, Peter
Study Age Correlations C. Gtzsche, and Douglas G. Altman. 2004. Empirical
Evidence for Selective Reporting of Outcomes in Rando-
Study Age Study Age
mized Trials. Journal of the American Medical Associa-
and and
tion 291(20): 245765.
SE Log-Odds Intensity
Dempster, Arthur P., Nan M. Laird, and Donald B. Rubin.
Mean (Mean) Ratio of Dose
1977. Maximum Likelihood from Incomplete Data via the
Imputation 1 16.56 1.84 0.01 0.34 EM Algorithm. Journal of the Royal Statistical Society Se-
Imputation 2 19.82 2.47 0.07 0.32 ries B 39(1): 138.
Imputation 3 17.29 1.91 0.26 0.36 Egger, Matthias, and George Davey. 1998. Meta-Analysis:
Imputation 4 18.53 2.25 0.10 0.40 Bias in Location and Selection of Studies. British Medical
Imputation 5 16.06 1.86 0.20 0.26 Journal 316(7124): 616.
Multiple 17.65 2.67 0.12 0.34 Fahrbach, Kyle R. 2001. An Investigation of Methods for
Imputation Mixed-Model Meta-Analysis in the Presence of Missing
Full data 20.35 2.25 0.05 0.50 Data. PhD diss. Michigan State University.
Glasser, Marvin. 1964. Linear Regression Analysis with
source: Authors compilation.
Missing Observations among the Independent Variables.
Journal of the American Statistical Association 59(307):
83444.
assumptions made in the analysis. In the case of methods
Hahn, Seokyung, Paula R. Williamson, Jane L. Hutton, Paul
for missing data, we cannot provide empirical evidence
Garner, and E. Victor Flynn. 2000. Assessing the Potential
about the truth of the assumptions. We can, however, fully
for Bias in Meta-Analysis Due to Selective Reporting of
disclose the assumptions of each analysis and compare
Subgroup Analyses within Studies. Statistics in Medicine
the results to examine the sensitivity of our findings.
19(24): 332536.
Haitovsky, Yoel. 1968. Missing Data in Regression Analy-
21.7 REFERENCES sis. Journal of the Royal Statistical Society Series B 30(1):
6782.
Allison, Paul D. 2002. Missing Data. Thousand Oaks, Calif.: Heckman, James. 1976. The Common Structure of Statisti-
Sage Publications. cal Models of Truncation, Sample Selection and Limited
American Psychological Association. 2005. Publication Ma- Dependent Variables, and a Simple Estimator for Such Mo-
nual of the American Psychological Association, 5th ed. dels. Annals of Economic and Social Measurement 5(4):
Washington, D.C.: APA. 47592.
Amemiya, Takeshi. 1984. Tobit Models: A Survey. Jour- Hedges, Larry V., and Jack L. Vevea. 1996. Estimating Ef-
nal of Econometrics 24(12): 361. fect Size under Publication Bias: Small Sample Properties
Anand, Sonia S. and Salim Yusuf. 1999. Oral Anticoagu- and Robustness of a Random Effects Selection Model.
lant Therapy In Patients With Coronary Artery Disease. Journal of Educational and Behavioral Statistics 21(4):
Journal of the American Medical Association 282(21): 299332.
205867. Hemminki, E. 1980. Study of Information Submitted by
Becker, Betsy J. 1992. Using Results From Replicated Stu- Drug Companies to Licensing Authorities. British Medi-
dies To Estimate Linear Models. Journal of Educational cal Journal 280(6217): 83336.
Statistics 17(3): 34162. Hill, MaryAnn. 1997. SPSS Missing Value Analysis 7.5.
Begg, Colin B., and Jesse A. Berlin. 1988. Publication Bias: Chicago: SPSS Inc.
A Problem in Interpreting Medical Data (with Discussion). Jni, Peter, Franziska Holenstein, Jonathon Sterne, Christo-
Journal of the Royal Statistical Society Series A 151(2): pher Bartlett, and Matthias Egger. 2002. Direction and
41963. Impact of Language Bias in Meta-Analyses of Controlled
Buck, S. F. 1960. A Method of Estimation of Missing Values trials: Empirical Study. International Journal of Epidemio-
in Multivariate Data Suitable for Use with an Electronic logy 31(1): 11523.
416 DATA INTERPRETATION

Kim, Jae-On, and James Curry. 1977. The Treatment of Schafer, Joseph L. 1999. NORM: Multiple Imputation Of
Missing Data in Multivariate Analysis. Sociological Incomplete Multivariate Data under a Normal Model, ver-
Methods and Research 6(2): 21540. sion 2. http://www.stat.psu.edu/~jls/misoftwa.html#win
Little, Roderick J. A. 1988. A Test of Missing Completely Schafer, Joseph L. and John W. Graham. 2002. Missing
at Random for Multivariate Data with Missing Values. Data: Our View of the State of the Art. Psychological
Journal of the American Statistical Association 83(404): Methods 7(2): 14777.
11981202. Shadish, William R., Leslie Robinson, and Congxiao Lu.
Little, Roderick J. A., and Donald B. Rubin. 1987. Statistical 1997. Effect Size Calculator. St. Paul, Minn.: Assessment
analysis with Missing Data. New York: John Wiley & Systems Corporation. http://www.assess.com.
Sons. Sirin, Selcuk R. 2005. Socioeconomic Status and Academic
Orwin, Robert G., and David S. Cordray. 1985. Effects Achievement: A Meta-Analytic Review of Research. Re-
of Deficient Reporting on Meta-Analysis: A Conceptual view of Educational Research 75(3): 41753.
Framework and Reanalysis. Psychological Bulletin 97(1): Smith, Mary L. 1980. Publication Bias and Meta-Analysis.
13447. Evaluation in Education 4: 2224.
Pigott, Therese D. 2001. Missing Predictors in Models of SPSS. 2007. SPSS Missing Value Analysis. http://www.spss
Effect Size. Evaluation & the Health Professions 24(3): .com/missing_value.
277307. Tanner, Martin A., and Wing H. Wong. 1987. The Calcula-
Rosenthal, Robert. 1979. The File Drawer Problem and tion of Posterior Distributions by Data Augmentation.
Tolerance for Null Results. Psychological Bulletin 86(3): Journal of the American Statistical Association 82(398):
63841. 52840.
Rothstein, Hannah R., Alexander J. Sutton, and Michael Vevea, Jack L and Carol M. Woods. 2005. Publication Bias
Borenstein. 2005. Publication Bias in Meta-Analysis: Pre- in Research Synthesis: Sensitivity Analysis Using A Priori
vention, Assessment and Adjustments. West Sussex: John Weight Functions. Psychological Methods 10(4): 42843.
Wiley & Sons. Willamson, Paula R., Carrol Gamble, Douglas G. Altman,
Rubin, Donald B. 1976. Inference and Missing Data. Bio- and J. L. Hutton. 2005. Outcome Selection Bias in Meta-
metrika 63(3): 58192. Analysis. Statistical Methods in Medical Research 14(5):
Rubin, Donald B. 1987. Multiple Imputation for Nonres- 51524.
ponse in Surveys. New York: John Wiley & Sons. Wu, Meng-Jia. 2006. Methods of Meta-analyzing Regression
Rubin, Donald B. 1996. Multiple Imputation after 18 Studies: Applications of Generalized Least Squares and Fac-
Years. Journal of the American Statistical Association tored Likelihoods. PhD diss., Michigan State University.
91(434): 47389. Yuan, Yang C. 2000. Multiple Imputation for Missing Data:
Schafer, Joseph L. 1997. Analysis of Incomplete Multiva- Concepts and New Development. http://support.sas.com/
riate Data. New York: Chapman & Hall. rnd/app/papers/multipleimputation.pdf
22
SENSITIVITY ANALYSIS AND DIAGNOSTICS

JOEL B. GREENHOUSE SATISH IYENGAR


Carnegie Mellon University University of Pittsburgh

CONTENTS
22.1 Introduction 418
22.2 Retrieval of Literature 419
22.3 Exploratory Data Analysis 420
22.3.1 Stem-and-Leaf Plot 420
22.3.2 Box Plots 421
22.4 Combining Effect Size Estimates 423
22.4.1 Fixed Effects 423
22.4.2 Random Effects 424
22.4.3 Bayesian Hierarchical Models 426
22.4.3.1 Specification of Prior Distributions 427
22.4.4 Other Methods 428
22.5 Publication Bias 428
22.5.1 Assessing the Presence of Publication Bias 428
22.5.2 Assessing the Impact of Publication Bias 430
22.5.3 Other Approaches 431
22.6 Summary 431
22.7 Appendix 431
22.8 References 431

417
418 DATA INTERPRETATION

22.1 INTRODUCTION conclusions. We view the process of asking these ques-


tions as part of sensitivity analysis because it encourages
At every step in a research synthesis, decisions are made the meta-analyst to reflect on decisions and to consider
that can affect the conclusions and inferences drawn from the consequences of those decisions in terms of the sub-
that analysis. Sometimes a decision is easy to defend, sequent interpretation and generalizablility of the results.
such as one to omit a poor quality study from the meta- Our focus in this chapter is primarily on quantitative
analysis based on prespecified exclusion criteria, or to use methods for sensitivity analysis. It is important, however,
a weighted average instead of an unweighted one to esti- to keep in mind that these early decisions about the nature
mate an effect size. At other times, the decision is less of the research question and identifying and retrieving the
convincing, such as that to use only the published litera- literature can have a profound impact on the usefulness of
ture, to omit a study with an unusual effect size estimate, the meta-analysis.
or to use a fixed effects instead of a random effects model. We will use a rather broad brush to discuss sensitivity
When the basis for a decision is tenuous, it is important to analysis. Because, as we have suggested, sensitivity anal-
check whether reasonable alternatives appreciably affect ysis is potentially applicable at every step of a meta-analy-
the conclusions. In other words, it is important to check sis, we touch on many topics covered in greater detail
how sensitive the conclusions are to the method of analy- elsewhere in this volume. Our aim here is to provide a uni-
sis or to changes in the data. Sensitivity analysis is a sys- fied approach to the assessment of the robustness, or co-
tematic approach to address the question, What happens gency, of the conclusions of a research synthesis. The key
if some aspect of the data or the analysis is changed? element to this approach is simply the recognition of the
When meta-analysis first came on the scene, it was importance of asking the what if question. Although a
greeted with considerable skepticism. Over time, with ad- meta-analysis need not concern itself with each issue dis-
vances in methodology and with the experience due to cussed here, our point of view, and the technical tools we
increased use, meta-analysis has become an indispens- describe in this chapter should be useful generally. Many
able research tool for synthesizing evidence about ques- of these tools come from the extensive statistical literature
tions of importance, not only in the social sciences but on exploratory data analysis. Some, though, have been de-
also in the biomedical and public health sciences (Gaver veloped largely in the context of meta-analysis itself, such
et al. 1992). Given the prominent role that meta-analysis as those for assessing the effects of publication bias.
now plays in informing decisions that have wide public To illustrate some of the quantitative approaches, we
health and public policy consequences, the need for sen- use two example data sets throughout this chapter. The
sitivity analyses to assess the robustness of the conclu- first is a subset of studies from a larger study of the ef-
sions of a meta-analysis has never been greater. fectiveness of open education programs (Hedges, Giaco-
A sensitivity analysis need not be quantitative. For in- nia, and Gage 1981). The objective was to examine the
stance, the first step in a meta-analysis is to formulate effects of open education on student outcomes by com-
the research question. Although a quantitative sensitivity paring students in experimental open classroom schools
analysis is not appropriate here, it is, nevertheless, useful with those from traditional schools. Specifically, in table
to entertain variations of the research question before 22.1, we look at the studies presented in table 8 of Larry
proceeding further, that is, to ask the what if question. Hedges and Ingram Olkin (1985, 25) consisting of the
Because different questions require different types of randomized experiments and other well-controlled stud-
data and data analytic methods, it is important to state ies of the effects of open education on self-concept. The
clearly the aims of the meta-analysis at the outset to col- table presents the students grade levels, the sample size
lect the relevant data and to explain why certain questions for each group, a weight for each study, and an unbiased
(and not others) were addressed. Or, at another step de- estimate, T, of the standardized mean difference for each
cisions are made concerning the identification and re- study. The second data set that we will use when we dis-
trieval of studies. Here the decision is which studies to cuss random effects models comes from table 7 of Hedges
include. Should only peer-reviewed, published studies be and Olkin (1985, 24) on the effects of open education on
included? What about conference proceedings or disser- student independence and self-reliance. These data are
tations? Should only randomized controlled studies be presented in table 22.2. Unbiased estimates of the stan-
considered? These are decisions that a meta-analyst must dardized mean differences are shown for the seven stud-
make early on that can affect the generalizability of the ies there, all of which have the same sample sizes. No
SENSITIVITY ANALYSIS AND DIAGNOSTICS 419

Table 22.1 Effects of Open Education on Student Self-Concept

Standardized
Open Traditional Mean
Education School Weights Difference
Study Grade nT nC wi T

1 46 100 180 0.115 0.100


2 46 131 138 0.120 0.162
3 46 40 40 0.036 0.090
4 46 40 40 0.036 0.049
5 46 97 47 0.057 0.046
6 K3 28 61 0.034 0.010
7 K3 60 55 0.051 0.431
8 46 72 102 0.076 0.261
9 46 87 45 0.053 0.134
10 K3 80 49 0.054 0.019
11 K3 79 55 0.058 0.175
12 46 40 109 0.052 0.056
13 46 36 93 0.046 0.045
14 K3 9 18 0.011 0.103
15 K3 14 16 0.013 0.121
16 46 21 22 0.019 0.482
17 46 133 124 0.115 0.290
18 K3 83 45 0.052 0.342

source: Authors compilation.

Table 22.2 Effects of Open Education on


for a research synthesis is not as straightforward as it
Student Independence and
might seem. Several questions arise at the outset of this
Self-Reliance
time-consuming yet essential component of a meta-anal-
ysis. Should a single person search the literature, or
Study nE  nC T should (time and resources permitting) at least two peo-
ple conduct independent searches? Should the search
1 30 0.699 concentrate on the published literature only (that is, from
2 30 0.091 peer-reviewed journals), or should it attempt to retrieve
3 30 0.058 all possible studies? Which time period should be cov-
4 30 0.079 ered by the search? The first two questions deal with the
5 30 0.235 concern that the studies missed by a search may arrive at
6 30 0.494 qualitatively different conclusions than those retrieved.
7 30 0.587
The third question is more subtle: it is possible that within
source: Authors compilation. a period of time studies yield homogeneous results, but
that different periods show considerable heterogeneity.
For instance, initial enthusiastically positive reports may
covariate information, such as grade level, was given in be followed by a sober period of null, or even negative
Hedges and Olkin (1985) for this data set. results. Geoffrey Borman and Jeffrey Grigg also discuss
the use of graphical displays to see how research results
22.2 RETRIEVAL OF LITERATURE in a variety of fields change with the passage of time
(chapter 26, this volume).
The discussions in part III of this volume clearly demon- Using more than one investigator to search the litera-
strate that retrieving papers that describe relevant studies ture is itself a type of sensitivity analysis. It is especially
420 DATA INTERPRETATION

useful when the central question in a research synthesis more formal statistical procedures and on the interpreta-
spans several disciplines and the identification of relevant tion of their results.
studies requires subject-matter expertise in these disci- The approach described, known as exploratory data
plines (see, for example, Greenhouse et al. 1990). It is analysis, is based on the premise that the more that is
intuitively clear that if the overlap between the searches is known about the data, the more effectively it can be used
large, we would be more confident that a thorough re- to develop, test, and refine theory. Methods should be
trieval had been done. (A formal approach for assessing relatively easy to do with or without a computer, quick to
the completeness of a literature search might involve use so that a data set may be explored from different
capture-recapture calculations of population size estima- points of view, and robust in that they are not adversely
tion; see, for example, Fienberg, Johnson, and Junker influenced by such misleading phenomena as extreme
1999). Clearly, the more comprehensive a search, the cases, measurement errors, or unique cases that need spe-
better, for then the meta-analysis can use more of the cial attention. Two methods discussed, the stem-and-leaf
available data. However, a thorough search may not be plot and the box plot, are graphical methods that satisfy
feasible because of various practical constraints. To de- these requirements. They are powerful tools for develop-
cide whether a search limited, for example, to the pub- ing insights and hypotheses about the data in a meta-
lished literature can lead to biased results, it is worth- analysis.
while to do a small search of the unpublished literature
(for example, dissertations, conference proceedings, and
abstracts) and compare the two (for a more complete dis- 22.3.1 Stem-and-Leaf Plot
cussion, see chapters 2 and 3 of this volume; Wachter and
In a stem-and-leaf plot, the data values are sorted into
Straf 1990, chapter 15).
numerical order and brought together quickly and effi-
When a substantial group of studies is retrieved, and
ciently in a graphic display. The stem-and-leaf plot uses
data from them gleaned, it is useful to investigate the re-
all of the data and illustrates the shape of a distribution,
lationship between study characteristics and study results.
that is, shows whether it is symmetric or skewed, how
Some of the questions that can be addressed using the
many peaks it has, and whether it has outliers (atypical
exploratory data analytic tools are discussed in section
extreme values) or gaps within the distribution. Figure
22.3. Do published and unpublished studies yield differ-
22.1 shows the stem-and-leaf plot for the distribution of
ent results? If the studies are rated according to quality,
the standardized mean differences, T, for the 18 studies
do the lower quality studies show greater variability or
given in table 22.1. Here, for ease of exposition, we have
systematic bias? When two disciplines touch on the same
rounded the values of the standardized mean differences
issue, how do the results of studies compare? When these
in the last column of table 22.1 to two significant digits.
issues arise early in the literature search, it is possible to
The first significant digit of T is designated the stem and
alter the focus of the retrieval process to ensure a more
the second digit the leaf. For example, the rounded stan-
comprehensive database.
dardized mean difference for study 2 is 0.16. Therefore,
the stem is 1 and the leaf is 6. The possible stem values
22.3 EXPLORATORY DATA ANALYSIS are listed vertically in increasing order from top (negative
stems) to bottom (positive stems) and a vertical line is
Once the literature has been searched and studies have
drawn to the right of the stems. The leaves are recorded
been retrieved, the next step is to analyze effect size. For-
directly from table 22.1 onto the lines corresponding to
mal statistical procedures for determining the significance
their stem value to the right of the vertical line. Within
of an effect size summary usually are based on assump-
each stem, the leaves should be arranged in increasing
tions about the probabilistic process generating the obser-
order away from the stem.
vations. Violations of these assumptions can have an im-
In figure 22.1 we can begin to see some important fea-
portant impact on the validity of the conclusions. Here
tures of the distribution of T.
we discuss relatively simple methods for investigating
features of a data set and the validity of the statistical as- 1. Typical values. It is not difficult to locate by eye the
sumptions. We consider these methods to be a part of a typical values or central location of the distribution
sensitivity analysis because they help reveal features of of T by finding the stem or stems where most of the
the data that could have a large impact on the choice of observations fall. For the distribution in figure 22.1,
SENSITIVITY ANALYSIS AND DIAGNOSTICS 421

4 38 leaf plot and the box plot, described in the next section,
3 are both useful exploratory data analytic techniques for
2 6 carrying out these aspects of a sensitivity analysis.
1 6
0 1449 22.3.2 Box Plots
0 145
1 00237 As noted earlier, major features of a distribution include
2 9 the central location of the data, how spread out the data
3 4 are, the shape of the distribution, and the presence of any
unusual observations, called outliers. The box plot is a
Figure 22.1 Stem-and-Leaf Plot of Standardized Mean Dif- visual display of numerical summaries, based in part on
ferences. sample percentiles, that highlights such major features of
source: Authors compilation. the data. The box plot provides a clear picture of where
note: For studies reported in table 22.1 (n  18; leaf unit  0.010). the middle of the distribution lies, how spread out the
middle is, the shape of the distribution, and identifies out-
liers in the distribution. A box plot differs from a stem-
we see that the center of the distribution seems to be and-leaf plot in several ways. The box plot is constructed
in the stems corresponding to values of T between from numerical summaries of the distribution and does
0.01 to 0.17. not simply display every observation as the stem-and-leaf
2. Shape. Does the distribution have one peak or sev- plot does. It provides a much clearer picture of the tails of
eral peaks? Is it approximately symmetric or is it the distribution and is more useful for identifying outliers
skewed in one direction? We see that the distribu- than the stem-and-leaf plot. Sample percentiles are used
tion in figure 22.1 may have two peaks and seems to to construct the box plot because percentiles, such as the
have a longer lower tail, that is, skewed toward neg- median (50th percentile), are robust measures of location.
ative values. The mean, by contrast, is not a robust measure of central
location because it is influenced by atypical observations,
3. Gaps and atypical values. The stem-and-leaf plot is
or outliers.
useful for identifying atypical observations, which
Figure 22.2 presents a box plot of the sample distribu-
are usually highlighted by gaps in the distribution.
tion of T. We describe the features of this box plot. The
Note, for example, the gap between study #8 with a
box itself is drawn from the first quartile (Q1  0.109)
T of 0.26 and the next two studies, study #7 with
to the third quartile (Q3  0.127). The plus sign inside the
a T of 0.43, and study #16 with a T of 0.49, sug-
box denotes the median (m  0.032). The difference or
gesting that these latter two studies might be un-
distance between the third quartile and the first quartile
usual observations.
(Q3 Q1  0.127 (0.109)  .236) is called the inter-
For generalizations and modifications of the stem-and- quartile range (IQR) and is a robust measure of variabil-
leaf plot for other sorts of data configurations, see Lam- ity. The IQR is the length of the box in figure 22.2. The
bert Koopmans (1987, 922). dotted lines going in either direction from the ends of the
At this stage of a meta-analysis, the stem-and-leaf plot box (that is, away from Q1 and Q1 respectively) denote
is useful in helping the data analyst become more familiar the tails of the distribution. The two asterisks in figure
with the sample distribution of the effect size. From the 22.2 denote outliers, 0.491 and 0.434, and are so de-
point of view of sensitivity analysis, we are particularly fined because they are located a distance more than one
interested in identifying unusual or outlying studies, de- and a half times the IQR from the nearest quartile. The
ciding whether the assumptions of more formal statistical dotted lines extend to an observed data point within one
procedures for making inferences about the population and a half times the IQR from each respective quartile to
effect size are satisfied, and investigating whether there establish regions far enough from the center of the distri-
are important differences in the sample distributions of bution to highlight observations that may be outliers (see
effect size for different values or levels of an explanatory Koopmans 1987, 5157). Note that the sample mean of T
variable. The latter is particularly interesting in furthering is 0.008 and is smaller than the median because of the
an understanding of how studies differ. The stem-and- inclusion of the two negative atypical values.
422 DATA INTERPRETATION

.5 .4 .3 .2 .1 .0 .1 .2 .3 .4 .5

Figure 22.2 Box Plot of Standardized Mean Differences.


source: Authors compilation.
note: For studies reported in table 22.1

GRADE
46 


K3

.5 .4 .3 .2 .1 .0 .1 .2 .3 .4 .5

Figure 22.3 Comparative Box Plots of Distribution of Standardized Mean Differences.


source: Authors compilation.
note: For studies of grades K3 versus grade 46

From the viewpoint of sensitivity analysis, awareness Table 22.3 Numerical Summary Measures
of unusual features of the data, such as outliers or skew- for Distribution of T
ness, can guide the analyst with respect to the choice of
statistical methods for synthesis, and alert the analyst to K3 46 Total
the potential for erroneous conclusions based on a few
n 7 11 18
influential observations. Furthermore, it is often the aim
median 0.103 0.046 0.032
of a research synthesis to compare the sample distribu- mean 0.046 0.043 0.008
tion of effect size at different values of an explanatory Q1 0.010 0.162 0.109
variable, that is, to characterize how studies differ and Q3 0.176 0.100 0.127
why. Box plots are particularly useful for comparing dis- IQR 0.186 0.262 0.236
tributions because they so clearly highlight important fea-
tures of the data. For example, we might be interested in source: Authors compilation.
seeing whether the effects of open education on student
self-concept is different for students in grades K3 versus
children in grades four through six, suggesting a
grades four through six. The investigation of such ques-
potentially greater effect of open education on self-
tions using exploratory data analysis techniques is a form
concept in the younger students. Even though the
of sensitivity analysis. As an illustration, we present in
median effect size for the studies of children in
figure 22.3, a pair of box plots displaying the distribution
grades four through six is smaller than that for those
of T in each grade range. Table 22.3 gives the relevant
of children in grades K3, note that the two distri-
numerical summary measures for each grade level. A
butions have significant overlap.
comparison of the two distributions is revealing:
2. Spread. Judging by the IQRs for each distribution,
1. Central location. The sample distribution of T for that is, the length of the boxes in figure 22.3, the
the studies of children in grades K3 is shifted to spread of the sample distribution of T is a little
the right relative to the distribution for the studies of larger for the studies of children in grades four
SENSITIVITY ANALYSIS AND DIAGNOSTICS 423

through six, 0.262, versus 0.186 for the studies of bility distribution of effect sizes? Is there evidence from
children in grades K3. the data to support a fixed or random effects model?
3. Shape. The shape of the middle 50 percent of the Should we use a weighted or unweighted summary mea-
sample distribution of T for the studies of children sure of effect size? Should we use parametric or nonpara-
in grades K3 is skewed to the left. Note that the metric procedures for assessing the statistical significance
median is closer to Q3 than to Q1. Yet, we see that of our combined results? In this section we frame some of
the right tail is longer than the left. The shape of these questions as issues in sensitivity analysis.
the middle 50 percent of the distribution for the
studies of children in grades four through six is ap- 22.4.1 Fixed Effects
proximately symmetric, though the median is a lit- Part III of this volume contains descriptions of effect size
tle closer to Q1 than to Q3, and the right tail is a little estimates from a single study, and methods of combining
longer than the left, suggesting a slight skewness to them across studies (see also Hedges and Olkin 1985). In
the right. practice, when there is a choice between two analyses or
4. Outliers. We see that each distribution has an out- between two summary measures, it is prudent to do both
lier. The solid circle in the box plot for the studies analyses to see how the conclusions differ and to under-
of children in grades K3 indicates that the outlier stand why they differ (Greenhouse et al. 1990). For ex-
in this group of studies, study 5, is very far away ample, the average of the standardized mean difference
from Q1, whereas the outlier in the group of studies for the studies of the effect of open__ education on student
for children in grades four through six is not quite self-concept given in table 22.1 is T  .008 with an ap-
as far out. proximate 95 percent confidence interval of (0.111,
0.023). Because zero is contained in this interval, we
Having identified the outliers in each group, decisions
would conclude that there is no statistically significant
about how to handle these potentially anomalous studies
difference between the effect of open education and tradi-
in subsequent analyses are necessary. Nevertheless, be-
tional education on student self-concept. It is more com-
cause of the use of robust measures of central location in
mon to consider the effect of different sample sizes in the
the box plots displayed in figure 22.3, we have a rela-
primary studies by calculating__a weighted average of the
tively good feeling for the amount of overlap of the distri-
standardized mean difference, T w using the weights given
butions for the two grade ranges. As a result, for example,
in table 22.1 where
we might be wary of a single summary measure of effect

/ /
k
size obtained by pooling across the two grade ranges. In
summary, informal exploratory analyses from the per-
wi  i i1
i and i  (nTi  nCi ) (nTi  nCi ).
spective of sensitivity analysis have been useful in high- From table 22.1 we find the weighted average of the stan-
lighting features of the data that we need to be aware of, dardized mean difference to be
sources of heterogeneity, and situations where pooling __
across a stratification variable might be inappropriate. Tw  w1  T1. . . wk  Tk  0.010.

22.4 COMBINING EFFECT SIZE ESTIMATES with an approximate 95 percent confidence interval of
(0.073, 0.093). We see that our conclusions about the
Based on the exploratory analysis, we have a better un- population effect size in this case do not depend on
derstanding of what the interesting features of the data set whether we use a weighted or unweighted estimate.
are, the effect on the research synthesis of studies that are At this stage, it is of interest to see whether any indi-
outliers, and issues related to distributional assumptions. vidual study or groups of studies have a large influence
In the next step, decisions must be made about how to on the effect size estimate. For example, recall that in sec-
combine and summarize the effect size estimates. For ex- tion 22.3 we identified several studies as potential outli-
ample, if we assume a fixed effects model, the implica- ers. One relatively simple method for this is to hold one
tion is that studies in our research synthesis represent a study out, calculate the average effect size by combining
random sample from the same population with a common the remaining studies, and compare the results with the
but unknown population effect. If we assume a random overall combined estimate. Here we are interested in as-
effects model, are we then explicitly assuming a proba- sessing how sensitive the combined measure of the effect
424 DATA INTERPRETATION

Table 22.4 The Leave One Out for the Studies However, in addition, we now see that studies 2 and 8 are
in Table 22.1 fairly influential studies, respectively. Because they are
larger studies, removing them one at a time from the anal-
Study Left ysis causes the weighted average effect size to increase.
Out Unweighted
__ Weighted
__ Computer software is available to facilitate the implemen-
i T T(i) Tw(i) tation of a leave-one-out analysis (see, for example, Bo-
renstein and Rothstein 1999).
1 0.100 0.015 0.001
Leaving out more than one study at a time can become
2 0.162 0.001 0.030
3 0.091 0.003 0.013 very complicated. For instance, if we leave out all pairs,
4 0.049 0.006 0.012 all triplets, and so on, the number of cases to consider
5 0.046 0.006 0.013 would be too large to give any useful information. It is
6 0.010 0.008 0.011 better to identify beforehand covariates that are of inter-
7 0.434 0.017 0.032 est, and leave out studies according to levels of the co-
8 0.262 0.008 0.030 variates. In our example, the only covariate provided is
9 0.135 0.017 0.003 the grade level (K3 and four through six). Leaving out
10 0.019 0.010 0.009 one level of that covariate is thus equivalent to the box
11 0.176 0.019 0.000 plot analysis in table 22.3 and figure 22.3. Finally, we
12 0.056 0.012 0.007
note that the leave-one-out method can be used for other
13 0.045 0.011 0.008
14 0.106 0.015 0.009 combining procedures. For instance, the maximum likeli-
15 0.124 0.016 0.009 hood estimate (MLE) is common when we have a spe-
16 0.491 0.020 0.020 cific model for the effect sizes. In that case, the sensitivity
17 0.291 0.026 0.023 of each case can be assessed as shown above; of course,
18 0.344 0.029 0.008 the computational burden involved will be much greater.
median 0.032 0.011 0.009
22.4.2 Random Effects
source: Authors compilation.
In the previous sections, the covariate information (grade
level) was useful in accounting for some of the variability
size is to any one particular study. In other words, how among the effect size estimates. There are other times,
large is the influence__ of each study on the overall estimate though, when such attempts fail to find consistent rela-
of effect size? Let T (i) be the unweighted average stan- tionships between study characteristics and effect size
dardized mean difference computed from all the __studies (Hedges and Olkin, 1985, 90). In such cases, random ef-
except study i. Table 22.4 presents the values of T (i) for fects models provide an alternative framework for com-
each study in table 22.1. We see that the effect of leaving bining effect sizes. These models regard the effect sizes
study 7 or 16 out of the unweighted average of the stan- themselves as random variables, sampled from a distribu-
dardized mean difference, respectively, is to result in an tion of possible effect sizes. Chapter 20 of this volume
increase in the summary estimate. Because we identified and chapter 9 of Hedges and Olkin (1985) provide dis-
these two studies earlier as outliers, this result is not sur- cussions on the distinction between random and fixed ef-
prising. We also see from table 22.4 that studies 17 and fect models, the conceptual bases underlying random ef-
18 have the largest values of T among all of the studies fects models, and their application to research synthesis.
and that the effect of leaving each of these studies out of The discussion that follows assumes familiarity with
the combined estimate, respectively, is to result in a de- those chapters. The development of methods for sensitiv-
crease in the average standardized mean difference. In ity analysis for random effects models are an important,
summary, the results in this analysis are as we might ex- open research area. In fact, Hedges and Olkin did not
pect: remove a study that has a smaller-than-the-mean ef- refer to random effects models in their chapter on diag-
fect size, and the overall summary measure goes up. nostic procedures for research synthesis (1985). In this
In table 22.4, we also report the analysis
__ of leaving one section, we briefly outline the most commonly used ran-
study out using the weighted average, T w(i). We see simi- dom effects model and the use of sensitivity analysis for
lar results for the analysis with the unweighted average. random effects models in meta-analysis.
SENSITIVITY ANALYSIS AND DIAGNOSTICS 425

Table 22.5 Comparison of Fixed Effects Analysis to Random Effects Analysis

Fixed effects

Table Study k  SE()


95% c.i. for 

22.2 Independence 7 0.095 0.099 (0.288, 0.099)


22.1 K3 Self-Concept 7 0.034 0.081 (0.125, 0.192)
22.1 46 Self-Concept 11 0.001 0.050 (0.096, 0.099)

Random effects

2 for
Table Study k  SE()
95% c.i. for  
2
 H0:2  0

22.2 Independence 7 0.095 0.161 (0.411, 0.220) 0.113 26  15.50


22.1 K3 Self-Concept 7 0.037 0.066 (0.093, 0.167) 0.0* 26  9.56
22.1 46 Self-Concept 11 0.004 0.057 (0.116, 0.107) 0.007 102 13.96

source: Authors compilation.


*The unbiased estimate, 0.012, is inadmissible.

In the random effects approach, the true effect sizes Table 22.5 contains the results of the two analyses for
{i:i  1, . . . , k} are considered realizations of a random the data in tables 22.1 and 22.2. Recall that table 22.1
variable . That is, they are a random sample from a consists of studies measuring the effects of open educa-
population of effect sizes, which is governed by some tion on self-concept and includes grade level as a covari-
distribution with mean  and variance 2, a measure of ate. Table 22.2 consists of studies measuring the effects
the between-study variability. Given sample estimates, Ti, of open education on student independence and self-reli-
of i for i  1, . . . , k, the analysis proceeds by estimating ance, but providing no covariate information. For table
these parameters or performing tests on them. For in- 22.2, the estimates of (common or mean) effect size are
stance, a common starting point is the test of the hypoth- virtually the same (   0.095,    0.095), however,
esis 2  0, a test of the homogeneity of the effect sizes the SE for the random effects model, 0.161 is greater than
across the k studies: i  . . .  k  . Note that this com- that for the fixed effects model, 0.099. Next, the estimate
mon effect size, , in the fixed-effects model has a differ-  2  0.113 is significantly different from zero, in that the
ent interpretation than : Even if  is positive, there may corresponding chi-square value is 15.50, which exceeds
be realizations i that are negative (for more discussion, the 5 percent critical value of 12.59 for six degrees of
see Hedges and Olkin 1985; DerSimonian and Laird 1986; freedom. Because this test of complete homogeneity re-
Higgins and Thompson 2002). jects, the conservative (in the sense of having wider con-
An important example of sensitivity analysis is the fidence intervals) random effects analysis is deemed more
comparison of the results of fixed and random effects valid. In this case, the substantive conclusions may well
models for a particular data set. The output from a fixed be the same under both models, given that both confi-
effects analysis includes an estimate of the common ef- dence intervals contain zero. On the other hand, when the
fect size, ,
and its estimated standard error, SE. The out- two methods differ, it is important to investigate why they
put from a random effects analysis includes an estimate differ and to explicitly recognize that the conclusions are
of the mean of the distribution of effect sizes,  , its esti- sensitive to the modeling assumption.
mated standard error, SE(), an estimate of the variance The rest of table 22.5 compares the two analyses for
of the distribution of random effects 2, and the value of the data in table 22.1. In section 22.3, we saw that grade
the chi-squared test of the hypothesis of complete homo- level, the one covariate provided, helped to explain vari-
geneity. A confidence interval for the mean effect size, , ability among the effect-size estimates. Thus, neither the
is also provided (see Hedges and Olkin 1985). fixed nor the random effects models should be applied to
426 DATA INTERPRETATION

these data without modification. We present these data to and also need to be combined, so that hierarchical models
illustrate several points. First, we can compare the two on may be constructed; ancillary information about a ques-
subsets of the data defined by the levels of the covariate tion of interest is available, for example, either from ex-
(in this case, K3, four through six). Next, making infer- pert judgment or from previous studies, which represents
ences about 2 can be difficult, because a large number of the cumulative evidence about the question of interest that
primary studies are needed, and strong distributional as- needs to be systematically incorporated with current evi-
sumptions must be satisfied before using the procedures dence; modeling assumptions need to be examined sys-
Hedges and Olkin described (1985). In fact, for the K3 tematically, or models are sufficiently complicated that
data, the unbiased estimate of 2 is negative. Although it inferences about quantities of interest require substantial
is tempting to interpret this as supporting the hypothesis computation. The Bayesian hierarchical model integrates
that 2  0 and then to use the fixed effects analysis, more the fixed-effect and random-effect models discussed ear-
research is needed to find better methods of estimating 2 lier into one framework and provides a unified modeling
for such cases (for further discussion, see chapter 20, this approach to meta-analysis. The theory and applications
volume; Hill 1965.) Of course, the methods from explor- of this approach to the problems of research synthesis
atory data analysis discussed earlier are also useful for with illustrative examples can be found in the literature
sensitivity analysis. For instance, the leave-one-out method (see, for example, Hedges 1998; Normand 1999; Rauden-
can be readily used in a random effects context to infor- bush and Bryk 2002; Stangl and Berry 2000; Sutton et al.
mally investigate the influence of studies. 2000). The statistical software package WinBUGS is
Random effects models typically assume the normal available for implementing this approach (Gilks, Thomas,
distribution for the effect sizes (for continuous data), so and Spiegelhalter 1994). Here we briefly review the major
one concern is how sensitive the analysis is to this as- elements of the Bayesian approach for meta-analysis with
sumption. The two main steps in addressing this concern a special emphasis on the role of sensitivity analysis.
are to detect departures from normality of the random ef- As before, assume that there are k independent studies
fects, and then to assess the impact of non-normality on available for a research synthesis, and let Ti denote the
the combined estimates of interest. The weighted normal study-specific effect from study i (i  1, . . . , k). It is pos-
plot is a graphical procedure designed to check the as- sible that the differences in the Ti are not only due to
sumption of normality of the random effects in linear experimental error but also due to actual differences be-
models (Dempster and Ryan 1985). John Carlin used this tween the studies. If explicit information on the source of
diagnostic in his study combining the log of odds ratios heterogeneity is available, it may be possible to account
from several 2  2 tables (1992). If the normality of the for study-level differences in the modeling, but in the ab-
random effects is found to be inappropriate, several new sence of such information a useful and simple way to ex-
problems arise. For instance, the usual parameters (such press study-level heterogeneity is with a multistage or
as ) are no longer easily interpretable. For example, if hierarchical model. Let i2 denote the within-study com-
the distribution of effect sizes is sufficiently skewed, the ponent of variance associated with each outcome mea-
mean effect size can be positive (indicating that on aver- sure Ti. The first stage of the model relates the observed
age an efficacious result), yet for more than half of all the outcome measure Ti to the underlying study-specific ef-
effect sizes can be negative. One way to assess the sensi- fect in the ith study, i. At the second stage of the model,
tivity of the analysis to this kind of departure from the the is are in turn related to the overall effect  in the
normal assumption is to use skewed densities, such as the population from which all the studies are assumed to have
gamma or log-normal families. Carrying out this kind of been sampled, and 2 is the between-study component of
sensitivity analysis is computationally intensive. variance or the variance of the effects in the population.
The model, presented in two equivalent forms, is
22.4.3 Bayesian Hierarchical Models
Stage I: i
i, i2  N(i , i2) Ti  i i
Bayesian methods have emerged as powerful tools for
i  N(0, i2)
analyzing data in many complex settings, including meta-
analysis. Situations in which Bayesian methods are espe-
Stage II: i
 ,  2  N(, 2) i    i
cially valuable are those where: information comes from
many similar sources, which are of interest individually i  N(0,  2)
SENSITIVITY ANALYSIS AND DIAGNOSTICS 427

All the i and i are considered independent. From the Bayesian models as prior distributions. Various choices
Bayesian perspective, a number of unknown parameters, for prior distributions and the consequences of such
i2, , and 2 have to be estimated, and therefore proba- choices are discussed here.
bility distributions, in this context called prior distribu- In the use of Bayesian methods in practice, there is a
tions, must be specified for them. In effect, these latter concern that inferences based on a posterior distribution
probability specifications constitute a third stage of the will be highly sensitive to the specification (or misspecifi-
model. cation) of the prior distributions. A formal attempt to ad-
A key implication of the two-stage model is that there dress this concern is the use of robust Bayesian methods.
is information about a particular i from all of the other Instead of worrying about the choice of a single correct
study-specific effects, {1, . . . , k}. Specifically, an esti- prior distribution, the robust Bayesian approach considers
mate of i is a range of specifications and assesses how sensitive infer-
ences are to the different priors (see, for example, Kass
 i ()  (1 Bi()) Ti  Bi() (),
(22.1) and Greenhouse 1988; Spiegelhalter, Abrams, and Myles
2004). In other words, if the inferences do not change as
a weighted average of the observed ith study-specific ef- the prior distributions vary over a wide range of possible
fect, Ti, and the estimate of the population mean effect, distributions, then it is safe to conclude that the infer-
(),
where the weights are Bi()  i2/(i2   2). Although ences are robust to the prior specifications. E. E. Kaizar
most meta-analyses are interested in learning about  and and colleagues illustrated the use of Bayesian sensitivity
 2, it is informative to examine the behavior of equation analysis in a meta-analysis investigating the risk of anti-
22.1 as a function of . The weighting factor B() controls depressant use and suicidality in children (2006).
how much the estimate of  i shrinks toward the popula- The specification of a prior distribution for the be-
tion mean effect. When  is taken to be 0, B() is 1 and the tween-study component of variance, 2, is often challeng-
two-stage model is equivalent to a fixed-effects model. ing. In general, an improper prior distribution, that is, one
Small values of  describe situations in which studies that does not have a finite integral, is a poor choice for the
can borrow strength from each other and information is hierarchical models used here, because the posterior dis-
meaningfully shared across studies. Very large values of tribution may also be improper, precluding appropriate
 on the other hand, imply that B() is close to 0, so that inference and model selection. One choice of prior distri-
not much is gained in combining the studies. The Bayes- bution for the between study variance is to use a member
ian approach to meta-analysis, particularly through the of the family of distributions that is conjugate to the dis-
specification of a probability distribution for , provides a tribution that models the effects of the individual studies.
formalism for investigating how similar the studies are For example, the inverse gamma distribution is the conju-
and for sharing information among the studies when they gate distribution for the variance of a normal distribution
are not identical. To investigate the sensitivity of the pos- (see, for example, Gilks, Richardson, and Spiegelhalter
terior estimates of the study effects to , William Du- 1996; for an alternative view, see Gelman 2006). With
Mouchel and Sharon-Lise Normand (2000) recommended conjugate distributions, it is often possible to sample di-
constructing a plot of the conditional posterior mean rectly from the full conditional distribution simplifying
study effect,  i(), conditioned on a range of likely values software implementation (Robert 1994, 97106). Conju-
of  to see how  i() varies as a function of . They call gate prior distributions are appropriate in two situations.
this a trace plot and illustrate its use as well as discuss The first is when the expert opinion concerning the pos-
many other practical issues in the implementation of sible distribution of study effects can be approximately
Bayesian hierarchical models in meta-analysis. represented by some distribution within the conjugate
22.4.3.1 Specification of Prior Distributions The family. Second is when there is little background infor-
models considered here for combining information from mation about possible study to study variability, a weakly
different studies include the assumption that the various informative prior distribution can often be chosen from
studies are samples from a large population of possible the conjugate family.
studies and that the studies to be combined are exchange- A variety of nonconjugate prior distributions may also
able (Draper et al. 1993). Knowledge and assumptions be chosen to help improved representation of prior be-
about such study distributions, whether derived from liefs, despite the greater computational complexity they
experts or from ancillary sources, are incorporated into require. The DuMouchel prior distribution (DuMouchel
428 DATA INTERPRETATION

A variety of nonconjugate prior distributions may also be chosen to help improve the representation of prior beliefs, despite the greater computational complexity they require. The DuMouchel prior distribution (DuMouchel and Normand 2000), also called the log-logistic distribution, is diffuse but proper. It has monotonically decreasing density, and takes the form

$$\pi(\tau) = \frac{s_0}{(s_0 + \tau)^2},$$

where τ² is the study-to-study variance, and s₀ is a parameter relating the variances of the individual studies. DuMouchel recommends

$$s_0 = \sqrt{k \Big/ \sum_{i=1}^{k} s_i^{-1}},$$

where k is the number of studies and sᵢ is the variance of the ith study. This choice results in the median of π(τ) being equal to s₀, and the first and third quartiles being equal to s₀/3 and 3s₀, respectively. With a normal distribution for the study effects, the resulting shrinkage of the posterior estimates from the individual data values toward the mean is such that there is roughly a 25 percent probability of essentially no shrinkage (each study stands on its own), a 25 percent probability of essentially complete shrinkage (all studies have an equivalent outcome effect), and a 50 percent chance that the studies borrow strength from each other, that is, that their posterior distributions represent a compromise between the individual study effect and the mean of all studies. Michael Daniels noted that the DuMouchel prior has similar properties to the uniform shrinkage prior (1999).
similar properties to the uniform shrinkage prior (1999).
information about them (Hedges 1988). Another is re-
DuMouchel and Jeffrey Harris provided an early ex-
trieval bias, which is due to the inability to retrieve all of
ample of sensitivity analysis for a meta-analysis using
the research, whether published or not (Rosenthal 1988;
Bayesian methods (1983). Their aim was to combine in-
for a discussion of other sources see Rothstein, Sutton,
formation from studies of the effects of carcinogens on
and Borenstein 2005).
humans and various animal species. Their data set con-
sisted of a matrix (rows for species, columns for carcino-
gens) of the estimated slope of the dose-response lines 22.5.1 Assessing the Presence
from each experiment. They postulated a linear model for of Publication Bias
the logarithm of these slopes. In addition, they adopted a
Figure 22.4 is an example of a funnel plot which Richard
hierarchical model for the parameters of the linear model.
Light and David Pillemer introduced for the graphical de-
That is, the parameters of the linear model were them-
tection of publication bias (1984). A funnel plot is a scat-
selves random variables with certain specified prior dis-
terplot of sample size versus estimated effect size for a
tributions. In this context, an example of a sensitivity
group of studies. Because small studies will typically
analysis that they undertook is to vary the prior distribu-
show more variability among the effect sizes than larger
tion to see how the inferences are affected. They also
studies, and there will be fewer of the latter, the plot
used a variant of the leave-one-out method by leaving out
should look like a funnel, hence its name. When there is
an entire row or an entire column from their data matrix
publication bias against, say, studies showing small effect
to investigate the fit of their model.
sizes, a bite will be taken out of a part of this plot. In fig-
22.4.4 Other Methods ure 22.4, for example, the plot is skewed to the right, and
there appears to be a sharp cutoff of effect sizes at 0.30,
There are, of course, many other analytic methods used in suggesting possible publication bias against small or neg-
research synthesis that we could consider from the view- ative effect sizes. Figure 22.5, on the other hand, shows a
point of sensitivity analysis, for example, regression anal- funnel plot for a meta-analysis where there is no publica-
ysis. Our aim in this section is to alert the reader to the tion bias: it is a plot of 100 simulated studies comparing
need for sensitivity analysis at the data combining stage treatment to control. For the ith study, there are ni obser-
of a research synthesis. The details depend on the meth- vations per group, where ni span the range 10 to 100. The
ods used to combine data. For meta-regression analysis, population effect size was set at 0.50. Notice the funnel
[Figure 22.4: Funnel Plot, Only Studies Statistically Significant at the 0.05 Level Reported. Axes: sample size (y) versus effect size (x). Source: authors' compilation.]

[Figure 22.5: Funnel Plot for All Studies. Axes: sample size (y) versus effect size (x). Source: authors' compilation.]
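The simulated dataset behind figures 22.4 through 22.6 can be reproduced in outline as follows. This is a sketch under the assumptions stated in the text (100 two-group studies, 10 to 100 observations per group, population effect size 0.50) and the selective-reporting mechanisms described below; it is not the authors' original code, the random seed is arbitrary, and "significant" is taken as two-sided p < .05.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 100
n = rng.integers(10, 101, size=k)     # per-group sample sizes, 10 to 100
se = np.sqrt(2.0 / n)                 # rough SE of a standardized mean difference
d = rng.normal(0.50, se)              # observed effect sizes, true effect 0.50
z = d / se
sig = np.abs(z) > 1.96                # two-sided 0.05 significance (assumed rule)

# figure 22.4: only studies significant at the 0.05 level are reported
d_fig4, n_fig4 = d[sig], n[sig]
# figure 22.6: significant studies reported with probability 0.8, others 0.1
keep = np.where(sig, rng.random(k) < 0.8, rng.random(k) < 0.1)
d_fig6, n_fig6 = d[keep], n[keep]
```

Plotting n against d for the full set and for each reported subset reproduces the qualitative patterns: a complete funnel, a funnel with a bite removed, and a sparser, biased funnel.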
The funnel plot is thus a useful visual diagnostic for informally assessing the file drawer problem.

Two simple biasing mechanisms are depicted in figures 22.4 and 22.6. The data in both are derived from that of figure 22.5. In figure 22.4, all studies that report statistically significant results at the 0.05 level are shown. In figure 22.6, significant results have an 80 percent chance of being reported, and the rest only a 10 percent chance (for illustrations of other mechanisms, see chapter 23, this volume).

The funnel plot, along with its variations, is certainly a useful visual aid for detecting publication bias (Galbraith 1988; Vandenbroucke 1988). Susan Duval and Richard Tweedie (2000a, 2000b) noted that these variants can give very different impressions of the extent of publication bias and proposed a quantitative complement to the funnel plot known as the trim-and-fill method. They used the fact that when there is no bias, the estimates should be symmetric about the true effect size. They use this symmetry to estimate the number of missing studies, and then impute those studies to get less biased estimates of the effect size (see, however, Terrin et al. 2003; Copas and Shi 2000). The trim-and-fill method has become rather popular, in part because of its simplicity (for further discussion, see chapter 23, this volume).

Having used this plot as a diagnostic tool to indicate the presence of publication bias, the next step is to decide just how to adjust the observed combined effect size estimate for the bias. Suppose, for the moment, that an adjustment method is agreed upon. A sensitivity analysis would find the adjusted values for a variety of possible selection mechanisms. If these values are not far from the unadjusted average, the latter is deemed robust, or insensitive to publication bias. Otherwise, the unadjusted average should not be trusted unless a subsequent search for more studies still supports it. In practice, using a single adjustment method is usually not warranted. It is better to apply a range of plausible adjustment methods to see how robust the observed value is.
[Figure 22.6: Funnel Plot for a Biased Reporting Mechanism. Axes: sample size (y) versus effect size (x). Source: authors' compilation. Note: see text for details.]

22.5.2 Assessing the Impact of Publication Bias

Publication bias can be dealt with in several ways once it has been detected, or suspected. One approach explicitly incorporates the biasing or selection mechanism into the model to get an adjusted overall estimate of the effect size. For instance, suppose that the biasing mechanism favors publication of statistically significant results, and that the primary studies report one-sided z-tests using the 0.05 level, which has a cutoff of z = 1.645. That is, the probability, w(z), of publication of a study's results increases with its z-value. An extreme case is when only the statistically significant results are published, so that

$$w(z) = \begin{cases} 0 & \text{if } z < 1.645 \\ 1 & \text{if } z \ge 1.645. \end{cases}$$

The probability w(z) is known as a weight function (Rao 1985). In this approach, the weight function is used to construct the likelihood function based on the published studies. Adjusted estimates of the effect size are then derived using standard maximum likelihood or Bayesian methods (for technical details, see the appendix and chapter 23, this volume).

Several comments about this approach are in order. First, it is flexible in the sense that it can also be used to model other types of biases, such as retrieval bias. For example, if p₁ and p₂ are the probabilities of retrieving a published and an unpublished study, respectively, the probability of a study entering a meta-analysis at all may be modeled as p₁w(z) for a published study and as p₂w(z) for an unpublished one. Several empirical studies of the retrieval process have been done, and these may be useful in providing rough guidelines for determining ranges for p₁ and p₂ (Chalmers, Frank, and Reitman 1990; Dickersin et al. 1985). Next, because the weight function is typically not known, it is useful to do the analysis of the weighted models with various plausible families of weight functions. If the resulting effect size estimates do not change much, such a sensitivity analysis will give more credence to the inferences based on the retrieved studies (Iyengar and Greenhouse 1988).

There are several theoretical and empirical studies of weight functions. Satish Iyengar and Ping Zhao (1994) studied how the MLE for the normal and Student t distributions changes as the weight function varies. In particular, they showed that for commonly used weight function families, the MLE varies monotonically with a parameter that controls the amount of selection bias within that family, so computing the range of the estimates is easy. Other studies of this approach include that of Iyengar, Paul Kvam, and Harshinder Singh, who considered the effect of weight functions on the Fisher information of the effect size (1999), which in turn dictates the accuracy of the MLE; and that of Larry Hedges and Jack Vevea, who provided extensive empirical studies of the use of weighted distributions when selection is related both to the estimated effect size and its estimated standard deviation (2005).

We end this discussion with a numerical example illustrating the use of weighted distributions for the data in table 22.1. Suppose that all studies use a two-sided z-test at the 0.05 level, which has cutoffs −1.96 and 1.96. Suppose further that the bias is such that a study with a significant result will definitely be published, and that other studies will be published with some probability less than 1. A weight function that models this situation is

$$w(z; \beta) = \begin{cases} e^{-\beta} & \text{if } -1.96 \le z \le 1.96 \\ 1 & \text{otherwise.} \end{cases}$$
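Estimation under such a weight function can be carried out numerically. The sketch below is our own illustration, with placeholder data arrays standing in for table 22.1 (which is not reproduced in this excerpt); it maximizes the weighted-normal likelihood defined in the appendix to this chapter, with w(z; β) as just given.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_log_lik(params, T, se):
    """Weighted-normal model: f(T) = phi(T; theta, se^2) w(T/se) / A,
    where w(z; beta) = exp(-beta) if |z| <= 1.96 and 1 otherwise."""
    theta, beta = params
    z = T / se
    log_w = np.where(np.abs(z) <= 1.96, -beta, 0.0)
    # A = E[w(Z)] when T ~ N(theta, se^2): nonsignificant region down-weighted
    p_ns = norm.cdf(1.96 - theta / se) - norm.cdf(-1.96 - theta / se)
    A = 1.0 - (1.0 - np.exp(-beta)) * p_ns
    return -np.sum(norm.logpdf(T, theta, se) + log_w - np.log(A))

T = np.array([0.02, -0.05, 0.11, 0.01])    # placeholder effect estimates
se = np.array([0.05, 0.06, 0.05, 0.04])    # placeholder standard errors
fit = minimize(neg_log_lik, x0=[0.0, 0.1], args=(T, se),
               bounds=[(None, None), (0.0, None)])
theta_hat, beta_hat = fit.x
print(theta_hat, np.exp(-beta_hat))        # effect estimate and the implied
                                           # publication probability e^(-beta)
```

Applied to the table 22.1 data, this is the computation that yields the estimates reported next.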
This choice of w is actually a member of a family of weight functions parameterized by a nonnegative quantity β, indicating uncertainty about the probability of publication when an effect size is not significant; the larger the value of β, the greater the degree of publication bias. Maximum likelihood estimation of the two parameters θ, the true effect size, and β, the bias parameter, yields 0.006 (SE = 0.033) and 0.17 (SE = 0.192), respectively. Because the estimate of the true effect size based on the unweighted model is 0.008, we conclude that the result is not sensitive to this choice of publication bias mechanism. The estimate of β indicates that the probability of publishing a statistically nonsignificant result is about 0.85. Note, however, that the standard error for the estimate of β is relatively large, indicating that the data provide little information about that parameter.

22.5.3 Other Approaches

Other approaches to sensitivity analysis for dealing with publication bias include the fail-safe sample size approach of Robert Rosenthal (1979). This uses combined significance levels rather than effect size estimates. The fail-safe N is then the smallest number of studies with nonsignificant results that would be needed to overturn the result based on the retrieved studies (for a critique of and variations on this approach, see Iyengar and Greenhouse 1988; for a Bayesian approach, see Bayarri 1988). Finally, John Copas and J. Q. Shi proposed a latent variable that models the publication process: the variable depends linearly on the reciprocal of the study's standard error, it is correlated with the effect size estimate, and it allows the study to be seen when it is positive (2000). They then varied the slope and intercept of the model, which tune the degree of publication bias, to assess the effects on the overall effect size.

22.6 SUMMARY

Sensitivity analysis is a systematic approach to address the question, What happens if some aspect of the data or the analysis is changed? Because at every step of a research synthesis decisions and judgments are made that can affect the conclusions and generalizability of the results, we believe that sensitivity analysis has an important role to play in research synthesis. For the results of a meta-analysis to be convincing to a wide range of readers, it is important to assess the effects of these decisions and judgments. One way to do this is to demonstrate that a broad range of viewpoints and statistical analyses yield substantially the same conclusions. Sensitivity analysis and diagnostics as described in this chapter provide a conceptual framework and the technical tools for doing just that. On the other hand, a meta-analysis is useful even if the conclusions are sensitive to such alternatives, for then the sensitivity analysis points to the features of the problem that deserve further attention.

22.7 APPENDIX

Suppose that w(z) is interpreted as the probability of publishing a study when the effect size estimate from that study is z. If an effect size estimate is normally distributed, the published effect size estimate from a study is a random variable from the weighted normal probability density

$$f(z; \theta, w) = \frac{\phi(z - \theta;\, \sigma^2/n)\, w(z)}{A(\theta, \sigma^2/n; w)},$$

where φ(z; s²) is the density of the normal distribution with mean zero and variance s²,

$$A(\theta, \sigma^2; w) = \int_{-\infty}^{\infty} \phi(t - \theta;\, \sigma^2)\, w(t)\, dt$$

is the normalizing constant, and n and θ are the sample size and the true effect size for that study, respectively. Inference about the effect size can then be based on this weighted model, using maximum likelihood or Bayes estimates. Because the weighted model is also a member of an exponential family, an iterative procedure for computing the estimate is relatively easy. Often, primary studies report t statistics with their degrees of freedom; in that case, a noncentral t density replaces the normal density above, and the computations become considerably more difficult (for an example involving the combination of t statistics, see Iyengar and Greenhouse 1988). When the degrees of freedom within each primary study are fairly large, however, the normal approximation is quite accurate.

22.8 REFERENCES

Bayarri, Mary J. 1988. "Discussion." Statistical Science 3(1): 109–35.
Borenstein, Michael, and Hannah Rothstein. 1999. Comprehensive Meta-Analysis: A Computer Program for Research Synthesis. Englewood, N.J.: Biostat, Inc. http://www.meta-analysis.com.
Carlin, John B. 1992. "Meta-Analysis for 2 × 2 Tables: A Bayesian Approach." Statistics in Medicine 11(2): 141–58.
Chalmers, Thomas, C. Frank, and D. Reitman. 1990. "Minimizing the Three Stages of Publication Bias." Journal of the American Medical Association 263(10): 1392–95.
Copas, John B., and Jian Qing Shi. 2000. "Meta-Analysis, Funnel Plots and Sensitivity Analysis." Biostatistics 1(3): 247–62.
Daniels, Michael J. 1999. "A Prior for the Variance in Hierarchical Models." The Canadian Journal of Statistics 27(3): 567–78.
Dempster, Arthur, and Louise M. Ryan. 1985. "Weighted Normal Plots." Journal of the American Statistical Association 80(392): 845–50.
DerSimonian, Rebecca, and Nan Laird. 1986. "Meta-Analysis in Clinical Trials." Controlled Clinical Trials 7(3): 177–88.
Dickersin, Kay, P. Hewitt, L. Mutch, Iain Chalmers, and Thomas Chalmers. 1985. "Perusing the Literature: Comparison of MEDLINE Searching with a Perinatal Trials Database." Controlled Clinical Trials 6(4): 306–17.
Draper, David, James S. Hodges, Colin L. Mallows, and Daryl Pregibon. 1993. "Exchangeability and Data Analysis (with discussion)." Journal of the Royal Statistical Society, Series A 156: 9–37.
DuMouchel, William, and Jeffrey Harris. 1983. "Bayes Methods for Combining the Results of Cancer Studies in Humans and Other Species." Journal of the American Statistical Association 78(382): 293–315.
DuMouchel, William, and Sharon-Lise Normand. 2000. "Computer-Modeling and Graphical Strategies for Meta-Analysis." In Meta-Analysis in Medicine and Health Policy, edited by Dalene Stangl and Don Berry. New York: Marcel Dekker.
Duval, Susan, and Richard Tweedie. 2000a. "A Non-Parametric 'Trim and Fill' Method of Assessing Publication Bias in Meta-Analysis." Journal of the American Statistical Association 95(449): 89–98.
Duval, Susan, and Richard Tweedie. 2000b. "Trim and Fill: A Simple Funnel Plot Based Method of Testing and Adjusting for Publication Bias in Meta-Analysis." Biometrics 56(2): 455–63.
Fienberg, Stephen, Matthew Johnson, and Brian Junker. 1999. "Classical Multilevel and Bayesian Approaches to Population Size Estimation Using Multiple Lists." Journal of the Royal Statistical Society, Series A 162(3): 383–405.
Galbraith, R. 1988. "A Note on Graphical Presentation of Estimated Odds Ratios from Several Clinical Trials." Statistics in Medicine 7(8): 889–94.
Gaver, Donald P., David Draper, Prem K. Goel, Joel B. Greenhouse, Larry V. Hedges, Carl N. Morris, and Christine Waternaux. 1992. Combining Information: Statistical Issues and Opportunities for Research. Washington, D.C.: National Academy Press.
Gelman, Andrew. 2006. "Prior Distributions for Variance Parameters in Hierarchical Models." Bayesian Analysis 1(3): 515–33.
Gilks, Walter R., Sylvia Richardson, and David J. Spiegelhalter, eds. 1996. Markov Chain Monte Carlo in Practice. London: Chapman & Hall.
Gilks, Walter R., A. Thomas, and David J. Spiegelhalter. 1994. "A Language and Program for Complex Bayesian Modeling." The Statistician: Journal of the Institute of Statisticians 43(1): 169–77.
Greenhouse, Joel B., Davida Fromm, Satish Iyengar, Mary A. Dew, Audrey Holland, and Robert Kass. 1990. "The Making of a Meta-Analysis: A Quantitative Review of the Aphasia Treatment Literature." In The Future of Meta-Analysis, edited by Kenneth W. Wachter and Miron L. Straf. New York: Russell Sage Foundation.
Hedges, Larry V. 1988. "Directions for Future Methodology." In The Future of Meta-Analysis, edited by Kenneth W. Wachter and Miron L. Straf. New York: Russell Sage Foundation.
Hedges, Larry V. 1998. "Bayesian Approaches to Meta-Analysis." In Recent Advances in the Statistical Analysis of Medical Data, edited by Brian Everitt and Graham Dunn. London: Edward Arnold.
Hedges, Larry V., and Ingram Olkin. 1985. Statistical Methods for Meta-Analysis. New York: Academic Press.
Hedges, Larry V., and Jack Vevea. 2005. "Selection Model Approaches to Publication Bias." In Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments, edited by Hannah Rothstein, Alexander J. Sutton, and Michael Borenstein. New York: John Wiley & Sons.
Hedges, Larry V., Rose Giaconia, and Nathaniel Gage. 1981. The Empirical Evidence on the Effectiveness of Open Education. Stanford, Calif.: Stanford University.
Higgins, Julian, and Simon Thompson. 2002. "Quantifying Heterogeneity in a Meta-Analysis." Statistics in Medicine 21(11): 1539–58.
Hill, Bruce M. 1965. "Inference about Variance Components in the One-Way Model." Journal of the American Statistical Association 60(311): 806–25.
Iyengar, Satish, and Joel B. Greenhouse. 1988. "Selection Models and the File-Drawer Problem (with discussion)." Statistical Science 3(1): 109–35.
Iyengar, Satish, Paul Kvam, and Harshinder Singh. 1999. "Fisher Information in Weighted Distributions." Canadian Journal of Statistics 27(4): 833–41.
Iyengar, Satish, and Ping L. Zhao. 1994. "Maximum Likelihood Estimation for Weighted Distributions." Statistics and Probability Letters 21: 37–47.
Kaizar, Eloise E., Joel B. Greenhouse, Howard Seltman, and Kelly Kelleher. 2006. "Do Antidepressants Cause Suicidality in Children? A Bayesian Meta-Analysis (with discussion)." Clinical Trials: Journal of the Society for Clinical Trials 3(2): 73–98.
Kass, Robert, and Joel B. Greenhouse. 1989. "Comment on 'Investigating Therapies of Potentially Great Benefit: ECMO,' by J. Ware." Statistical Science 4(4): 310–17.
Koopmans, Lambert. 1987. Introduction to Contemporary Statistical Methods. Boston, Mass.: Duxbury Press.
Light, Richard, and David Pillemer. 1984. Summing Up. Cambridge, Mass.: Harvard University Press.
Normand, Sharon-Lise. 1999. "Meta-Analysis: Formulating, Evaluating, Combining, and Reporting." Statistics in Medicine 18(3): 321–59.
Rao, C. Radhakrishna. 1985. "Weighted Distributions Arising out of Methods of Ascertainment: What Population Does a Sample Represent?" In A Celebration of Statistics, edited by Anthony C. Atkinson and Stephen E. Fienberg. New York: Springer.
Raudenbush, Stephen W., and Anthony S. Bryk. 2002. Hierarchical Linear Models: Applications and Data Analysis Methods, 2nd ed. Newbury Park, Calif.: Sage Publications.
Robert, Christian P. 1994. The Bayesian Choice. New York: Springer-Verlag.
Rosenthal, Robert. 1979. "The 'File Drawer Problem' and Tolerance for Null Results." Psychological Bulletin 86: 638–41.
Rosenthal, Robert. 1988. "An Evaluation of Procedures and Results." In The Future of Meta-Analysis, edited by Kenneth W. Wachter and Miron L. Straf. New York: Russell Sage Foundation.
Rothstein, Hannah, Alexander J. Sutton, and Michael Borenstein, eds. 2005. Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments. New York: John Wiley & Sons.
Spiegelhalter, David J., Keith R. Abrams, and Jonathan P. Myles. 2004. Bayesian Approaches to Clinical Trials and Health-Care Evaluation. Chichester: John Wiley & Sons.
Stangl, Dalene K., and David A. Berry, eds. 2000. Meta-Analysis in Medicine and Health Policy. New York: Marcel Dekker.
Sutton, Alexander J., Keith R. Abrams, David R. Jones, Trevor A. Sheldon, and Fujian Song. 2000. Methods for Meta-Analysis in Medical Research. Chichester: John Wiley & Sons.
Terrin, Norma, Chris Schmid, Joe Lau, and Ingram Olkin. 2003. "Adjusting for Publication Bias in the Presence of Heterogeneity." Statistics in Medicine 22(13): 2113–26.
Van Houwelingen, Hans C., Lidia R. Arends, and Theo Stijnen. 2002. "Advanced Methods in Meta-Analysis: Multivariate Approach and Meta-Regression." Statistics in Medicine 21(4): 589–624.
Vandenbroucke, Jan P. 1988. "Passive Smoking and Lung Cancer: A Publication Bias?" British Medical Journal 296: 391–92.
Wachter, Kenneth, and Miron Straf. 1990. The Future of Meta-Analysis. New York: Russell Sage Foundation.
Weisberg, Sanford. 1980. Applied Linear Regression. New York: John Wiley & Sons.
Weisberg, Sanford. 1985. Applied Linear Regression, 2nd ed. Chichester: John Wiley & Sons.
23

PUBLICATION BIAS

ALEX J. SUTTON

CONTENTS
23.1 Introduction 436
23.2 Mechanisms 437
23.3 Methods for Identifying Publication Bias 437
23.3.1 The Funnel Plot 437
23.3.2 Statistical Tests 440
23.3.2.1 Nonparametric Correlation Test 440
23.3.2.2 Linear Regression Tests 440
23.4 Methods for Assessing Impact of Publication Bias 442
23.4.1 Fail Safe N 442
23.4.2 Trim and Fill 443
23.4.3 Selection Modeling 444
23.4.3.1 Suppression as a Function of p-Value Only 445
23.4.3.2 Suppression as a Function of Effect Size and Its Standard Error 446
23.4.3.3 Simulation Approaches 446
23.4.3.4 Bayesian Approaches 446
23.4.4 Excluding Studies Based on Their Size 447
23.5 Methods to Address Specific Dissemination Biases 447
23.5.1 Outcome Reporting Biases 448
23.5.2 Subgroup Reporting Biases 448
23.5.3 Time-Lag Bias 448
23.6 Discussion 448
23.7 References 449
23.1 INTRODUCTION

That the published scientific literature documents only a proportion of the results of all research carried out is a perpetual concern. Further, there is good direct and indirect evidence to suggest that the unpublished proportion may be systematically different from the published, since selectivity may exist in deciding what to publish (Song et al. 2000; Dickersin 2005). For example, researchers may choose not to write up and submit studies with uninteresting findings (such as nonstatistically significant effect sizes), or such studies may not be accepted for publication. Even if a study is published, there may be selectivity in which aspects are written up. For example, significant outcomes (Chan et al. 2004) or subgroups may be given precedence over nonsignificant ones. There is also evidence to suggest that studies with significant outcomes are published more quickly than those with nonsignificant outcomes (Stern and Simes 1997). Additionally, there are concerns that research with positive and statistically significant results is published in more prestigious places and cited more times, making it more visible, and hence easier to find. Indeed, the publication process should be thought of as a continuum and not a dichotomy (Smith 1999). In keeping with the previous literature, these (whole) publication and related biases will be referred to simply as publication bias throughout the chapter, unless a more specific type of bias is being discussed, though dissemination bias is perhaps a more accurate name for the collection (Song et al. 2000).

In areas where any such selectivity exists, a synthesis based on only the published results will be biased. Publication bias is therefore a major threat to the validity of meta-analysis and other synthesis methodologies. However, it should be pointed out that this is not an argument against systematic synthesis of evidence, because publication bias affects the literature, and hence any methodology used to draw conclusions from the literature (using one or more studies) will be similarly threatened. Thus, publication bias is fundamentally a problem with the way individual studies are conducted, disseminated, and interpreted. By conducting a meta-analysis, we can at least attempt to identify and estimate the likely effect of such bias by considering the information contained in the distribution of effect sizes from the available studies. This is the basis of the majority of statistical methods described later.

There is agreement that prevention is the best solution to the problem of selectively reported research. Indeed, with advances in electronic publishing making the presentation of large amounts of information more economically viable than traditional paper-based publishing methods, there is some hope that the problem will diminish, if not disappear. No evidence yet exists to indicate this is the case, however, nor is this a solution to the suppression of information due to vested economic interests (Halpern and Berlin 2005).

The use of prospective registries of studies for selecting studies to be included in systematic reviews has been suggested (Berlin and Ghersi 2005), because it provides an unbiased sampling frame guaranteeing the elimination of publication bias (relating to the suppression of whole studies, at least). However, trial registration, per se, does not guarantee availability of data, and an obligation to disclose results in a form that is accessible to the public is also required. Further, registries exist for randomized controlled trials in numerous medical areas. Such a solution will not be feasible, however, even in the long term, for some forms of research, including that relating to analysis of (routine) observational data, where the notion of a study that can be registered before analysis may be nonexistent. The notion of prospectively designing multiple studies with the intention of carrying out a meta-analysis in the future has also been put forward as a solution to the problem (Berlin and Ghersi 2005), but again, this may be difficult to orchestrate in some situations.

Carrying out as comprehensive a search as possible when obtaining literature for a synthesis will help minimize the influence of publication bias. In particular, this may involve searching the grey literature for studies not formally published (see chapter 6 in this volume; Hopewell, Clarke, and Mallett 2005). Since the beginning of the Internet, the feasibility and accessibility of publication by means other than commercial publishing houses has greatly increased. This implies that the importance of searching the grey literature may increase in the future.

Despite researchers' best efforts, at least in the current climate, alleviation of the problem of publication bias may not be possible in many areas of science. In such instances, graphical and statistical tools have been developed to address publication bias within a meta-analysis framework. The remainder of this chapter provides an overview of these methods, which aim to detect and adjust for such biases. It should be pointed out that if a research synthesis does not contain a quantitative synthesis (for example, if results are considered too heterogeneous, or the data being synthesized are not quantitative) then, though publication bias may still be a problem, methods to deal with it are still very limited.
23.2 MECHANISMS

Discussion in the literature about the precise nature of the mechanisms that lead to suppression of whole studies and other forms of publication bias is considerable. If these mechanisms could be accurately specified and quantified, then, just like any bias, the appropriate adjustments to a meta-analytic dataset to counteract them might be fairly straightforward. Unfortunately, measuring such effects directly, globally, or in specific datasets is difficult and, of course, the mechanisms may vary with dataset and subject area.

What is generally agreed is that the statistical significance, effect size, and study size could all influence the likelihood of a study being published, although this list is probably not exhaustive (Begg and Berlin 1988). The sections that follow outline methods to identify and adjust for publication bias. It is important to note that these assume different underlying mechanisms for publication bias: some assume that suppression is only a function of effect size, others that it is only a function of statistical significance (whether this is assumed one or two sided is open to debate), and still others that it is a function of these or other factors, such as the origin of funding for a study and the like.

Although the primary goal here is to provide a pragmatic and up-to-date overview of the statistical methods to address publication bias in meta-analysis, an emphasis has been placed on highlighting the assumptions made regarding the underlying suppression mechanisms assumed by each method. It is hoped that insight into how the different methods relate to each other can be facilitated through this approach. That the actual mechanisms are usually unknown makes being absolutist about recommendations on use of the different methods difficult. This issue is considered further in the discussion.

23.3 METHODS FOR IDENTIFYING PUBLICATION BIAS

23.3.1 The Funnel Plot

The funnel plot is generally considered a good exploratory tool for investigating publication bias and, like the more commonly used forest plot, for providing a visual summary of a meta-analytic dataset (Sterne, Becker, and Egger 2005; see chapter 26, this volume). It is essentially a scatter plot of a measure of study size against a measure of effect size. The expectation is that it should appear symmetric with respect to the distribution of effect sizes, and funnel shaped, if no bias is present. The premise for this appearance is that effect sizes should be evenly distributed (that is, symmetric) around the underlying true effect size, with more variability in the smaller studies than the larger ones because of the greater influence of sampling error (that is, producing a narrowing funnel shape as study precision increases). If publication bias is present, we might expect some suppression of smaller, unfavorable, and nonsignificant studies, which could be identified by a gap in one corner of the funnel and hence could induce asymmetry in the plot.

[Figure 23.1: Symmetric Funnel Plot, Overall Mortality. Axes: 1/standard error (y) versus outcome, ln(odds of death) (x). Source: Bown et al. 2002.]

Example plots are provided in figures 23.1 and 23.2. Note that in these plots the outcome measure is plotted on the x-axis, the most common way such plots are presented, though it goes against the convention that the unknown quantity is plotted on the y-axis. Funnels in both orientations exist in the literature. These are real data relating to total mortality and to mortality during surgery following ruptured abdominal aortic aneurysm repair, from published (observational) patient case-series (Bown et al. 2002; Jackson, Copas, and Sutton 2005). They are included here because they exemplify a reasonably symmetrical funnel plot (figure 23.1) and a more asymmetric one (figure 23.2), with a lack of small studies with small effects in the bottom right-hand corner. Also interesting is that this would appear to be a good example of outcome reporting bias, given that only seventy-seven studies report mortality during surgery (figure 23.2) out of the 171 reports included in figure 23.1 for total mortality.
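A funnel plot of this kind is straightforward to construct from a set of effect estimates and their standard errors. The sketch below, with hypothetical inputs, plots precision (1/standard error) against the outcome, as in figures 23.1 and 23.2.

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical log-odds outcomes and standard errors for a set of studies
effect = np.array([0.8, -0.1, 0.4, 1.2, 0.2, 0.6])
se = np.array([0.50, 0.40, 0.25, 0.60, 0.15, 0.30])

plt.scatter(effect, 1.0 / se)      # precision on the y-axis
plt.xlabel("Outcome (ln(odds))")
plt.ylabel("1/standard error")
plt.show()
```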
[Figure 23.2: Asymmetric Funnel Plot, Mortality During Surgery. Axes: 1/standard error (y) versus outcome, ln(odds of death) (x). Source: authors' compilation.]

The gap at the bottom right of the surgery mortality plot (figure 23.2) suggests that studies in which mortality during surgery had high rates were less likely to report this outcome than studies with lower rates.

For this example, because the data are noncomparative, statistical significance of the outcome is not defined, and hence it is assumed that suppression is related to the magnitude of the outcome (example funnel plots for comparative data, where statistical significance potentially influences publication, are considered below). It is important to emphasize that funnel plots are often difficult to interpret. For example, close scrutiny of figures 23.1 and 23.2 suggests that, though producing a reasonably symmetric funnel in outline, the density of points is not consistent within the funnel, a higher density being apparent on the right side in figure 23.1, particularly for the lower third of the plot, suggesting some bias.

It is interesting to note that study suppression caused by study size, effect size, or statistical significance (one sided), either individually or in combination, could produce an asymmetric funnel plot. A further possibility is that two-sided statistical significance suppression mechanisms (that is, significant studies in either direction are more likely to be published than those that are not) could create a tunnel or hole in the middle of the funnel, particularly when the underlying effect size is close to 0.

Certain issues need to be considered when constructing and interpreting funnel plots. Generally, the more studies in a meta-analysis, the easier it is to detect gaps and hence potential publication bias. If numerous studies are included, density problems may arise (see figures 23.1 and 23.2). The spread of the size of the studies is another important factor; a funnel would be difficult to interpret, for example, if all the studies were of similar size. Ultimately, any assessment of whether publication bias is present using a funnel plot is subjective (Lau et al. 2006). A recent empirical evaluation also suggests that interpretation of such a plot can be limited (Terrin, Schmid, and Lau 2005).

There has been debate over the most appropriate axes for the funnel plot (Sterne and Egger 2001), particularly with respect to the y-axis (the measure of study precision), with sample size, variance (and its inverse), and standard error (and its inverse) all being options. This choice can affect the appearance of the plot considerably. For instance, if the variance or standard error is used, then the top of the plot is compressed while the bottom of the plot is expanded, compared to when using inverse variance, inverse standard error, or sample size. This has the effect of giving more space to the smaller studies, in which publication bias is more likely. Example comparative plots have been published elsewhere (Sterne and Egger 2001). If variance or standard error or their inverses are used, it is possible to construct expected 95 percent confidence intervals around the pooled estimate that form guidelines for the expected shape of the funnel (under a fixed effect assumption) and that can aid the interpretation of the plot. Such guidelines should be thought of as approximate, because they are constructed around the meta-analytic pooled estimate, which may itself be biased because of publication bias. Their assistance in interpreting the plot is in indicating regions where studies are to be expected if no suppression exists. These guidelines can also help to indicate how heterogeneous the studies are (see chapter 24, this volume) by observing how dispersed the studies are with respect to the guidelines.
It is important to appreciate that funnel plot asymmetry may not be related to publication bias at all, but to other factors instead. Any factor associated both with study size and effect size could potentially confound the observed relationship. For example, smaller studies may be carried out under less rigorous conditions than larger studies. These poorer quality smaller studies may be more prone to bias and thus result in exaggerated effect size estimates. Conversely, smaller studies could be conducted under more carefully controlled experimental conditions than the larger studies, resulting in differences in effect sizes. Further, there may be situations where a higher intensity of the intervention was possible for the smaller studies, leading to true effect sizes in the smaller studies that are more efficacious than those for the larger studies. If situations like these exist, then there is a relationship between study size and effect size that would cause a funnel plot of the studies to be asymmetric, which could be wrongly diagnosed as publication bias. A graphical way of exploring potential causes of heterogeneity in a funnel plot would be to plot studies with different values of a covariate of interest (for example, study quality) using different symbols, or to use multiple funnel plots of subgrouped data, and to explore whether this could explain differences between study results. For some outcome measures, such as odds ratios, the estimated effect size and its standard error are correlated, and the more extreme the effect size is, the stronger the correlation. This implies that the funnel plot on such a scale will not be symmetrical even if no publication bias is present. The induced asymmetry, however, is quite small for small and moderate effect sizes (Peters et al. 2006).

[Figure 23.3: Funnel Plot, Teacher Expectancy on Pupil IQ, with 95% Guidelines. Axes: standard error (y) versus standardized effect size (x). Source: authors' compilation.]

[Figure 23.4: Funnel Plot, Teacher Expectancy on Pupil IQ, Prior Contact One Week or Less. Axes: standard error (y) versus standardized effect size (x). Source: authors' compilation.]

[Figure 23.5: Funnel Plot, Teacher Expectancy on Pupil IQ, Prior Contact More Than One Week. Axes: standard error (y) versus standardized effect size (x). Source: authors' compilation.]

Example: Effects of Teacher Expectancy on Pupil IQ. Figure 23.3 presents a funnel plot with the approximate expected 95 percent confidence intervals around the pooled estimate for the expected shape of the funnel under a fixed effect assumption, plotting standard error on the y-axis. The guidelines are straight lines and thus the theoretical underlying shape of the funnel is also straight sided. Because the majority of the points (seventeen of nineteen) lie within the guidelines, there is little evidence of heterogeneity between studies. There is, though, some suggestion of a gap in the bottom left-hand corner of the funnel, making it asymmetric, that is, possibly a symptom of publication bias. However, the weeks of teacher-student contact before expectancy induction is a known modifier for this dataset, where effect sizes are systematically higher in the studies with lower previous contact (see chapter 16, this volume). Figures 23.4 and 23.5 present funnel plots of the subgroups for prior contact of one week or less and more than one week, respectively. These funnels are centered around quite different effect sizes
and there is little evidence of asymmetry in either plot, though splitting the data reduces the numbers of studies in each plot, which does hamper interpretation. Hence it would appear that the asymmetry in figure 23.3 was probably caused by confounding by the moderator, prior contact time, rather than by publication bias (Sterne et al. 2005).

In the next section, we present formal statistical tests of asymmetry in a funnel plot. These get around some of the issues relating to subjectivity of assessment of funnel plots and induced asymmetry on some outcome scales. They are all, however, tests for asymmetry and not publication bias per se, given that they do not (automatically) adjust for confounding variables.

23.3.2 Statistical Tests

All the tests described provide a formal statistical test for asymmetry of a funnel plot, should be considered complementary to it, and are based on the premise of testing for an association between observed effect size and (a function of) its standard error or study sample size.

23.3.2.1 Nonparametric Correlation Test The rank correlation test examines the association between effect estimates and their variances (Begg and Mazumdar 1994). Define the standardized effect sizes of the k studies to be combined as

$$T_i^* = (T_i - \bar{T}) \big/ \sqrt{\tilde{v}_i^*}, \qquad (23.1)$$

where Tᵢ and vᵢ are the estimated effect size and sampling variance from the ith study, the usual fixed-effect estimate of the pooled effect is

$$\bar{T} = \left( \sum_{j=1}^{k} v_j^{-1} T_j \right) \Bigg/ \left( \sum_{j=1}^{k} v_j^{-1} \right), \qquad (23.2)$$

and

$$\tilde{v}_i^* = v_i - \left( \sum_{j=1}^{k} v_j^{-1} \right)^{-1}, \qquad (23.3)$$

which is the variance of (Tᵢ − T̄).

The test is based on deriving a rank correlation between Tᵢ* and ṽᵢ*, achieved by comparing the ranks of the two quantities. Having assigned each Tᵢ* and ṽᵢ* a rank (largest value gets rank 1, and so on), it is then necessary to evaluate all k(k − 1)/2 possible pairs of the k studies. Define the number of all possible pairings in which one factor is ranked in the same order as the other as P (that is, the ranks of both Tᵢ* and ṽᵢ* are higher or lower for one study compared to the other), and the number in which the ordering is reversed as Q (that is, the rank of Tᵢ* is higher than that of the paired study while the rank of ṽᵢ* is lower, or vice versa). A normalized test statistic is obtained by calculating

$$Z = (P - Q) \Big/ \sqrt{k(k-1)(2k+5)/18}, \qquad (23.4)$$

which is the normalized Kendall rank correlation test statistic and can be compared to the standard normal distribution. Any effect-size scale can be used as long as the estimates can be assumed asymptotically normal.

The test is known to have low power for meta-analyses that include few studies (Sterne, Gavaghan, and Egger 2000). From comparisons between this and the linear regression test described later (Sterne, Gavaghan, and Egger 2000), it would appear that the linear regression test is more powerful than the rank correlation test, though results of the two tests can sometimes be discrepant.

Example: Effects of Teacher Expectancy on Pupil IQ. Applying the rank correlation test to this dataset produces a test statistic of Z = 1.75, leading to p = 0.08. Although not formally significant at the 5 percent level, such a result should, on the basis of the known low power of the test, be treated with concern. However, because such a test does not take into account the modifier (previous contact with student), such an analysis is of limited use in this context.

23.3.2.2 Linear Regression Tests An alternative parametric test based on linear regression has been described by Matthias Egger and his colleagues (1997). This regresses the standard normal deviate, zᵢ (that is, zᵢ = Tᵢ/√vᵢ), against precision, 1/√vᵢ:

$$z_i = \beta_0 + \beta_1 \big(1/\sqrt{v_i}\big) + \varepsilon_i, \qquad (23.5)$$

where εᵢ ~ N(0, σ²). This regression can be thought of as fitting a line to Galbraith's radial plot, in which the fitted line is not constrained to go through the origin (1994; see also chapter 22, this volume). The intercept provides a measure of asymmetry, and the standard two-sided test of whether β₀ = 0 (that is, β̂₀/se(β̂₀) compared to a t-distribution with n − 2 degrees of freedom) is Egger's test of asymmetry. A negative intercept indicates that smaller studies are associated with bigger effects. This unweighted regression is identical to the weighted regression

$$T_i = \beta_1 + \beta_0 \sqrt{v_i} + \varepsilon_i \sqrt{v_i}, \qquad (23.6)$$

with weights 1/vᵢ.
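Both tests can be computed in a few lines. The sketch below is our own implementation, with T and v as placeholder arrays of effect sizes and sampling variances; it follows equations 23.1 through 23.5 directly, and ties are ignored for simplicity.

```python
import numpy as np
from scipy import stats

def begg_mazumdar(T, v):
    """Rank correlation test of eqs. 23.1 to 23.4."""
    w = 1.0 / v
    T_bar = np.sum(w * T) / np.sum(w)            # eq. 23.2
    v_star = v - 1.0 / np.sum(w)                 # eq. 23.3
    T_star = (T - T_bar) / np.sqrt(v_star)       # eq. 23.1
    k = len(T)
    P = Q = 0
    for i in range(k):
        for j in range(i + 1, k):
            if (T_star[i] - T_star[j]) * (v_star[i] - v_star[j]) > 0:
                P += 1                           # same ordering on both factors
            else:
                Q += 1                           # ordering reversed
    z = (P - Q) / np.sqrt(k * (k - 1) * (2 * k + 5) / 18.0)   # eq. 23.4
    return z, 2 * stats.norm.sf(abs(z))

def egger(T, v):
    """Regression test of eq. 23.5: t-test of the intercept beta_0."""
    prec = 1.0 / np.sqrt(v)
    z = T * prec
    X = np.column_stack([np.ones_like(prec), prec])
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ beta
    s2 = resid @ resid / (len(T) - 2)
    se_b0 = np.sqrt(s2 * np.linalg.inv(X.T @ X)[0, 0])
    t0 = beta[0] / se_b0
    return t0, 2 * stats.t.sf(abs(t0), len(T) - 2)
```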
Thus, it is possible to plot this regression line on a funnel plot. Note that this model is neither the standard fixed nor the standard random effect meta-regression model commonly used (Thompson and Sharp 1999; see also chapter 16, this volume). Here the error term includes a multiplicative overdispersion parameter, rather than the usual additive between-study variance component, to account for heterogeneity. There would appear to be no theoretical justification for this model over the more commonly used meta-regression formulations.

Egger's test has highly inflated type 1 errors in some circumstances when binary outcomes are considered, at least in part because of the correlation between estimated (log) odds ratios and their standard errors discussed earlier (Macaskill, Walter, and Irwig 2001; Schwarzer, Antes, and Schumacher 2002; Peters et al. 2006). This has led to a number of modified tests being developed for binary outcomes (Macaskill, Walter, and Irwig 2001; Harbord, Egger, and Sterne 2006; Peters et al. 2006; Schwarzer, Antes, and Schumacher 2007) and more generally (Stanley 2005). A comparative evaluation of these modified tests is required before guidance can be given on which test is optimal in a given situation.

If a meta-analysis dataset displays heterogeneity that can be explained by including study-level covariates, these should be accounted for before an assessment of publication bias is made, because their influence can distort a funnel plot and thus any statistical methods based on it, as discussed earlier. One way of doing this is to extend the regression models to include study-level covariates; evaluating the performance of such an approach is ongoing work. Alternatively, if covariates are discrete, separate assessments can be made for each grouping. Splitting the data in this way will reduce the power of the tests considerably, however.
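One way of folding a covariate into the weighted form of the test (equation 23.6) is simply to add it as a regressor. The sketch below is our own illustration, not an established implementation; x is a hypothetical study-level covariate (for example, prior contact time, dichotomized).

```python
import numpy as np

def egger_with_covariate(T, v, x):
    """Weighted regression T_i = b1 + b0*sqrt(v_i) + b2*x_i with weights 1/v_i;
    b0, the asymmetry coefficient, is tested after adjusting for x."""
    se = np.sqrt(v)
    X = np.column_stack([np.ones_like(T), se, x])
    W = np.diag(1.0 / v)
    XtWX = X.T @ W @ X
    beta = np.linalg.solve(XtWX, X.T @ W @ T)
    resid = T - X @ beta
    # multiplicative overdispersion, as in the unadjusted test
    s2 = np.sum(resid**2 / v) / (len(T) - X.shape[1])
    t_stats = beta / np.sqrt(s2 * np.diag(np.linalg.inv(XtWX)))
    return beta, t_stats   # b0 is beta[1]; its t-ratio is t_stats[1]
```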
Use of these regression tests is becoming commonplace in the medical literature but is much less common in other areas, including the social sciences. Although these tests are generally more powerful than the nonparametric alternative, they still have low power for meta-analyses of small to moderate numbers of studies. Because these are typical in the medical literature, caution should be exerted in interpreting nonsignificant test results. The tests will have much improved performance characteristics with larger numbers of studies, as typically found in the social sciences. Regardless of the number of studies in the meta-analysis, caution should also be exerted when interpreting significant test results because, as discussed, detection of funnel asymmetry does not necessarily imply publication bias.

[Figure 23.6: Egger's Regression Line on Galbraith Plot, Teacher Expectancy on Pupil IQ. Axes: standardized effect (y) versus precision (x). Source: authors' compilation.]

[Figure 23.7: Egger's Regression Line on Funnel Plot, Teacher Expectancy on Pupil IQ. Axes: standard error (y) versus standardized effect size (x). Source: authors' compilation.]

Example: Effects of Teacher Expectancy on Pupil IQ. Egger's regression test is applied to the teacher expectancy example (because the outcome is measured on a continuous scale, the modified alternatives are not applicable). Figures 23.6 and 23.7 are two graphical representations of the fitted regression line, on a Galbraith plot and on a funnel plot, respectively. On the Galbraith plot, inferences are made on whether the intercept is different from 0, though it is whether the gradient of the line is different from 0 that is of interest on the funnel plot (though, of course, estimation is equivalent). The p-value for this test is 0.057. As with the rank-based test discussed earlier, this does not take into account the known modifier, prior contact time, and hence is of limited value. However, further covariates can be included in the regression model, allowing known confounders such as this to be adjusted for. In this example, it would be possible to include prior contact time either as a continuous covariate, as it was recorded, or dichotomized, as was done to plot the subgrouped funnels in figures 23.4 and 23.5. A scatter plot of effect size against prior contact time suggests a nonlinear relationship, and hence prior contact time is dichotomized, as before, for simplicity, and included in the regression model. When this is done, the p-value for the test statistic increases to 0.70, removing a lot of the suspicion of publication bias, though caution should be exerted that the reduction in significance is not due to collinearity of covariates (Sterne and Egger 2005).

It is important to point out that testing for publication bias in no way deals with the problem if bias exists. Indeed, what should be concluded if a significant asymmetry test result is obtained for a meta-analysis? Clearly, caution in the conclusions of the meta-analysis can be expressed because the pooled estimate may be biased. Further, whether or not the asymmetry is caused by publication bias, the usual meta-analysis fixed and random effect modeling assumptions are probably violated (see chapter 14, this volume). If funnel asymmetry exists for studies on a certain topic today, chances are that the asymmetry will not go away in the future. If it does, it is likely that it will be largely due to beneficial or statistically significant studies being published more quickly than those with unfavorable or nonsignificant results, a phenomenon sometimes called pipeline bias. If the aim of a meta-analysis is ultimately to inform policy-related decisions, then simply flagging a meta-analysis as potentially biased is not ultimately helpful. The next section outlines methods for assessing the impact of possible publication bias on a meta-analysis and examines methods for adjusting for such bias.

23.4 METHODS FOR ASSESSING THE IMPACT OF PUBLICATION BIAS

Here we examine three approaches to assessing the impact of publication bias on a meta-analysis. The first provides an estimate of the number of missing studies that would need to exist to overturn the current conclusions. The second and third outline nonparametric and parametric methods of adjusting the pooled meta-analytic estimate.

23.4.1 Fail Safe N

This simple method was one of the first described to address the problem of publication bias. It considers the question of how many new studies averaging a null result are required to bring the overall treatment effect to nonsignificance (Rosenthal 1979).

The method is based on combining the normal z-scores (the Zᵢ) corresponding to the p-values observed for each study. The overall z-score can be calculated by

$$Z = \sum_{i=1}^{k} Z_i \Big/ \sqrt{k}, \qquad (23.7)$$

where k is the number of studies. This combination of z-scores is a z-score itself, and it is considered significant if Z > Z_{α/2}, where Z_{α/2} is the upper α/2 percentage point of the standard normal distribution, which is 1.96 for α = 5 percent.

The approach determines the number of unpublished studies, with an average observed effect of zero, that there would need to be to reduce the overall z-score to nonsignificance. Define k₀ to be the additional number of studies required such that

$$\sum_{i=1}^{k} Z_i \Big/ \sqrt{k + k_0} = Z_{\alpha/2}. \qquad (23.8)$$

Rearranging this gives

$$k_0 = -k + \left( \sum_{i=1}^{k} Z_i \right)^{\!2} \Big/ \big(Z_{\alpha/2}\big)^2. \qquad (23.9)$$

After k₀ is calculated, one can judge whether it is realistic to assume that this many studies exist unpublished in the research domain under investigation. The estimated fail-safe N should be considered in proportion to the number of published studies (k). Rosenthal suggested that the fail-safe N may be considered unlikely, in reality, to exist if it is greater than a tolerance level of 5k + 10 (1979).
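Equation 23.9 translates directly into code. The sketch below (ours) also reproduces the chapter's worked example for the teacher expectancy data, in which the z-scores of nineteen studies sum to 11.03.

```python
import numpy as np

def fail_safe_n(z_scores, z_crit=1.96):
    """Rosenthal's fail-safe N (eq. 23.9): the number of unpublished
    zero-effect studies needed to bring the combined z-score (eq. 23.7)
    down to z_crit."""
    z = np.asarray(z_scores, dtype=float)
    k0 = (z.sum() / z_crit) ** 2 - len(z)
    return int(np.ceil(max(k0, 0.0)))

# worked example from the text: nineteen z-scores summing to 11.03;
# only the sum and the count enter the formula
z = [11.03] + [0.0] * 18
print(fail_safe_n(z), 5 * 19 + 10)   # -> 13, against Rosenthal's tolerance 105
```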
be required to overturn the meta-analysis result. As a re- effect size scale. It also assumes that the studies to the far
sult, it has lead to unjustified complacency about publica- left-hand-side of the funnel are missing (the number is
tion bias. Still another shortcoming is that the method determined by the method) and hence is a one-sided pro-
does not adjust or deal with treatment effectsjust p-val- cedure. If missing studies are expected on the other side
ues (and thus no indication of the true effect size is ob- of the funnel, the data can be flippedby multiplying the
tained). Yet another is that heterogeneity among the stud- effect sizes by 1before applying the method. In short,
ies is ignored (Iyengar and Greenhouse 1988). Last is that the method calculates how many studies would need to
the method is not influenced by the shape of the funnel be trimmed off the right side of the funnel, to leave a
plot (Orwin 1983). symmetric center. The adjusted pooled effect size is then
Example: Effects of Teacher Expectancy on Pupil IQ. calculated. The trimmed studies are reinstated together
Individual z scores are computed by dividing the effect with the same number of filled studies that are imputed as
sizes by the corresponding standard errors. This leads to mirror images. This augmented symmetrical dataset is
Zi 11.03. Using equation 23.9 gives k0  19  (11.03/ used to calculate the variance of the pooled estimate.
1.96)2, leading to k0  13. This implies that if thirteen or An iterative procedure is used to obtain an estimate of
more studies are unretrieved with zero effect (compared the number of missing studies. Three estimators were
to the nineteen published ones in the dataset), on aver- originally investigated, two shown through simulation to
age, then the meta-analysis would be nonsignificant at the be superior (Duval and Tweedie 2000a, 2000b). Initially,
5 percent level. However, as before, the effect of previous the pooled effect size is subtracted from each studys ef-
contact with student has not been adjusted for in the anal- fect size estimate. Ranks are then assigned according to
ysis and hence these results should be treated with cau- their absolute values. It is then helpful to construct the
tion. Furthermore, extensions to allow for including co- vector of rankings, ordered by effect size, indicating those
variates have not been developed. Failsafe N could be estimates that are less than the pooled effect size by add-
calculated for separate subgroups of data (defined by co- ing a negative indicator. For example, if a vector (6,
variates) but this is not pursued here. 5, 2, 1, 3, 4, 7, 8, 9, 10, 11) were obtained, it would
Several extensions and variations of Rosenthals file- indicate that three of the eleven estimates are less than the
drawer method have been developed that include alterna- pooled effect size, and have absolute effect magnitudes
tive file-drawer calculations (Orwin 1983; Iyengar and ranked two, five, and six in the dataset (higher ranks indi-
Greenhouse 1988; Becker 2005; Gleser and Olkin 1996). cating studies with estimates that deviate the most from
One is based on effect-size rather than p-value reduction the pooled effect estimate). The first estimator of the num-
(Orwin 1983) and methods for estimating the number of ber of missing studies is defined as
unpublished studies (Gleser and Olkin 1996). This latter
method can be seen as a halfway house between simple R0  * 1, (23.10)
file-drawer approaches and the selection modeling ap-
where * is the rightmost run of consecutive ranks of the
proaches outlined below.
23.4.2 Trim and Fill

The nonparametric trim and fill method for adjusting for publication bias was developed as a simpler alternative to parametric selection models (section 23.4.3) for addressing the problem of publication bias (Duval and Tweedie 2000a, 2000b). It is now the most popular method for adjusting for publication bias (Borenstein 2005). Also, because the method is a way of formalizing the use of a funnel plot, the ideas underlying the statistical calculations can be communicated and understood visually, increasing its accessibility.

The method is based on rectifying funnel plot asymmetry and assumes that suppression is working on the effect size scale. It also assumes that the studies to the far left-hand side of the funnel are missing (the number is determined by the method) and hence is a one-sided procedure. If missing studies are expected on the other side of the funnel, the data can be flipped, by multiplying the effect sizes by −1, before applying the method. In short, the method calculates how many studies would need to be trimmed off the right side of the funnel to leave a symmetric center. The adjusted pooled effect size is then calculated. The trimmed studies are reinstated together with the same number of filled studies that are imputed as their mirror images. This augmented symmetrical dataset is used to calculate the variance of the pooled estimate.

An iterative procedure is used to obtain an estimate of the number of missing studies. Three estimators were originally investigated, two of which were shown through simulation to be superior (Duval and Tweedie 2000a, 2000b). Initially, the pooled effect size is subtracted from each study's effect size estimate. Ranks are then assigned according to the estimates' absolute values. It is then helpful to construct the vector of rankings, ordered by effect size, indicating with a negative sign those estimates that are less than the pooled effect size. For example, if the vector (−6, −5, −2, 1, 3, 4, 7, 8, 9, 10, 11) were obtained, it would indicate that three of the eleven estimates are less than the pooled effect size and have absolute effect magnitudes ranked two, five, and six in the dataset (higher ranks indicating studies with estimates that deviate the most from the pooled effect estimate). The first estimator of the number of missing studies is defined as

R0 = γ* − 1,   (23.10)

where γ* is the length of the rightmost run of consecutive ranks in the vector so defined; that is, for the example given, γ* = 5 because the rightmost run of ranks is of length five, determined by the run seven, eight, nine, ten, and eleven, hence R0 = 4. The estimated R0 studies with the largest effect sizes are removed (trimmed) and the process continues iteratively until the estimate of the number of missing studies converges. The second estimator is

L0 = [4Tn − n(n + 1)] / (2n − 1),   (23.11)

where Tn is the sum of the ranks of the studies whose effect sizes are greater than the pooled estimate (the Wilcoxon statistic for the dataset) and n indicates the number of studies. In the example vector given, Tn = 1 + 3 + 4 + 7 + 8 + 9 + 10 + 11 = 53. It follows that L0 = [4(53) − 11(12)]/[2(11) − 1] = 3.81, which is rounded to four missing studies estimated. Like R0, L0 is calculated iteratively until convergence is achieved. Full worked examples of these iterative calculations are available elsewhere (Duval and Tweedie 1998; Duval 2005). Since neither estimator is superior in all situations, use of both as a sensitivity analysis has been recommended (Duval 2005).
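Because the rank bookkeeping behind R0 and L0 is easy to misstep by hand, a short Python sketch of a single iteration may help. It assumes, as above, one-sided suppression on the left-hand side of the funnel and takes the current pooled estimate as given; the function name is ours, and a full implementation would also trim the flagged studies, re-estimate the pooled effect, and repeat to convergence.

```python
def trim_and_fill_estimators(effects, pooled):
    """One iteration of the R0 and L0 estimators of the number of
    missing studies (equations 23.10 and 23.11).

    Effects are centered at the pooled estimate and ranked by absolute
    value; ranks of estimates below the pooled value carry a negative sign.
    """
    deviations = [e - pooled for e in effects]
    n = len(deviations)
    order = sorted(range(n), key=lambda i: abs(deviations[i]))
    signed_ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        signed_ranks[i] = rank if deviations[i] >= 0 else -rank

    # R0 = gamma* - 1, where gamma* is the length of the rightmost run of
    # consecutive positive ranks n, n - 1, ... (zero if rank n is negative).
    positive = {r for r in signed_ranks if r > 0}
    gamma_star, r = 0, n
    while r in positive:
        gamma_star += 1
        r -= 1
    r0 = max(0, gamma_star - 1)

    # L0 is based on Tn, the sum of the positive ranks (the Wilcoxon
    # statistic for the dataset).
    t_n = sum(r for r in signed_ranks if r > 0)
    l0 = (4 * t_n - n * (n + 1)) / (2 * n - 1)
    return r0, round(l0)
```

For the illustrative vector above (negative ranks two, five, and six among eleven studies), this returns R0 = 4 and L0 = 4, matching the worked calculation.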
In practice, either a fixed or random effects meta-analysis estimate can be used for the iterative trimming and filling parts of this method. In the initial exposition, random effects estimates were used (Duval and Tweedie 2000a, 2000b). Random effects estimators can be more influenced by publication bias, however, given that the less precise studies, those more prone to publication bias, are given more weight in relative terms (Poole and Greenland 1999). Using fixed-effect methods to trim studies should therefore be considered and has recently been evaluated, though it does not always outperform the use of a random-effects model (Peters et al. 2007).

The performance of this method was initially evaluated using simulation studies under conditions of study homogeneity, and under the suppression mechanism the method assumes, that is, that the n most extreme effect sizes are suppressed (Duval and Tweedie 2000a, 2000b). Under these conditions, it performed well. However, further simulations suggest that it performs poorly in the presence of among-study heterogeneity when no publication bias is present (Terrin et al. 2003). Further evaluation of the method in scenarios where suppression is assumed to be based on statistical significance, with and without heterogeneity present, suggests that it outperforms nonadjusted meta-analytic estimates but can provide corrections that are far from perfect (Peters et al. 2007).

As for interpretation of the funnel plot, as discussed, the asymmetry this model rectifies may be caused by factors other than publication bias, which should be borne in mind when interpreting the results. A limitation of the method is that it does not allow for the inclusion of study-level covariates as moderator variables.

Example: Effects of Teacher Expectancy on Pupil IQ. For illustration, trim and fill, using a fixed effect model and the L0 estimator, is applied to the teacher expectancy dataset. Three studies are estimated as missing and imputed as shown in figure 23.8 (the triangles). The fixed effect pooled estimate for the original dataset was 0.060 (95 percent CI −0.011 to 0.132) and this reduced to 0.025 (95 percent CI −0.045 to 0.095) when the three studies were filled. Because this analysis ignores the covariate, previous contact with student, it should be interpreted cautiously. Although trim and fill does not allow moderator variables, separate analyses of subgroups of the data are possible. If this is done here, no studies are filled for the prior contact longer than a week subgroup and one study is imputed for the contact of a week or less subgroup, which changes the pooled estimate by only a small amount, from 0.357 (0.203 to 0.511) to 0.319 (0.168 to 0.469).

[Figure 23.8. Trim and Fill Analysis, Teacher Expectancy on Pupil IQ. Funnel plot of standard error (vertical axis) against standardized effect size (horizontal axis); circles are original studies, triangles are filled (imputed) studies. source: Author's compilation.]
23.4.3 Selection Modeling

Selection models can be used to adjust a meta-analytic dataset by specifying a model that describes the mechanism by which effect estimates are assumed to be either suppressed or observed. This is used with an effect size model that describes what the distribution of effect sizes would be if there were no publication bias to correct for.

If the selection model is known, use of the method is relatively straightforward, but the precise nature of the suppression will usually be unknown. If this is the case, alternative candidate selection models can be applied as a form of sensitivity analysis (Vevea and Woods 2005), or the selection model can actually be estimated from the observed meta-analytic dataset. Because the methods are fully parametric, they can include study-level moderator variables. Although complex to implement, they have been recommended over other methods, which potentially produce misleading results, in scenarios where effect sizes are heterogeneous (Terrin et al. 2003). A potential drawback is that they perform poorly if the number of observed studies is small.

A comprehensive review of selection models has recently been published elsewhere and is recommended reading for an expanded and more statistically rigorous account, particularly regarding the implementation of those methods that require complex numerical methods (Hedges and Vevea 2005).

Two classes of explicit selection models have been developed: those that model suppression as a function of an effect size's p-value, and those that model suppression as a function of a study's effect size and standard error simultaneously. Both are implemented through weighted distributions that give the likelihood a given effect estimate will be observed if it occurs. Each is discussed in turn.

23.4.3.1 Suppression as a Function of p-Value Only. The simplest selection model suggested is one that assumes all studies with statistically significant effect sizes are observed (for example, p < 0.05 two-tailed, or p > 0.975 or p < 0.025 one-tailed) and all others are suppressed (Lane and Dunlap 1978; Hedges 1984). The associated weight function is represented graphically in figure 23.9, where the likelihood a study is observed if it occurs, w(p), is given on the y-axis and the p-value scale is along the x-axis.

[Figure 23.9. Simple Weight Function, Published if Significant: the probability of selection is one for p-values below 0.05 and zero otherwise. source: Author's compilation.]

Using this model, any effect size with a p-value below 0.05 has a probability of one (certainty) of being observed, and zero otherwise. Perhaps it is more realistic to assume that studies that are not statistically significant have a fixed probability of less than one that they are published, that is, that not all nonsignificant studies are suppressed. Alternatively, it could be assumed that the probability a nonsignificant study is published is some continuous decreasing function of the p-value, that is, the larger the p-value, the less chance a study has of being published (Iyengar and Greenhouse 1988). Figure 23.10 illustrates a possible weight function in this class, with the probability of observing an effect size diminishing exponentially as the p-value increases.

[Figure 23.10. Example Weight Function, Probability of Publication Diminishes: the probability of selection declines continuously as the p-value increases (reference line at p = 0.05). source: Author's compilation.]

More realistic still may be to assume a step function with steps specified as occurring at perceived milestones in statistical significance, based on the psychological perception that a p-value of 0.049 is considerably different from one of 0.051 (Hedges 1992; Vevea, Clements, and Hedges 1993; Hedges and Vevea 1996). Figure 23.11 provides a graphical representation of a weight function of this type. Finally, it is possible that the location and the height of the steps in such a weight function can be determined from the data (Dear and Begg 1992). It should be noted, however, that when the selection model is left to be determined by the data, it
is often estimated poorly, because the observed effects include little information on the selection process. It is possible to include study-level covariates in such models (Vevea and Hedges 1995). This can be done to remove confounding effects in the distribution of effect sizes, as discussed when considering alternative reasons for funnel plot asymmetry.

[Figure 23.11. Example "Psychological Steps" Weight Function: the probability of selection declines in steps at p = 0.01, 0.05, 0.10, and 0.50. source: Author's compilation.]
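To make the weight-function idea concrete, the sketch below codes a step function like the one in figure 23.11 and uses it in a simple reweighted average. The cut points and weight values are hypothetical, and the weighted average is only a heuristic for intuition: the formal methods discussed here embed w(p) in a weighted likelihood and estimate the model by maximum likelihood rather than plugging in fixed weights.

```python
from bisect import bisect_right

def step_weight(p, cuts=(0.01, 0.05, 0.10, 0.50),
                weights=(1.0, 0.9, 0.7, 0.5, 0.3)):
    """Step weight function w(p): the assumed probability that a study
    with p-value p is observed. Requires len(weights) == len(cuts) + 1."""
    return weights[bisect_right(cuts, p)]

def reweighted_pooled_estimate(effects, variances, p_values):
    """Heuristic bias-adjusted fixed-effect estimate: each study's
    inverse-variance weight is divided by its assumed selection
    probability, up-weighting the kinds of studies most often missing."""
    num = den = 0.0
    for d, v, p in zip(effects, variances, p_values):
        w = (1.0 / v) / step_weight(p)
        num += w * d
        den += w
    return num / den
```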
23.4.3.2 Suppression as a Function of Effect Size and Its Standard Error. These models assume the probability that a result is suppressed is a function of both effect size and standard error, with the probability of suppression increasing as effect size decreases and standard error increases (Copas and Li 1997; Copas and Shi 2000). Although such a model is perhaps more realistic than one modeling suppression as related to p-value, it does mean there are identifiability problems, implying that the parameters cannot be estimated from the effect size data alone. This in turn means that if you believe this model to be plausible, correcting for publication bias is not possible without making assumptions that cannot be tested (Copas and Shi 2001). It has thus been recommended to evaluate such models over the plausible regions of values for the model parameters that determine the suppression process (Copas and Shi 2000, 2001). To this end, a bound for the bias as a function of the fraction of missing studies has been proposed (Copas and Jackson 2004).

23.4.3.3 Simulation Approaches. One recent development is a simulated pseudo-data approach to correct for publication bias (Bowden, Thompson, and Burton 2006). This is not really a different type of selection model, but rather an alternative way of evaluating them. It requires the selection model to be fully specified, but does allow arbitrarily complex selection processes to be modeled.

23.4.3.4 Bayesian Approaches. Several authors have considered implementing selection models from a Bayesian perspective (Givens, Smith, and Tweedie 1997; Silliman 1997). One feature of this approach is that it can provide an estimate of the number of suppressed studies and their effect size distribution.

Example: Effects of Teacher Expectancy on Pupil IQ. A detailed analysis of applying different selection models to this dataset, including those that depend on the p-value with and without the weight function estimated from the data, and those depending on effect size and standard error simultaneously, is available elsewhere (Hedges and Vevea 2005). Here, I summarize one of these analyses as an example of how such models can be applied.

This analysis uses a fixed effect model including prior contact time as a dichotomous moderator variable, as defined earlier. The weight function has two cut points, at p = 0.05 and 0.30, with weights estimated from the data. Although using more cut points might be desirable, the relatively small number of studies means more parameters could not be estimated. The estimated weight function is presented in figure 23.12. Here the probability of selection is set to one for p-values between 0 and 0.05, and it is estimated as 0.89 (SE 0.80) and 0.56 (SE 0.64) for the intervals 0.05 to 0.30 and 0.30 to 1.00, respectively. Note the large standard errors on these weights, which are typical of models in which the weights are estimated from the data, particularly when the number of studies is not large. The estimated pooled effect for less than one week of prior contact changed from 0.349 (SE 0.079) to 0.321 (SE 0.111) when the selection model was applied, and the covariate effect similarly changed from 0.371 (SE 0.089) to 0.363 (SE 0.109), indicating that the conclusions appear to be relatively robust to the effects of publication bias.

A summary of the results from applying the different methods to this dataset is presented in table 23.1 and shows broad consistency. When prior contact is not
accounted for in the model, a suggestion of asymmetry or bias is evident, but when this possible confounder is included, the threat of bias appears much lessened.

[Figure 23.12. Estimated Weight Function, Teacher Expectancy Example: probability of selection equal to one below p = 0.05, with estimated steps at p = 0.05 and 0.30. source: Hedges and Vevea (2005), "Selection Method Approaches."]

Table 23.1 Summary, Methods to Address Publication Bias

Method: Funnel plot of all the data. Findings: Funnel visibly asymmetric.
Method: Funnel plot subgrouped by prior contact. Findings: Little evidence of asymmetry for either funnel.
Method: Egger's regression test ignoring prior contact data. Findings: p = 0.057 (i.e., some evidence of bias).
Method: Egger's regression test including prior contact covariate. Findings: p = 0.70 (i.e., little evidence of bias).
Method: File drawer method ignoring prior contact data. Findings: 13 studies would have to be missing (compared to the 19 observed) to overturn conclusions.
Method: Trim and fill ignoring prior contact data. Findings: 3 studies estimated missing; effect size changes from 0.060 (s.e. 0.036) to 0.025 (s.e. 0.036).
Method: Trim and fill subgrouped by prior contact. Findings: for > 1 week prior contact, 0 studies estimated missing; for ≤ 1 week prior contact, 1 study estimated missing, and the effect size changes from 0.357 (s.e. 0.079) to 0.319 (s.e. 0.077).
Method: Selection model including prior contact covariate (fixed effect, 2 cut points, p = 0.05 and 0.30, estimated from data). Findings: for > 1 week prior contact, effect size changes from 0.371 (s.e. 0.089) to 0.363 (s.e. 0.109); for ≤ 1 week prior contact, effect size changes from 0.349 (s.e. 0.079) to 0.321 (s.e. 0.111).

source: Author's compilation.

23.4.4 Excluding Studies Based on Their Size

A simple alternative to adjusting for publication bias is to assume that studies of a certain size, regardless of their results, will be published, and hence to include only these in the meta-analysis. This effectively gives smaller studies zero weight in the meta-analysis. If publication bias exists only in the small studies, or is most severe there (an assumption that underlies several of the statistical models presented), this will generally result in less biased pooled estimates; however, the choice of cut-off for study inclusion or exclusion is arbitrary.

23.5 METHODS TO ADDRESS SPECIFIC DISSEMINATION BIASES

As mentioned earlier, incomplete data reporting may occur at various levels below the suppression of whole studies. Analyses of specific outcomes or subgroups, for example, may have been conducted but not written up and hence not be available for meta-analysis. Similarly, all the details of an analysis required for meta-analysis may not be reported. For example, the standard error of an effect size or a study-level covariate required for meta-regression may not have been published.

This latter category of missing data may be entirely innocent and attributable simply to lack of journal space or a lack of awareness about the importance of reporting such information. Such missingness may thus not be related to outcome, and the data can be assumed missing (completely) at random. If this is the case, standard methods for dealing with missing data can be applied to the meta-analytic dataset, though this is rarely done in practice
(Little and Rubin 1987; Pigott 2001; Sutton and Pigott 2004). More ad hoc methods can also be applied as necessary (Song et al. 1993; Abrams, Gillies, and Lambert 2005).

If the missing data are due to less innocent reasons, then it is probably safest to assume the data are not missing at random. This is the typical assumption made when addressing publication bias, though it is often not framed this way. Because outcomes, subgroups, and the like may be suppressed under mechanisms similar to those acting on whole studies, such missingness may manifest itself in a similar way, and hence the methods covered thus far will be appropriate to address it. There may be advantages, however, to developing and applying methods that address specific forms of missing information. This area of research is in its infancy.

23.5.1 Outcome Reporting Biases

The issue of outcome reporting bias has received considerable attention in recent years, and empirical research indicates that it is a serious problem, at least for randomized controlled trials in medicine (Hahn, Williamson, and Hutton 2002; Chan, Hróbjartsson et al. 2004; Chan, Krleza-Jeric et al. 2004; Chan and Altman 2005). This evidence lends weight to the premise that if a study measures multiple outcomes, those that are statistically significant are more likely to be published than those that are not. Such bias could be expected to be detected as general publication bias and dealt with in that way. Such an analysis, however, would ignore the patterns of missing data across the multiple outcomes measured across all studies, and information, such as levels of statistical significance, across all reported outcomes. A method has been developed that does consider such patterns and adjusts estimates under the assumption that the most statistically significant of a fixed number of identically distributed independent outcomes are reported (Hutton and Williamson 2000; Williamson and Gamble 2005). Although its assumptions are unrealistically strict, the model does provide an upper bound on the likely impact of outcome reporting bias and could help determine whether contacting study investigators for the potentially missing data would be cost effective.

A selection model has also been developed for a context-specific application, success rates of surgery for emergency aneurysm repair, to address outcome reporting bias based on the dataset plotted in figures 23.1 and 23.2 (Jackson, Copas, and Sutton 2005). Here an outcome assumed to be reported with no bias provided information on a further outcome where selectivity in reporting was evident. Assuming all bias was due to outcome reporting, that is, that there were no completely missing studies, the extra information gained from the outcome for which publication bias was assumed not present meant that the selection model was identifiable and a unique solution could be obtained without further assumptions.

23.5.2 Subgroup Reporting Biases

Subgroup analysis suppression could exist under a mechanism similar to that for outcomes: potentially many subgroups could be compared, and only those that are interesting or statistically significant reported. A sensitivity analysis approach similar to that for outcome reporting bias has been suggested that involves data imputation for missing subgroup data under the assumption that the results that are not published tend to be nonsignificant (Hahn et al. 2000).

23.5.3 Time-Lag Bias

Time-lag bias may exist if research with large effect sizes or significant results is published more quickly than research with small effect sizes or nonsignificant results. When such a bias operates, the first published studies show systematically greater effect sizes compared to subsequently published investigations (Trikalinos and Ioannidis 2005). The summary effect would thus diminish over time as more data accumulate. Evidence that such a bias exists in the field of genetic epidemiology is strong (Ioannidis et al. 2001). Methods to detect such biases have been described elsewhere (Trikalinos and Ioannidis 2005).

23.6 DISCUSSION

The aim of this chapter is to provide an overview of the problem of publication bias, with a focus on the methods developed to address it in a research synthesis context. Whatever approach is taken, the fact remains that publication bias is a difficult problem to deal with, because the mechanisms causing the bias are usually unknown, and the merit of any method to address it depends on how close the assumptions the method makes are to the truth. Methods developed for adjusting pooled results have therefore been recommended only as sensitivity analyses. Although this would seem sensible advice, the context of the meta-analysis would also seem an important issue, given
that sensitivity analysis is of limited use in a decision-making context. That is, you have to make one decision based on the synthesis regardless of how many sensitivity analyses are performed.

Software has been developed that implements many of the methods described here. An overview and comparison of software functionality in addressing publication bias is available elsewhere (Borenstein 2005). Exceptions are the selection models and the methods to deal specifically with outcome, subgroup, and time-lag biases. If these methods are to be used more routinely, more user-friendly software clearly needs to be developed.

Novel and emerging areas of research in publication bias not covered here include an investigation of how publication bias affects the estimation of the between-study heterogeneity parameter in a random effects meta-analysis model (Jackson 2006), allowing for differential levels of publication bias across studies of different study designs (Sutton, Abrams, and Jones 2002), and capture-recapture methods across electronic databases to estimate the number of missing studies (Bennett et al. 2004).

Future investigation may prove fruitful in at least three areas. One involves developing a unified approach for dealing with data missing due to different selection mechanisms, such as simultaneously adjusting for missing studies and for missing outcomes within known studies, rather than treating each separately. Another entails using meta-epidemiological research to develop global estimates of bias as a function of study precision that might in turn be used to adjust future meta-analyses by incorporating them into prior distributions for a Bayesian analysis (Sterne et al. 2002). A third is developing methods to address publication bias for statistical models that are more complicated than a standard meta-analysis, where construction of a funnel plot is often not possible (Ades and Sutton 2006).

23.7 REFERENCES

Abrams, Keith R., Clare L. Gillies, and Paul C. Lambert. 2005. "Meta-Analysis of Heterogeneously Reported Trials Assessing Change from Baseline." Statistics in Medicine 24(24): 3823–44.
Ades, Anthony, and Alexander J. Sutton. 2006. "Multiple Parameter Evidence Synthesis in Epidemiology and Medical Decision-Making: Current Approaches." Journal of the Royal Statistical Society Series A 169(1): 5–35.
Becker, Betsy J. 2005. "Failsafe N or File-Drawer Number." In Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments, edited by Hannah R. Rothstein, Alexander J. Sutton, and Michael Borenstein. Chichester: John Wiley & Sons.
Begg, Colin B., and Jesse A. Berlin. 1988. "Publication Bias: A Problem in Interpreting Medical Data (with discussion)." Journal of the Royal Statistical Society Series A 151(3): 419–63.
Begg, Colin B., and Madhuchhanda Mazumdar. 1994. "Operating Characteristics of a Rank Correlation Test for Publication Bias." Biometrics 50(4): 1088–101.
Bennett, Derrick A., Nancy K. Latham, Caroline Stretton, and Craig S. Anderson. 2004. "Capture-Recapture Is a Potentially Useful Method for Assessing Publication Bias." Journal of Clinical Epidemiology 57(4): 349–57.
Berlin, Jesse A., and Davina Ghersi. 2005. "Preventing Publication Bias: Registries and Prospective Meta-Analysis." In Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments, edited by Hannah R. Rothstein, Alexander J. Sutton, and Michael Borenstein. Chichester: John Wiley & Sons.
Borenstein, Michael. 2005. "Software for Publication Bias." In Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments, edited by Hannah R. Rothstein, Alexander J. Sutton, and Michael Borenstein. Chichester: John Wiley & Sons.
Bowden, Jack, John R. Thompson, and Paul Burton. 2006. "Using Pseudo-Data to Correct for Publication Bias in Meta-Analysis." Statistics in Medicine 25(22): 3798–3813.
Bown, Matthew J., Alexander J. Sutton, Peter R. Bell, and Robert D. Sayers. 2002. "A Meta-Analysis of 50 Years of Ruptured Abdominal Aortic Aneurysm Repair." British Journal of Surgery 89(6): 118.
Chan, An-Wen, and Douglas G. Altman. 2005. "Identifying Outcome Reporting Bias in Randomised Trials on PubMed: Review of Publications and Survey of Authors." British Medical Journal 330(7494): 753.
Chan, An-Wen, Asbjørn Hróbjartsson, Mette T. Haahr, Peter C. Gøtzsche, and Douglas G. Altman. 2004. "Empirical Evidence for Selective Reporting of Outcomes in Randomized Trials: Comparison of Protocols to Published Articles." Journal of the American Medical Association 291(20): 2457–65.
Chan, An-Wen, Karmela Krleza-Jeric, Isabelle Schmid, and Douglas G. Altman. 2004. "Outcome Reporting Bias in Randomized Trials Funded by the Canadian Institutes of Health Research." Canadian Medical Association Journal 171(7): 735–40.
Copas, John B., and Daniel Jackson. 2004. "A Bound for Publication Bias Based on the Fraction of Unpublished Studies." Biometrics 60(1): 146–53.
Copas, John B., and Hu G. Li. 1997. "Inference for Non-Random Samples." Journal of the Royal Statistical Society Series B 59(1): 55–95.
Copas, John B., and Jian Q. Shi. 2000. "Reanalysis of Epidemiological Evidence on Lung Cancer and Passive Smoking." British Medical Journal 320(7232): 417–18.
Copas, John B., and Jian Q. Shi. 2001. "A Sensitivity Analysis for Publication Bias in Systematic Reviews." Statistical Methods in Medical Research 10(4): 251–65.
Dear, Keith B.G., and Colin B. Begg. 1992. "An Approach for Assessing Publication Bias Prior to Performing a Meta-Analysis." Statistical Science 7(2): 237–45.
Dickersin, Kay. 2005. "Publication Bias: Recognizing the Problem, Understanding Its Origins and Scope, and Preventing Harm." In Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments, edited by Hannah R. Rothstein, Alexander J. Sutton, and Michael Borenstein. Chichester: John Wiley & Sons.
Duval, Sue. 2005. "The Trim and Fill Method." In Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments, edited by Hannah R. Rothstein, Alexander J. Sutton, and Michael Borenstein. Chichester: John Wiley & Sons.
Duval, Sue, and Richard Tweedie. 1998. "Practical Estimates of the Effect of Publication Bias in Meta-Analysis." Australasian Epidemiologist 5: 14–17.
Duval, Sue, and Richard Tweedie. 2000a. "A Nonparametric 'Trim and Fill' Method of Assessing Publication Bias in Meta-Analysis." Journal of the American Statistical Association 95(449): 89–98.
Duval, Sue, and Richard Tweedie. 2000b. "Trim and Fill: A Simple Funnel-Plot-Based Method of Testing and Adjusting for Publication Bias in Meta-Analysis." Biometrics 56(2): 455–63.
Egger, Matthias, George Davey Smith, Martin Schneider, and Christoph Minder. 1997. "Bias in Meta-Analysis Detected by a Simple, Graphical Test." British Medical Journal 315(7109): 629–34.
Galbraith, Rex F. 1994. "Some Applications of Radial Plots." Journal of the American Statistical Association 89(428): 1232–42.
Givens, Geof H., David D. Smith, and Richard L. Tweedie. 1997. "Publication Bias in Meta-Analysis: A Bayesian Data-Augmentation Approach to Account for Issues Exemplified in the Passive Smoking Debate." Statistical Science 12(4): 221–50.
Gleser, Leon J., and Ingram Olkin. 1996. "Models for Estimating the Number of Unpublished Studies." Statistics in Medicine 15(23): 2493–507.
Hahn, Seokyung, Paula R. Williamson, and Jane L. Hutton. 2002. "Investigation of Within-Study Selective Reporting in Clinical Research: Follow-Up of Applications Submitted to a Local Research Ethics Committee." Journal of Evaluation in Clinical Practice 8(3): 353–59.
Hahn, Seokyung, Paula R. Williamson, Jane L. Hutton, Paul Garner, and E. Victor Flynn. 2000. "Assessing the Potential for Bias in Meta-Analysis Due to Selective Reporting of Subgroup Analyses Within Studies." Statistics in Medicine 19(24): 3325–36.
Halpern, Scott D., and Jesse A. Berlin. 2005. "Beyond Conventional Publication Bias: Other Determinants of Data Suppression." In Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments, edited by Hannah R. Rothstein, Alexander J. Sutton, and Michael Borenstein. Chichester: John Wiley & Sons.
Harbord, Roger M., Matthias Egger, and Jonathan A.C. Sterne. 2006. "A Modified Test for Small-Study Effects in Meta-Analyses of Controlled Trials with Binary Endpoints." Statistics in Medicine 25(20): 3443–57.
Hedges, Larry V. 1984. "Estimation of Effect Size under Nonrandom Sampling: The Effects of Censoring Studies Yielding Statistically Insignificant Mean Differences." Journal of Educational Statistics 9(1): 61–85.
Hedges, Larry V. 1992. "Modeling Publication Selection Effects in Meta-Analysis." Statistical Science 7(2): 246–55.
Hedges, Larry V., and Jack L. Vevea. 1996. "Estimating Effect Size Under Publication Bias: Small Sample Properties and Robustness of a Random Effects Selection Model." Journal of Educational and Behavioral Statistics 21(4): 299–333.
Hedges, Larry V., and Jack L. Vevea. 2005. "Selection Method Approaches." In Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments, edited by Hannah R. Rothstein, Alexander J. Sutton, and Michael Borenstein. Chichester: John Wiley & Sons.
Hopewell, Sally, Michael Clarke, and Sue Mallett. 2005. "Grey Literature and Systematic Reviews." In Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments, edited by Hannah R. Rothstein, Alexander J. Sutton, and Michael Borenstein. Chichester: John Wiley & Sons.
Hutton, Jane L., and Paula R. Williamson. 2000. "Bias in Meta-Analysis Due to Outcome Variable Selection Within Studies." Applied Statistics 49(3): 359–70.
Ioannidis, John P., Evangelina E. Ntzani, Thomas A. Trikalinos, and Despina G. Contopoulos-Ioannidis. 2001. "Replication Validity of Genetic Association Studies." Nature Genetics 29(3): 306–9.
Iyengar, Satish, and Joel B. Greenhouse. 1988. "Selection Models and the File Drawer Problem." Statistical Science 3(1): 109–35.
Jackson, Daniel. 2006. "The Implication of Publication Bias for Meta-Analysis' Other Parameter." Statistics in Medicine 25(17): 2911–21.
Jackson, Daniel, John Copas, and Alexander Sutton. 2005. "Modelling Reporting Bias: The Operative Reporting Rate for Ruptured Abdominal Aortic Aneurysm Repair." Journal of the Royal Statistical Society Series A 168(4): 737–52.
Lane, David M., and William P. Dunlap. 1978. "Estimating Effect-Size Bias Resulting from Significance Criterion in Editorial Decisions." British Journal of Mathematical and Statistical Psychology 31: 107–12.
Lau, Joseph, John P.A. Ioannidis, Norma Terrin, Christopher H. Schmid, and Ingram Olkin. 2006. "The Case of the Misleading Funnel Plot." British Medical Journal 333(7568): 597–600.
Little, Roderick J.A., and Donald B. Rubin. 1987. Statistical Analysis with Missing Data. New York: John Wiley & Sons.
Macaskill, Petra, Stephen D. Walter, and Lesley Irwig. 2001. "A Comparison of Methods to Detect Publication Bias in Meta-Analysis." Statistics in Medicine 20(4): 641–54.
Orwin, Robert G. 1983. "A Fail-Safe N for Effect Size in Meta-Analysis." Journal of Educational Statistics 8(2): 157–59.
Peters, Jaime L., Alexander J. Sutton, David R. Jones, Keith R. Abrams, and Lesley Rushton. 2006. "Comparison of Two Methods to Detect Publication Bias in Meta-Analysis." Journal of the American Medical Association 295(6): 676–80.
Peters, Jaime L., Alexander J. Sutton, David R. Jones, Keith R. Abrams, and Lesley Rushton. 2007. "Performance of the Trim and Fill Method in the Presence of Publication Bias and Between-Study Heterogeneity." Statistics in Medicine 26(25): 4544–62.
Pigott, Therese D. 2001. "Missing Predictors in Models of Effect Size." Evaluation and the Health Professions 24(3): 277–307.
Poole, Charles, and Sander Greenland. 1999. "Random-Effects Meta-Analyses Are Not Always Conservative." American Journal of Epidemiology 150(5): 469–75.
Rosenthal, Robert. 1979. "The 'File Drawer Problem' and Tolerance for Null Results." Psychological Bulletin 86: 638–41.
Schwarzer, Guido, Gerd Antes, and Martin Schumacher. 2002. "Inflation of Type I Error Rate in Two Statistical Tests for the Detection of Publication Bias in Meta-Analyses with Binary Outcomes." Statistics in Medicine 21(17): 2465–77.
Schwarzer, Guido, Gerd Antes, and Martin Schumacher. 2007. "A Test for Publication Bias in Meta-Analysis with Sparse Binary Data." Statistics in Medicine 26(4): 721–33.
Silliman, Nancy P. 1997. "Hierarchical Selection Models with Applications in Meta-Analysis." Journal of the American Statistical Association 92(429): 926–36.
Smith, Richard. 1999. "What Is Publication? A Continuum." British Medical Journal 318(7177): 142.
Song, Fujian, Alison Eastwood, Simon Gilbody, Lelia Duley, and Alexander J. Sutton. 2000. "Publication and Other Selection Biases in Systematic Reviews." Health Technology Assessment 4(10): 1–115.
Song, Fujian, Nick Freemantle, Trevor A. Sheldon, Allan House, Paul Watson, and Andrew Long. 1993. "Selective Serotonin Reuptake Inhibitors: Meta-Analysis of Efficacy and Acceptability." British Medical Journal 306(6879): 683–87.
Stanley, Tom D. 2005. "Beyond Publication Bias." Journal of Economic Surveys 19(3): 309–45.
Stern, Jerome M., and R. John Simes. 1997. "Publication Bias: Evidence of Delayed Publication in a Cohort Study of Clinical Research Projects." British Medical Journal 315(7109): 640–45.
Sterne, Jonathan A.C., and Matthias Egger. 2001. "Funnel Plots for Detecting Bias in Meta-Analysis: Guidelines on Choice of Axis." Journal of Clinical Epidemiology 54(10): 1046–55.
Sterne, Jonathan A.C., and Matthias Egger. 2005. "Regression Methods to Detect Publication and Other Bias in Meta-Analysis." In Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments, edited by Hannah R. Rothstein, Alexander J. Sutton, and Michael Borenstein. Chichester: John Wiley & Sons.
Sterne, Jonathan A.C., Betsy J. Becker, and Matthias Egger. 2005. "The Funnel Plot." In Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments, edited by Hannah R. Rothstein, Alexander J. Sutton, and Michael Borenstein. Chichester: John Wiley & Sons.
Sterne, Jonathan A.C., David Gavaghan, and Matthias Egger. 2000. "Publication and Related Bias in Meta-Analysis: Power of Statistical Tests and Prevalence in the Literature." Journal of Clinical Epidemiology 53(11): 1119–29.
Sterne, Jonathan A.C., Peter Jüni, Kenneth F. Schulz, Douglas G. Altman, Christopher Bartlett, and Matthias Egger. 2002. "Statistical Methods for Assessing the Influence of Study Characteristics on Treatment Effects in 'Meta-Epidemiological' Research." Statistics in Medicine 21(11): 1513–24.
Sutton, Alexander J., and Therese D. Pigott. 2004. "Bias in Meta-Analysis Induced by Incompletely Reported Studies." In Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments, edited by Hannah R. Rothstein, Alexander J. Sutton, and Michael Borenstein. Chichester: John Wiley & Sons.
Sutton, Alexander J., Keith R. Abrams, and David R. Jones. 2002. "Generalized Synthesis of Evidence and the Threat of Dissemination Bias: The Example of Electronic Fetal Heart Rate Monitoring (EFM)." Journal of Clinical Epidemiology 55(10): 1013–24.
Terrin, Norma, Christopher H. Schmid, and Joseph Lau. 2005. "In an Empirical Evaluation of the Funnel Plot, Researchers Could Not Visually Identify Publication Bias." Journal of Clinical Epidemiology 58(9): 894–901.
Terrin, Norma, Christopher H. Schmid, Joseph Lau, and Ingram Olkin. 2003. "Adjusting for Publication Bias in the Presence of Heterogeneity." Statistics in Medicine 22(13): 2113–26.
Thompson, Simon G., and Stephen J. Sharp. 1999. "Explaining Heterogeneity in Meta-Analysis: A Comparison of Methods." Statistics in Medicine 18(20): 2693–708.
Trikalinos, Thomas A., and John P.A. Ioannidis. 2005. "Assessing the Evolution of Effect Sizes Over Time." In Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments, edited by Hannah R. Rothstein, Alexander J. Sutton, and Michael Borenstein. Chichester: John Wiley & Sons.
Vevea, Jack L., and Larry V. Hedges. 1995. "A General Linear Model for Estimating Effect Size in the Presence of Publication Bias." Psychometrika 60(3): 419–35.
Vevea, Jack L., and Carol M. Woods. 2005. "Publication Bias in Research Synthesis: Sensitivity Analysis Using A Priori Weight Functions." Psychological Methods 10(4): 428–43.
Vevea, Jack L., Nancy C. Clements, and Larry V. Hedges. 1993. "Assessing the Effects of Selection Bias on Validity Data for the General Aptitude Test Battery." Journal of Applied Psychology 78(6): 981–87.
Williamson, Paula R., and Carol Gamble. 2005. "Identification and Impact of Outcome Selection Bias in Meta-Analysis." Statistics in Medicine 24(10): 1547–61.
24
ADVANTAGES OF CERTAINTY AND UNCERTAINTY

WENDY WOOD, Duke University
ALICE H. EAGLY, Northwestern University

CONTENTS

24.1 Introduction 456
24.2 Estimating Relations Between Variables 457
    24.2.1 Implications of Relations Held with Certainty 459
    24.2.2 Uncertainty Attributable to Unexplained Heterogeneity 459
    24.2.3 Uncertainty Attributable to Lack of Generalizability 460
24.3 Evaluating Moderators and Mediators of an Effect 462
    24.3.1 Moderators of Main-Effect Findings 462
    24.3.2 Strategies Given Moderators of Low Certainty 463
    24.3.3 Mediators of Main-Effect Findings 465
    24.3.4 Strategies Given Low Certainty About Mediators 467
24.4 Using Main Effects, Moderators, and Mediators to Test Theories Competitively 467
    24.4.1 Mechanisms of Competitive Theory Testing 468
    24.4.2 Strategies Given Low Certainty About Competing Theories 469
24.5 Uncertain and Certain Findings Direct Subsequent Research 469
24.6 Conclusion 470
24.7 References 471
24.1 INTRODUCTION

A research synthesis typically is not an endpoint in the investigation of a topic. Rarely does a synthesis offer a definitive answer to the theoretical or empirical question that inspired the investigation. Instead, most research syntheses serve as way stations along a sometimes winding research path. The goal is to describe the status of a research literature by highlighting what is unknown as well as what is known. It is the unknowns that are especially likely to suggest useful directions for new research.

The purpose of this chapter is to delineate the possible relations between research syntheses, theory development, and future empirical research. By indicating the weaknesses as well as the strengths of existing research, a synthesis can channel thinking about research and theory in directions that would improve the next generation of empirical evidence and theoretical development. In general, scientists' beliefs about the next steps for empirical research and theorizing following a synthesis depend on the level of certainty that they accord its findings (see Cooper and Rosenthal 1980). In this context, certainty refers to the confidence with which the scientific community accepts empirical associations between variables and the theoretical explanations for these associations.

As we explain in this chapter, the certainty accorded to findings is influenced by the perceived validity of the research. In a meta-analysis, invalidity can arise at the level of the underlying studies as well as at the meta-analytic, aggregate level. Threats to validity, as William Shadish, Thomas Cook, and Donald Campbell defined them, reduce scientists' certainty about the empirical relations and theoretical explanations considered in a meta-analysis, and this uncertainty in turn stimulates additional research (2002). Empirical relations and theoretical explanations in research syntheses range from those that the scientific community can accept with high certainty to those it can hold with little certainty only. Highly certain conclusions suggest that additional research of the kind evaluated is not required to substantiate the focal effect or validate the theoretical explanation, whereas less certain conclusions suggest a need for additional research.

Our proposal to evaluate the uncertainty that remains in empirical relations and in interpretations of those relations challenges more standard ways of valuing research findings. Generally, evaluation of research, be it by journal editors or other interested scientists, favors valid empirical findings and well-substantiated theories. In this conventional approach, science progresses through the cumulation of well-supported findings and theories. Although it is standard practice for researchers to call for additional investigation at the end of a written report, such requests are often treated as rhetorical devices that have limited meaning in themselves. In our experience, editors' publication decisions and manuscript reviewers' evaluations are rarely influenced by such indicators of sources of uncertainty that could be addressed in future investigation.

Following scientific convention, research syntheses would be evaluated favorably to the extent that they produce findings and theoretical statements that appear to contribute definitively to the canon of scientific knowledge. In striving to meet such criteria, the authors of syntheses might focus on what is known in preference to what is unknown. In contrast, we believe that, by highlighting points of invalidity in a research literature, research syntheses offer an alternative, generative route for scientific progress. This generative route does not produce valid findings and theoretical statements but instead identifies points of uncertainty and thus promising avenues for subsequent research and theorizing that can reduce the uncertainty. When giving weight to the generative contribution of a synthesis, journal editors and manuscript reviewers would consider how well a synthesis frames questions. By this metric, research syntheses can contribute significantly to scientific progress even when the empirical findings or theoretical understanding of them can be accorded only limited certainty.

Certainty is not attached to a synthesis as a whole but rather to specific claims, such as generalizations, which vary in their truth value. This chapter will help readers identify where uncertainty is produced in synthesis findings and how it can be addressed in future empirical research and theorizing. We evaluate two goals for research syntheses and consider the guidance that they provide to subsequent research and theory. One is to integrate studies that examined a relation between variables in order to establish the size and direction of the relation. Another is to identify conditions that modify the size of the relation between two variables or processes that mediate the relation of interest.

Meta-analyses that address the first goal are designed to aggregate results pertaining to relations between variables so as to assess the presence and magnitude of those relations in a group of studies. Hence these investigations include a mean effect size and confidence interval and a test of the null hypothesis, and consider whether the relation
might be artifactual. In these analyses, the central focus is on establishing the direction and magnitude of a relationship. This approach is deductive to the extent that the examined relation is predicted by a theoretical account. It is inductive to the extent that syntheses are designed to establish an empirical fact, as, for example, in some meta-analytic investigations of sex differences and of social interventions and medical and psychological treatments. As we explain, uncertainty in such syntheses could arise in the empirical relation itself, as occurs when the outcomes of available studies are inconclusive about the existence of a relationship, as well as in ambiguous explanations of the relation of interest.

Meta-analyses that address the second goal use characteristics of studies to identify moderators or mediators of an empirical relation. Moderating variables account for variability in the size and direction of a relation between two variables, whereas mediators intervene between a predictor and outcome variable and represent a causal mechanism through which the relation occurs (see Kenny, Kashy, and Bolger 1998). Like investigations of main effects, the investigation of moderators and mediators is deductive when the evaluated relations are derived from theories, and inductive when the moderators and mediators are identified by examining the findings of the included studies. Uncertainty in these meta-analyses can emerge in the examined relations and in the explanations of findings yielded by tests of moderators and mediators.

We conclude with a discussion of how research syntheses use meta-analytic estimates of mean effect sizes and evaluations of mediators and moderators in order to test competing theories. Just as testing theories about psychological and social processes lends primary research its complexity and richness, meta-analyses also can be designed to examine the plausibility of one theoretical account over others. Meta-analyses with this orientation proceed by examining the overall pattern of empirical findings in a literature, including the relation of interest and potential mediators and moderators of it, with the purpose of testing competing theoretical propositions. Uncertainty in this type of meta-analysis emerges to the extent that the data evaluated cannot firmly discriminate between the theories based on their empirical support.

24.2 ESTIMATING RELATIONS BETWEEN VARIABLES

For many meta-analyses, a critical focus is to demonstrate that two variables are related. This relation is known as a main effect when the impact of one variable on another is evaluated independently of other possible predictors. Of course, some single syntheses report multiple sets of relationships between variables. For example, in a synthesis of the predictors of career success, Thomas Ng and his colleagues related eleven human capital variables, four organizational sponsorship variables, four sociodemographic variables, and eight stable individual differences variables to three forms of career success: salary, promotion, and career satisfaction (2005). Each of these relationships, for example, between organizational tenure and salary, could be considered a meta-analysis in and of itself. Including all of these meta-analyses in a single article allows readers to compare the success with which many different variables predict a set of outcomes.

Syntheses designed to detect empirical relations in these ways yield estimates that can be accepted with high degrees of certainty when the relation between the variables is consistent across the included studies. Such syntheses provide a relatively precise estimate of the magnitude of the relation in the specified population, regardless of whether that magnitude is large, small, or even null. The synthesist might first calculate a confidence interval around the mean effect to determine whether it is statistically significant, as indicated by an interval that does not include zero. He or she also would assess the consistency of the findings by computing one of the homogeneity tests of meta-analytic statistics (see chapter 14, this volume). Failure to reject the hypothesis of homogeneity suggests that study findings estimate a common population effect size. Certainty about the conclusions from the estimate of the size, the confidence interval, and the test of homogeneity of effects requires that these calculations be based on enough studies to provide statistically powerful tests. Moreover, the studies should represent the various conditions under which the empirical relation is expected to obtain, including different laboratories, subject populations, and measuring instruments. Findings that are homogeneous given adequate power and an appropriate range of conditions suggest an empirical result that can be accepted with some certainty.
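Because the reasoning here leans on the homogeneity test, a minimal Python sketch of the usual Q statistic may be useful; this is the standard fixed-effect calculation treated formally in chapter 14 of this volume, with a function name of our own choosing.

```python
def q_statistic(effects, variances):
    """Cochran's Q: the weighted sum of squared deviations of study
    effect sizes from the fixed-effect pooled mean. Under homogeneity,
    Q is approximately chi-square with k - 1 degrees of freedom."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    q = sum(w * (d - pooled) ** 2 for w, d in zip(weights, effects))
    df = len(effects) - 1
    # A p-value follows from the chi-square distribution, for example,
    # scipy.stats.chi2.sf(q, df); failing to reject homogeneity suggests
    # the studies estimate a common population effect size.
    return q, df
```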
The specific inferences that can be made from homogeneous effect sizes depend on whether the analytic strategy involves fixed- or random-effects models for estimating effect size parameters (see chapters 15 and 16, this volume). One error model involves fixed-effect calculations in which studies in a synthesis differ from each other only in error attributable to the sampling of units, such as participants, within each study. Fixed-effect model results
are generalizable only to studies similar to those included in the synthesis. Random-effects models offer the alternative assumption that studies in the meta-analysis are drawn randomly from a larger population of relevant research. In this case, variability arises from the sampling of studies from a broader population of studies as well as from the sampling of participants within studies. Random-effects model results can be generalized to the broader population of studies from which the included studies were drawn. Thus, certainty concerning a homogeneous effect can be held with respect to the specific studies in a meta-analysis (fixed-effects models) or a broader set of possible participants, settings, and measures that represent the full population of potential research studies (random-effects models).
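The difference between the two error models shows up directly in how the pooled estimate is computed. The following Python sketch contrasts the fixed-effect inverse-variance average with a random-effects average that adds a between-study variance component, here the DerSimonian-Laird moment estimator; the function names are ours, and this is an illustration rather than a substitute for the full treatment in chapters 15 and 16.

```python
def pooled_fixed(effects, variances):
    """Fixed-effect estimate: the inverse-variance weighted mean."""
    w = [1.0 / v for v in variances]
    return sum(wi * d for wi, d in zip(w, effects)) / sum(w)

def pooled_random(effects, variances):
    """Random-effects estimate: the between-study variance tau^2
    (DerSimonian-Laird moment estimator) is added to each study's
    variance before reweighting, broadening the inference to a
    population of studies."""
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * d for wi, d in zip(w, effects)) / sw
    q = sum(wi * (d - fixed) ** 2 for wi, d in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)
    w_star = [1.0 / (v + tau2) for v in variances]
    return sum(wi * d for wi, d in zip(w_star, effects)) / sum(w_star)
```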
Some meta-analyses not only address the size of a relation but also present the relation in specific metrics that can be interpreted in real-world terms. Pharmacological investigations, for example, sometimes report dose-response relationships that describe the specific outcomes obtained at different dosage levels of a drug or treatment. Even if no single study explored multiple treatment levels, studies with similar magnitudes of a treatment can be grouped and comparisons made across magnitude groupings to identify specific dose-response relations. Illustrating this approach, Luc Dauchet, Philippe Amouyel, and Jean Dallongeville meta-analyzed seven studies that estimated the effects of fruit and vegetable consumption on risk of stroke across an average of 10.7 years (2005). For studies that assessed fruit consumption, each additional portion of fruit per day was associated with an 11 percent reduction in the risk of stroke. For studies that assessed fruit and vegetable consumption, each additional portion was associated with a 5 percent reduction, and for studies that assessed vegetable consumption, each portion was associated with only a 3 percent reduction. Furthermore, fruit consumption, though not vegetable or fruit and vegetable combined, had a linear relation to stroke risk, consistent with a causal effect of dose on response. The reported homogeneity in effects can be interpreted with some certainty despite the small number of studies, because each study used large numbers of respondents, thereby providing relatively stable estimates, and because the studies included samples from Europe, Asia, and America, and therefore the different diets typical of each region. Adding to the apparent validity of study findings, the effects for fruit consumption remained even after controlling for a number of lifestyle stroke risk factors. Given that the results can be accepted with some certainty, additional research to estimate the relation in comparable populations and settings is not likely to provide further insight. Nonetheless, the specific protective mechanism of fruit consumption was not identified, and the synthesists noted the need for additional research to identify the protective mechanism involved. As we explain in section 24.3, this issue could be addressed in syntheses or in primary research identifying mediators and moderators of the effects of food consumption on stroke risk.

Certainty also is accorded to the slightly more complex outcome in which homogeneity is obtained after the removal of a small number of outlying effect sizes. The empirical relation could still warrant high certainty if homogeneous findings emerged in all except a few studies. Given the vagaries of human efforts in science, a few discrepant outcomes should not necessarily shake one's confidence in a coherent underlying effect. For example, Alex Harris synthesized the literature on expressive writing about stressful experiences and identified one anomalously large effect in the fourteen available studies (2006). After it was removed, a homogeneous result was obtained such that writing about experiences significantly reduced health-care use in healthy samples. Sometimes synthesists can detect the reasons for aberrant findings, such as atypical participant samples or research procedures. Other times, as with Harris's synthesis, the reasons for the discrepancy are not clear. In section 24.2.2, we explain that future research would be useful to identify why some outcomes do not fit the general pattern. In general, findings that are homogeneous except for a few outlying effect sizes suggest that further research in the same paradigm would be redundant, especially if the causes of the discrepant findings are discernible and irrelevant to the theories or practice issues under consideration.

Clear outcomes of the types we have discussed were obtained in Marinus van Ijzendoorn, Femmie Juffer, and Caroline Poelhuis's investigation of the effects of adoption on cognitive development and intellectual abilities (2005). This policy-relevant question motivated a focus on effect magnitude. When these synthesists compared the adopted children with their unadopted peers left behind in an institution or birth family, the adopted children showed marked gains in IQ and school achievement. These effects were significantly different from zero and proved homogeneous across the included studies. Thus, adopted children were cognitively advanced compared with their unadopted peers, and this conclusion can be accepted with some certainty given the conditions
Whether the cognitive benefits of adoption were enough to advance achievement levels to equal those of environmental peers, such as siblings in the adopting family, was less clear, however. As we explain in section 24.3, the uncertainty associated with this aspect of the findings indicates a need for additional investigation.

24.2.1 Implications of Relations Held with Certainty

Meta-analytic investigations produce empirical relations associated with high certainty when they meet the noted criteria, that is, when they have ample power to detect heterogeneity and when they aggregate across a representative sampling of settings, participants, and methods for the particular research question. In such cases, meta-analyses provide guidance for estimating the expected magnitude of effects in subsequent primary studies and meta-analyses. Researchers armed with knowledge of the size of effect typical in a research paradigm, for example, can use appropriate numbers of participants in subsequent studies to ensure adequate power to detect effects of that size. Especially useful for allowing researchers to gauge the magnitude of effects are syntheses that aggregate the findings of multiple existing meta-analyses. For example, F. D. Richard, Charles Bond, and Juli Stokes-Zoota compiled 322 meta-analyses that had addressed social psychological phenomena (2003). Their estimates suggest the effect size that can be expected in particular domains of research. Another aggregation of meta-analytic effects, reported by David Wilson and Mark Lipsey, investigated how various features of study methodology influenced the magnitude of effects in research on psychological, behavioral, and educational interventions (2001; Lipsey and Wilson 1993). Their estimates across more than 300 meta-analyses revealed the outcomes that could be expected from methodological features of different research designs.

When meta-analyses provide estimates held with a high degree of certainty, they can inform future research by providing information for power analyses relevant to follow-up in primary research as well as in meta-analytic investigations (Cohen 1988; Hedges and Pigott 2001). Once the expected effect size is known, power calculations can identify with some precision the sample size needed to detect effects of a given size in future research using similar operations and subject samples. With respect to subsequent meta-analyses, power calculations for tests of effect magnitude indicate the required number of studies to detect an effect, given the average number of participants per study in that literature. Meta-analyses producing highly certain conclusions also provide heterogeneity estimates useful for calculating the number of studies needed to detect heterogeneity in subsequent syntheses. In this way, effects held with high certainty provide standards of magnitude and variability that can inform subsequent research designed to estimate empirical relations.
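To make this logic concrete, the following sketch implements the two calculations just described, using standard normal approximations in the spirit of Cohen (1988) and Hedges and Pigott (2001). The effect sizes, sample sizes, and function names are hypothetical illustrations, not values drawn from any synthesis discussed in this chapter.

    import numpy as np
    from scipy.stats import norm

    def n_per_group(d, alpha=0.05, power=0.80):
        """Approximate per-group n for a primary study to detect effect size d."""
        z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
        return int(np.ceil(2 * ((z_a + z_b) / d) ** 2))

    def meta_power(d, n_per_arm, k, alpha=0.05):
        """Power of the fixed-effect test of a mean effect d across k studies."""
        v_i = 2 / n_per_arm + d ** 2 / (4 * n_per_arm)  # sampling variance of one d
        se_mean = np.sqrt(v_i / k)                      # SE of the weighted mean effect
        z_crit = norm.ppf(1 - alpha / 2)
        lam = d / se_mean                               # noncentrality of the test
        return norm.cdf(lam - z_crit) + norm.cdf(-lam - z_crit)

    print(n_per_group(0.40))         # roughly 99 participants per group
    print(meta_power(0.20, 40, 15))  # power of a hypothetical fifteen-study synthesis

The same arithmetic, run in reverse, tells a prospective synthesist how many studies of typical size a literature must contain before a mean effect of a given magnitude can be detected reliably.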
24.2.2 Uncertainty Attributable to Unexplained Heterogeneity

Meta-analyses designed to identify relations between variables often find that the included effect sizes are heterogeneous and that homogeneity is not established by removal of relatively few outlying effect sizes. In other words, studies have produced heterogeneous estimates of a given relation. Significant heterogeneity that arises from unexplained differences between findings threatens statistical conclusion validity, defined as the accuracy with which investigators can estimate the presence and strength of the relation between two variables (Shadish, Cook, and Campbell 2002). Heterogeneity also threatens the construct validity of a meta-analyzed relation, defined as how accurately generalizations can be made about theoretical constructs on the basis of research operations. Threats to construct validity lower confidence in the theoretical interpretation of the results, because it is unclear whether the various research findings represent a common underlying phenomenon.

Illustrating the lowered confidence that comes from threats to construct validity is Daphne Oyserman, Heather Coon, and Marcus Kemmelmeier's meta-analysis of the cultural orientations of individualism and collectivism (2002). The synthesists compared three methods for operationalizing these constructs: direct assessments by psychological scales administered to individuals, use of the culture or country values of these constructs as identified by Geert Hofstede (1980), and priming by experimental manipulations to enhance the accessibility of one or the other construct. The three operations did not always produce the same effects, and, when common effects did arise, they could not with any certainty be attributed to the same psychological mechanism. Given this pattern of findings, Oyserman and her colleagues wisely called for research and theorizing that differentiated among the various meanings of individualism and collectivism and that developed appropriate operations to assess each of these.
Whether many early syntheses produced unexplained variability in effect sizes is unknown, because these syntheses often did not report homogeneity statistics. However, computer software now provides synthesists with ready access to such techniques for assessing the uniformity of synthesized findings. Correspondingly, standards have changed so that most published meta-analyses now contain reports of the variability in study outcomes and use this information to interpret results.

When faced with significant unexplained variability in effects, meta-analysts have several options. These begin with whether fixed- or random-effects models will be used to estimate effect size parameters (see chapter 3, this volume). When analyses were initially conducted using a fixed-effects model, one option is to rethink the assumptions of the analysis and, if appropriate, to use a random-effects approach that assumes more sources of variance but still a single underlying effect size. With random-effects models, the mean effect sizes, confidence intervals, significance tests, and homogeneity analyses are all computed with an error estimate that reflects the random variability arising from both within-studies and between-studies sources. Thus, even if the hypothesis of homogeneity is rejected in a fixed-effects model, the data may prove consistent with the assumption of a single underlying effect size in a random-effects model, and the conceptual interpretation of the effect can be held with some certainty.
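The following sketch shows what such a re-analysis looks like in practice: a fixed-effect homogeneity test followed by a DerSimonian-Laird random-effects estimate. The effect sizes and variances are invented solely for illustration.

    import numpy as np
    from scipy.stats import chi2

    d = np.array([0.52, 0.18, 0.41, -0.05, 0.33, 0.60])  # hypothetical study effects
    v = np.array([0.04, 0.02, 0.05, 0.03, 0.02, 0.06])   # their sampling variances

    w = 1 / v                                    # fixed-effect weights
    mean_fe = np.sum(w * d) / np.sum(w)
    Q = np.sum(w * (d - mean_fe) ** 2)           # homogeneity statistic
    df = len(d) - 1
    p_Q = chi2.sf(Q, df)

    # DerSimonian-Laird estimate of the between-studies variance (tau^2)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - df) / c)

    w_re = 1 / (v + tau2)                        # random-effects weights
    mean_re = np.sum(w_re * d) / np.sum(w_re)
    se_re = np.sqrt(1 / np.sum(w_re))

    print(f"Q = {Q:.2f} (p = {p_Q:.3f}), tau^2 = {tau2:.3f}")
    print(f"fixed mean = {mean_fe:.3f}; random mean = {mean_re:.3f} (SE {se_re:.3f})")

If the random-effects confidence interval still excludes zero once the between-studies variance has been absorbed, the conceptual interpretation of the mean effect becomes correspondingly more tenable.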
A second, separate option for achieving an interpretable empirical relation when study outcomes are heterogeneous is to evaluate features of the research that moderate, or covary with, the effect of interest. Moderator analyses are common within both fixed-effects and random-effects analyses. As we explain in section 24.3, when the systematic variability in the study findings can be satisfactorily explained by including such third variables, relatively high certainty can be accorded the obtained empirical relation as it is moderated by another variable or set of variables.

24.2.3 Uncertainty Attributable to Lack of Generalizability

Even when a meta-analysis produces an empirical relation for which the homogeneity of effects cannot be rejected, high certainty should be attached only within the limits of the included studies. Typically, these limits are not extremely narrow, because most meta-analyses aggregate across multiple operationalizations of constructs, multiple periods, and multiple samples. They thus possess greater external validity than any single study. External validity refers to the generalizability of an effect to other groups of people or other settings (Shadish, Cook, and Campbell 2002).

The studies within a synthesis may, however, share one or more limitations of their design, execution, or reporting. For example, they may all use the same limited participant population (such as whites or males), or they may all be conducted in a single setting (such as experimental laboratories). Given such sample limitations, the external validity of the empirical relations demonstrated by the synthesis may be compromised.

Limitations in the sample of studies that threaten external validity can be revealed by presenting study attributes in tables or other displays to show their distribution (see chapter 26, this volume). Systematic presentation of study attributes also can reveal limits in the methods of operationalizing variables within a literature. Because it is rare that any one method of operationalizing a variable completely captures the meaning of a given concept, syntheses generally possess poor construct validity when studies uniformly use a particular measurement technique or experimental manipulation (see chapter 2, this volume). Given that the resulting meta-analytic findings may be limited to that particular operationalization of the construct and may not be obtained otherwise, they should be accepted with little certainty.

Meta-analyses can usefully guide subsequent research by revealing uncertain conclusions due to limited variations in study design and implementation or to limited samples intended to represent broader target populations. Subsequent primary research might examine whether the effect extends to underinvestigated populations of participants and research settings. Emphasizing this concern, John Weisz, Amanda Doss, and Kristin Hawley noted that a number of meta-analyses of youth psychotherapy have documented substantial evidence of success in treating diverse problems and disorders (2005). The empirical base for such conclusions remains somewhat limited, however: 60 percent of the studies Weisz and his colleagues identified failed to report sample ethnicity (2005), and those that did report it included approximately 57 percent white participants, 27 percent black, 5 percent Latino, 1 percent Asian, and 9 percent other ethnicities. Also, more than 70 percent of the studies provided no information about sample income or socioeconomic status.
Thus it remains uncertain whether the treatment gains documented in recent meta-analyses of youth psychotherapy generalize across ethnic groups and the socioeconomic spectrum.

Another reason that meta-analyses may have limited certainty is that synthesists have confined their tests to one or a small subset of the possible operationalizations of conceptual variables. These omissions can help to guide subsequent research. Even though high certainty might be achieved with the specific operations available, follow-up research could be designed to represent conceptual variables even more optimally. This strategy is illustrated by Marinus van Ijzendoorn and his colleagues' meta-analysis of the cognitive developmental effects of adoption on children (2005). Some of their estimates of the effects compared an adopted sample's IQ with the population value of 100 for IQ. However, this population value may be an underestimate, because intelligence tests tend to become outdated as people from Western countries increasingly profit from education and information and thus raise the population value beyond the presumed average score of 100. As a result, population IQ values may underestimate actual peer IQ. The synthesists speculated that this phenomenon explains the apparent presence of developmental delays when comparing adoptees with contemporary peers and its absence when comparing them with population values. Given this validity threat, subsequent research would need to assess peer intelligence in relation to IQ norms obtained from contemporaneous peers, such as siblings in the adopting family.

Sample limitations that threaten the external validity of meta-analytic findings also can arise from meta-analysts' inadequate searching of the available research literature or from limits in the available literature stemming from biases in publication. Search procedures are likely to be inadequate when synthesists search only easily accessible studies. Typically, accessibility is greatest for studies in the published literature and, at least for Western scientists, studies in the English-language press. A wider selection of databases can yield a more representative sample of studies. Also, the most common databases may not access the grey literature (or fugitive literature) of studies that have not been published in journals (see chapter 6, this volume). Such literature includes doctoral dissertations, many of which remain unpublished because of statistically nonsignificant findings. Unpublished studies also can be accessed, at least in part, by locating reports of studies presented at conferences. Ordinarily, conference presentations include studies that will never be submitted for publication, perhaps because they do not meet the high standards of journal publication. By accessing this grey or fugitive literature, synthesists can obtain a more representative sample of the studies that have been conducted.

Even if multiple databases are searched and grey or fugitive literature is included, some studies remain inaccessible because their findings were never written up, generally because those findings were not promising. Researchers' anticipation of publication bias in favor of statistically significant findings lowers their motivation to write up studies (Greenwald 1975) and, even without the operation of a real publication bias in the decisions of manuscript reviewers or editors, produces a bias in the sample of available studies. To address this issue of "file-drawer" studies, meta-analysts wisely turn to quantitative methods of detecting publication bias (for example, Kromrey and Rendina-Gobioff 2006; see also chapter 23, this volume). In these ways, limitations in search strategies or in the sample of studies evaluated can produce uncertainty in the conclusions drawn about synthesis effects.
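One widely used quantitative check of this kind is Egger's regression test for funnel-plot asymmetry, sketched below with fabricated effect sizes; a nonzero intercept suggests that small studies report systematically larger effects, as expected under publication bias. The numbers, like the variable names, are purely illustrative.

    import numpy as np
    from scipy import stats

    d = np.array([0.80, 0.55, 0.45, 0.30, 0.28, 0.22, 0.20])   # hypothetical effects
    se = np.array([0.40, 0.30, 0.25, 0.15, 0.12, 0.10, 0.08])  # their standard errors

    # Egger's test: regress the standardized effect on precision and test
    # whether the intercept departs from zero.
    z, precision = d / se, 1 / se
    X = np.column_stack([np.ones_like(precision), precision])
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ beta
    s2 = resid @ resid / (len(z) - 2)
    cov = s2 * np.linalg.inv(X.T @ X)
    t0 = beta[0] / np.sqrt(cov[0, 0])
    p0 = 2 * stats.t.sf(abs(t0), df=len(z) - 2)
    print(f"Egger intercept = {beta[0]:.2f}, p = {p0:.3f}")

Such tests detect asymmetry, not its cause, so a significant result flags uncertainty rather than proving that studies are missing.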
A different problem that creates uncertainty about an empirical relation can arise when a meta-analysis uncovers evidence of serious methodological weaknesses in a research literature. Under such circumstances, synthesists may decide that a portion of the studies in the research provided invalid tests of the hypothesis of interest and thus contaminate the population of studies that provided valid tests. This situation arose in Alice Eagly and her colleagues' (1999) examination of the classic hypothesis that memory is better for information that is congenial with people's attitudes (proattitudinal information) than for information that is uncongenial (counterattitudinal information). Despite several early confirmations of this hypothesis, many later researchers failed to replicate the finding. The meta-analysis produced an overall congeniality effect but also evidence of methodological weaknesses in the assessment of memory in many early studies (for example, failure to use masked coding of free-recall memory measures). Because of the incomplete reporting of measurement details in many studies, it was unclear whether removal of these methodological sources of artifact would leave even a small congeniality effect. In follow-up experiments that corrected the methodological weaknesses, Eagly and other colleagues subsequently obtained consistently null congeniality effects under a wide range of conditions, thus clarifying the findings of the earlier meta-analysis (2000).
Process data revealed that thoughtful counterarguing of uncongenial messages enhanced memory for them, despite their unpleasantness and inconsistency with prior attitudes.

In summary, the results of meta-analyses designed to evaluate relations between two variables can guide subsequent research in various ways, depending on the certainty accorded to the synthesis findings. Effects interpreted with high certainty provide a benchmark for effective design of subsequent research, especially for identifying appropriate sample sizes for detecting effects in subsequent designs. Effects interpreted with limited certainty highlight the need for additional research to address the specific aspect of invalidity that threatens certainty. When certainty is lowered by poor external validity, subsequent research might establish that the effect occurs in new samples and settings. When certainty is lowered by poor construct validity, subsequent research might demonstrate that the effect emerges with multiple operations, especially those that are the most appropriate representations of the underlying constructs.

24.3 EVALUATING MODERATORS AND MEDIATORS OF AN EFFECT

Many meta-analyses are designed to examine not just a simple relation between two variables but also the role of additional variables that may moderate or mediate that relation. Such investigations have an empirical focus to the extent that they inductively evaluate patterns of relations among variables and a theoretical focus to the extent that they deductively examine theoretically relevant variables that might moderate or mediate the relation of interest. As with a meta-analysis focused on the relation between two variables, the certainty of the conclusions about moderators or mediators depends on the validity of the demonstrated empirical relations as well as on the theoretical interpretation of those relations.

Interpretations of empirical relations should be accorded little certainty when moderator or mediator variables are identified from post hoc or unanticipated discoveries in the data analysis. Uncertainty also is attached to moderating and mediating variables that are not sufficient to account for heterogeneity in study outcomes. When unexplained heterogeneity remains in predictive models, the certainty of scientists' theoretical interpretations declines because of threats to the construct validity of the moderating and mediating variables.

24.3.1 Moderators of Main-Effect Findings

In meta-analysis, as in primary research, findings can reflect post hoc decisions that are organized by patterns that emerge in the data or can reflect theoretically grounded, a priori decisions. In one post hoc method, potential moderators of effect sizes are identified empirically from the study outcomes themselves. For example, Larry Hedges and Ingram Olkin suggested clustering effect sizes of similar magnitude and then evaluating the clusters of outcomes for study characteristics that might covary with the groupings (1985).

Illustrating this clustering approach is Donald Sharpe and Lucille Rossiter's analysis of the psychological functioning of the siblings of children with a chronic illness (2002). The synthesists noted an overall negative effect on sibling functioning along with significant heterogeneity in study outcomes that was not accounted for by a priori predictors. They empirically clustered studies into illness types, differentiating between those illnesses associated with statistically significant debilitating effects and those that obtained the contrary finding of a positive but nonsignificant effect. The illnesses yielding debilitating effects (cancer, diabetes, anemia, bowel disease) tended to be treated with intrusive, ongoing methods that would likely be more restrictive of siblings' school and play activities than the illnesses yielding null effects (cardiac abnormalities, craniofacial anomalies, infantile hydrocephalus). The post hoc classification of studies into illness types, however, renders these moderator findings somewhat uncertain and suggests the need for follow-up research to validate the moderating relation.
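A moderator analysis of this kind is often formalized by partitioning the fixed-effect homogeneity statistic into between-group and within-group components, as in the sketch below. The effect sizes and the two-group coding are invented; they merely echo the structure of a clustering analysis such as Sharpe and Rossiter's, not its data.

    import numpy as np
    from scipy.stats import chi2

    d = np.array([-0.45, -0.30, -0.52, 0.10, 0.05, 0.12])  # hypothetical effects
    v = np.array([0.03, 0.04, 0.05, 0.04, 0.03, 0.05])     # sampling variances
    group = np.array([0, 0, 0, 1, 1, 1])                   # 0 = one illness type, 1 = other

    w = 1 / v

    def q_stat(dd, ww):
        m = np.sum(ww * dd) / np.sum(ww)
        return np.sum(ww * (dd - m) ** 2)

    Q_total = q_stat(d, w)
    Q_within = sum(q_stat(d[group == g], w[group == g]) for g in np.unique(group))
    Q_between = Q_total - Q_within  # heterogeneity explained by the moderator
    p = chi2.sf(Q_between, df=len(np.unique(group)) - 1)
    print(f"Q_between = {Q_between:.2f}, p = {p:.4f}")

A significant Q_between indicates that the grouping accounts for real between-studies variability; a substantial remaining Q_within signals the kind of unexplained heterogeneity discussed below.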
The distinction between a priori predictions and post hoc accounts of serendipitous relations is often not sharp in practice, either because investigators portray post hoc findings as a priori or because they have theoretical frameworks flexible enough to encompass a wide range of obtained relations. Nonetheless, moderator analyses not derived from theory can be unduly influenced by chance data patterns and thereby threaten statistical conclusion validity. Even though meta-analytic statistical techniques can adjust for the fact that post hoc tests capitalize on chance variability in the data, less certainty should be accorded to the empirical relations established with moderators derived from post hoc tests than to those developed within an explicit theoretical framework (for post hoc comparisons, see Hedges and Olkin 1985). This low level of certainty points to the need for follow-up research to validate the moderating relation in question.
Moderators also have uncertain empirical status when they are unable to account for all of the variability in a given empirical relation. Even Sharpe and Rossiter's empirical approach, by which they identified the type-of-illness moderator from study outcomes, failed to account adequately for the variability across the included studies (2002). The remaining unexplained variability within the moderator groupings provides an additional reason for follow-up research.

As we argued with respect to the estimation of relations between variables, uncertainty in the empirical evidence for an effect inherently conveys uncertainty in interpretation. However, uncertainty in interpretation can emerge from a variety of sources in addition to empirical uncertainty. Uncertain interpretations of moderating variables could emerge even when moderators are identified a priori and successfully account for the empirical variability in study outcomes. This kind of conceptual ambiguity is especially likely when moderators are identified on a between-studies basis, that is, when some studies represent one level of the moderator and others represent other levels. With this strategy, the moderator variable typically is an attribute of a whole research report, and participants in the primary research have not been randomly assigned to levels of the moderator. Analyses of such synthesis-generated variables (see chapter 2, this volume) do not have strong statistical conclusion validity, because the moderator variable may be confounded with other attributes of studies that covary with it.

Moderators identified on a between-studies basis can be contrasted with within-study moderator analyses, in which two or more levels of the moderator are present in individual studies. Analyses of within-study moderators can be interpreted with greater certainty, especially when researchers manipulated the moderators in the underlying primary studies and participants were assigned randomly to levels of the moderator variable. Under these circumstances, moderator effects can be attributed to the manipulated variables.

Especially low levels of certainty in theoretical explanation are associated with moderators that serve as proxy variables for other, causal factors. For example, synthesists have sometimes examined how the effects reported in included studies vary with those reports' publication year and the identity of their authors. Although the empirical relation between such variables and the effect of interest may be well established, its interpretation usually is not clear, because the moderating variable cannot itself be regarded as causal but instead is a stand-in for the actual causal factors. For example, an effect of publication year might be a proxy for temporal changes in the cultural or other factors that affect the phenomenon under investigation. This mechanism is especially plausible as an explanation for relations between the publication year of research reports and the magnitude of sex differences in social behavior. In recent years, women's increased participation in organized sports, employment outside the home, and education has co-occurred with reduced tendencies for research to report that men take greater risks than women (Byrnes, Miller, and Schafer 1999) and that men are more assertive than women (Twenge 2001). Other sex differences, however, have varied with year in a nonlinear fashion that does not correspond in any direct way with social change or with known variation in research methods; for example, the sex difference in smiling was smaller in data collected during the 1960s and 1980s than in other periods (LaFrance, Hecht, and Paluck 2003).

Other effects of publication year could be attributable to publication practices in science. For example, Michael Jennions and Anders Møller evaluated the effects of publication year in forty-four meta-analyses covering a wide range of topics in ecological and evolutionary biology (2002). They reported that, across meta-analyses, effect size was related to year of publication such that more recently published studies reported weaker findings. These year effects could reflect a tendency for larger, statistically significant study outcomes within a research literature to be published earlier, given that they have shorter time lags in submission, reviewing, and editorial decisions. As Jennions and Møller argued, these shorter time lags could reflect publication biases in science that favor statistically significant findings. They noted that such a publication bias could also explain why the effect sizes were smaller in the studies with larger samples. Presumably, studies with small samples and small effects were missing because their statistically nonsignificant findings rendered them unlikely to meet the standards for publication. In this account, year and sample size serve as proxies for the true causal factor, which is the publication bias shared by authors, peer reviewers, and journal editors. Despite the plausibility of this account, proxy variables such as year cannot establish the effects with a high level of certainty.
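Relations of this kind are typically probed with a weighted meta-regression of effect size on the proxy variable, as in the brief sketch below. The years, effects, and variances are fabricated for illustration and are not Jennions and Møller's data.

    import numpy as np

    year = np.array([1988.0, 1991.0, 1995.0, 1999.0, 2003.0, 2006.0])
    d = np.array([0.62, 0.55, 0.40, 0.35, 0.22, 0.18])  # hypothetical effects
    v = np.array([0.05, 0.04, 0.04, 0.03, 0.03, 0.02])  # sampling variances

    # Weighted least squares with inverse-variance weights:
    # beta = (X'WX)^{-1} X'Wy, with year centered for a readable intercept.
    w = 1 / v
    X = np.column_stack([np.ones_like(year), year - year.mean()])
    XtW = X.T * w
    beta = np.linalg.solve(XtW @ X, XtW @ d)
    print(f"mean effect = {beta[0]:.3f}, change per year = {beta[1]:.4f}")

Even when such a slope is reliably negative, the regression by itself cannot say whether the year effect reflects cultural change, publication practices, or some other confounded study attribute.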
24.3.2 Strategies Given Moderators of Low Certainty

We have argued that uncertainty can arise in moderator analyses when moderators are identified post hoc, are proxies for other causal factors, or are tested by between-studies comparisons.
These approaches yield instability in the obtained empirical relations as well as ambiguity in the correct conceptual interpretation of the moderator. Subsequent research can be designed to address these sources of uncertainty.

When uncertainty arises because moderators are assessed on a between-studies basis, sometimes such analyses can be validated within the same synthesis by within-study analyses. These opportunities arise when a few studies in a literature manipulate a potential moderator and the remaining studies can be classified as possessing a particular level of that moderator. For example, Wendy Wood and Jeffrey Quinn's synthesis of research on the effects of forewarning of impending influence appeals revealed that research participants tended to shift their attitudes toward an appeal before it was delivered, apparently in a pre-emptive attempt to reduce the appeal's influence (2003). Between-studies moderator analyses suggested that this shift emerged primarily with message topics that were not highly involving. However, the effects of involvement could be established with only limited certainty in the between-studies comparisons, because only a single study used a highly involving topic. Fortunately, Wood and Quinn were able to validate the involvement effect with supplemental within-study analyses that compared effect sizes in the high- and low-involvement conditions of studies that experimentally varied participants' involvement with the topic of the appeal. When corroborated in this way, greater certainty can be accorded to the empirical relations established by between-studies moderator analyses and to their conceptual interpretation.

Interpretations given to proxy variables have particularly low certainty because proxies are both post hoc moderators and typically evaluated on a between-studies basis. To address these points of uncertainty, subsequent research is needed to demonstrate the replicability of the variable's covariation with study outcomes as well as to clarify its interpretation. Explaining the impact of a proxy variable can be quite difficult because it might reflect a wide range of other variables, and there may be few guides for selecting particular candidates to pursue in subsequent research. As we explain in the next section, one possible strategy is to examine study attributes or other variables that might plausibly mediate the effects of the proxy variable on study outcomes.

When uncertainty arises in the conceptual interpretation of moderators, the optimal strategy for follow-up research is experimentation that manipulates the constructs that have been identified. For example, Alice Eagly and Steven Karau found that the tendency for men to emerge as leaders more often than women in initially leaderless small groups was stronger with masculine than feminine group tasks (1991). This finding, however, presented some ambiguity, because the gender-typing of tasks was confounded with other features of the studies' methods, and the primary researchers had determined gender-typing in differing ways. Barbara Ritter and Janice Yoder therefore conducted an experiment in which mixed-sex pairs expected to perform a planning task that was identical except for its masculine, feminine, or gender-neutral topic (2004). The usual gender gap favoring male leadership was present with masculine and gender-neutral tasks but reversed to favor female leadership with feminine tasks. This finding confirmed the moderation of leader emergence by task sex-typing and secured the conceptual interpretation that access to leadership depends on the congruity of the particular leadership role with gender-stereotypical expectations.

Sometimes the correct interpretation of any single moderator in a given synthesis is uncertain, but greater certainty emerges when the results are viewed across a set of moderating variables. This situation occurred in Wood and Quinn's synthesis (2003). They evaluated how warning effects on attitudes are moderated by a number of factors associated with participants' active defense of their existing views. The included research varied a diverse set of factors to encourage defensive thought about the appeal, including instructing participants to list their thoughts before receiving the message and using message topics high in personal involvement. Consistent with the interpretation that these experimental variations did in fact increase recipients' counterarguing of the impending message, they all increased attitudinal resistance to the appeal. This convergence of effects across a number of theoretically relevant variables promotes greater certainty in interpreting moderating relations than would be possible with examination of any single variable alone.

In summary, moderator variables that synthesists discover through empirical clustering on a post hoc basis or test through between-studies analyses can be informative, particularly because such analyses can demonstrate relations not investigated in primary research. Nonetheless, substantial uncertainty about the stability of the empirical relations can arise from the threats to statistical conclusion validity inherent in such analyses. Also, uncertainty in theoretical accounts can arise from the low construct validity inherent in such analyses, given the possible confounding influences of unacknowledged variables.
Especially low levels of certainty should be accorded to the interpretation of variables such as author sex and year of publication that serve as proxies for other, more proximal predictors. Given moderators that yield low empirical certainty, one strategy is to replicate the moderator's impact in within-study moderator tests, and another is to manipulate the moderating variables in subsequent experimentation. A third strategy is to examine the convergence of effects across sets of possible moderators. Yet a fourth strategy, as we explain in the next section, is to conduct mediational analyses that examine study characteristics that covary with the moderator.

24.3.3 Mediators of Main-Effect Findings

Meta-analyses can meet the dual goals of clarifying empirical relations and identifying theoretical interpretations by evaluating the mediators, or causal mechanisms, that underlie an effect. A strong demonstration of mediation requires a closely specified set of relations between the mediator and the main-effect relation to be explained (see Kenny, Kashy, and Bolger 1998). In brief, a mediating effect can be said to obtain when the following pattern occurs: an independent variable, such as a health intervention program, significantly influences a dependent variable, such as exercise activity; a third, potentially mediating variable, such as self-efficacy to exercise, also influences the exercise dependent variable; and the effect of the intervention program in increasing exercise is at least partially accounted for by increases in self-efficacy. Whenever several primary studies within a literature have evaluated a common mediating process, meta-analytic models can be constructed to address each of the relations involved in a mediation model (see chapter 20, this volume).
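In the simplest standardized case, those relations can be solved directly from the pooled correlations among the independent variable, mediator, and outcome. The sketch below uses invented pooled correlations purely to show the arithmetic of partitioning a total effect into direct and indirect components.

    # Hypothetical pooled correlations among intervention (X), mediator (M),
    # and outcome (Y); not estimates from any synthesis cited in this chapter.
    r_xm, r_my, r_xy = 0.35, 0.40, 0.25

    a = r_xm                                          # path X -> M
    b = (r_my - r_xm * r_xy) / (1 - r_xm ** 2)        # path M -> Y, controlling X
    c_prime = (r_xy - r_xm * r_my) / (1 - r_xm ** 2)  # direct effect X -> Y
    indirect = a * b                                  # mediated portion
    print(f"indirect = {indirect:.3f}, direct = {c_prime:.3f}, total = {indirect + c_prime:.3f}")

Because each pooled correlation may come from a different subset of studies, such synthetic models inherit all of the uncertainties about heterogeneity and sampling discussed earlier in this chapter.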
Illustrating mediation tests with meta-analytic data, Marta Durantini and her colleagues examined the mechanisms through which highly expert sources exert influence on health topics, in particular condom use (2006). In general, expert sources' messages increased condom use more than less expert sources' messages did. To identify the underlying psychological processes, the synthesists predicted the magnitude of the expertise effect from the possible mediating beliefs recipients held. The influence of expert sources on recipients' behavior change proved to be mediated by changes in recipients' perceptions of norms, perceived control, behavioral skills, and HIV knowledge. Experts' ability to bring about behavior change can thus be attributed to their success at changing these particular beliefs with respect to condom use. However, despite the strong empirical support for mediation, the models were tested with correlational data. The potential threat to statistical conclusion validity in correlational designs suggests the need for additional research that manipulates the mechanism in question.

Mediational analyses also can be applied to account for the effects of proxy variables such as date of publication and researcher identity. In a demonstration of how the effects of year of publication or data collection can reflect changes over time in the phenomenon being investigated, Brooke Wells and Jean Twenge reported significant shifts over time in sexual activity, especially among women (2005). From 1950 to 1999, the percentage of sexually active young people in the United States and Canada increased, and the age at first intercourse decreased. Wells and Twenge explored whether generational challenges, such as increasing divorce rates, were responsible for these changes. They found a positive relation between the incidence of divorce and the percentage of young people who were sexually active and approved of premarital sex. The generational changes did not have uniform effects across all indicators of sexuality, however, and divorce had little relation to age at first intercourse. The inconsistent evidence that cohort changes are linked to sexuality indicators renders the cause of the year effects uncertain, and future research is required to assess additional causal factors.

When the proxy variable that predicts study outcomes is research group, a promising approach to explaining these findings is to demonstrate that research group covaries with different research practices, such as the preferred methods, data-analytic procedures, or strategy for reporting findings. In the absence of such a demonstration, an innovative procedure for disentangling discrepancies between research groups' findings entails joint design of crucial experiments by the antagonists, a procedure Gary Latham, Miriam Erez, and Edwin Locke used to successfully resolve conflicting findings on the effect of workers' participation on goal commitment and performance (1988). Drawing on the spirit of scientific cooperation, the researchers who had produced the discrepant findings in this area cooperated in the design and execution of experiments in order to identify the critical points of divergence that accounted for the effects of research group. Especially subtle differences in methods not discernible from research reports can be revealed by such joint primary research.
Unfortunately, synthesists rarely have many opportunities to assess mediation. A set of primary studies assessing a common mediating process is most likely to be found when research has been conducted within a dominant theoretical framework. More commonly, because only a few studies in a literature yield data concerning common mediators of an effect, only a few relations among mediators and the effect can be evaluated. In this circumstance, mediator findings provide only suggestive evidence for underlying mechanisms. This situation arose in a meta-analysis of the relation between self-esteem and how easily people can be influenced (Rhodes and Wood 1992). A number of studies in the synthesis assessed message recipients' recall of the persuasive arguments in the influence appeal as well as opinion change toward the appeal. Lower self-esteem was associated with lesser recall as well as with lesser opinion change compared with results for moderate self-esteem. People with low esteem were apparently too withdrawn and distracted to attend to and comprehend the messages and so were not influenced by them. The parallel effects for recall and opinions provide suggestive support for models that emphasize message reception as a mediator of opinion change (McGuire 1968). However, because not enough data were available to examine the link between reception and influence, some uncertainty clouds this interpretation. In general, the internal validity of conclusions about a mediating process is threatened to the extent that a research synthesis is unable to examine the full range of relations involved in the hypothesized mediational process.

Sometimes the likely mechanism of mediation in a synthesis is suggested by the pattern of study outcomes. For example, in a meta-analysis of epidemiological studies examining the effects of tea consumption on colorectal cancer risk, Can-Lan Sun and her colleagues reported sex differences in the protective effects of black tea (2006). That is, a significant risk reduction emerged for women, but a nonsignificant risk enhancement effect emerged for men. The synthesists speculated that the effects for gender were attributable to hormonal mediators, given that, in direct tests, sex hormones have been implicated in protection against colorectal cancer. Such a mediational model would need to be tested directly, however, to be accepted with any certainty.

In general, when studies do not provide measures of mediation, several indirect meta-analytic methods can be used to test hypotheses concerning mediation. One such method involves constructing general process models at the meta-analytic level from studies that have examined components of the process relations. For example, Beth Lewis and her colleagues examined mediators of the effects of health interventions on people's actual exercise activity (2002). Although they did not have enough data from individual studies to build complete mediational models, they evaluated the two components of mediation. First, they examined the effects of the interventions on potential mediators, including behavioral processes (such as substituting alternative responses and controlling stimuli), self-efficacy beliefs, cognitive processes (such as increasing knowledge and understanding risks), social support, and the perceived pros versus cons of exercise. Second, they examined whether changes in the mediators coincided with changes in exercise activity. They concluded that the most consistent evidence supported behavioral processes as a mediator of successful interventions. However, in the absence of mediator data that examined the full sequence of events involved in the causal model, a meta-analytic combination of studies assessing component parts does not conclusively demonstrate the full proposed model. Because building process models at a meta-analytic level from separate tests of a model's components threatens internal validity, such analyses cannot be interpreted with high certainty.

Another indirect method for evaluating mediators involves judges' coding or rating of process-relevant aspects of the synthesized studies. Process ratings are especially useful when the included studies are drawn from diverse theoretical paradigms and no single mediational process was assessed by more than a few studies. In this case, process ratings may make it possible to test a general causal model across the full range of included experimental paradigms. For example, David Wallace and his colleagues used process ratings to test the idea that the directive effects of attitudes on behavior are weakened by situational constraints, such as perceived social pressure and perceived difficulty of performance (2005). Constraints were estimated through contemporary students' ratings of the likelihood that others close to them would want them to engage in each of the behaviors in the included research and that each behavior would be difficult to perform. As expected, attitudes were not as closely related to behavior when the student judges reported stronger social pressure or difficulty. The synthesists concluded that behavior is a product of both situations and attitudes and that the relative impact of any one factor depends on the other. Of course, process ratings do not directly substantiate mediating processes. Although such judgments may have high reliability and stability, and their empirical relations to effect sizes may be strong, they are not necessarily unbiased.
Such judgments may not accurately estimate the processes that took place for participants in studies conducted at earlier dates. For these and other reasons, judges' ratings of mediators do not yield high certainty in theoretical interpretation.

Uncertainty concerning the empirical status of the mediating processes underlying an effect is a frequent occurrence in research syntheses, for the reasons we have stated. We discussed three problems that may create uncertainty about mediation in meta-analytic syntheses: mediational models may rely exclusively on correlational data, be constructed from primary research that tested only subsets of the overall model, or use judges' ratings to represent mediating variables.

24.3.4 Strategies Given Low Certainty About Mediators

When mediational models cannot be established with high certainty, the causal status of the mediator with respect to the relation of interest remains uncertain. Primary research is then needed to substantiate the suggested process models.

In research on eyewitness facial identification, investigators have pursued this strategy of following up with experimentation on the mediational models suggested in research syntheses. A meta-analytic synthesis by Peter Shapiro and Steven Penrod suggested that reinstatement of the context in which a face was initially seen facilitated subsequent recognition of the face (1986). The mechanisms through which context had this impact could not be evaluated in the synthesis, and subsequent primary research addressed this issue. By experimentally varying features of the target stimulus and features of the context, follow-up research indicated that reinstatement of context enhances recognition accuracy because it functions as a memory retrieval cue (Cutler and Penrod 1988; Cutler, Penrod, and Martens 1987). Follow-up research thus demonstrated with some certainty a mediational model that could account for the meta-analytic findings, specifically that context reinstatement provided memory cues to assist recognition.

In general, we advocate a two-pronged strategy to document mediational models with sufficient certainty. First, meta-analytic methods of evaluating the mediational process are devised, and correlational analyses are conducted to demonstrate that the mediator covaries as expected with the effect of interest. Because mediational models claim to identify causal relations, we also advocate a second step of directly manipulating the hypothesized mediating process in primary research to document its causal impact on the dependent variable. Inviting such follow-up research is, for example, Sun and her colleagues' finding about colorectal cancer risk, gender, and black tea (2006). If the protective effects are attributable to female sex hormones, then follow-up research with animals that directly manipulates tea consumption and hormone levels and examines the effects on cancer incidence should provide evidence of the mediating role of hormones in risk reduction. These two research strategies establish mediation with a high level of certainty by providing complementary evidence: correlational assessment of the mediator demonstrates that it varies in the predicted manner with the relation of interest, and experimental manipulation of the mediator demonstrates its causal role. Accomplishing the correlational assessment in a meta-analysis takes advantage of this technique's ability to document main-effect relations and to suggest possible process models. Manipulating the mediator in subsequent primary experiments capitalizes on the strong internal validity of this type of research.

24.4 USING MAIN EFFECTS, MODERATORS, AND MEDIATORS TO TEST THEORIES COMPETITIVELY

In recent years, research syntheses have begun to be used to test alternative theoretical accounts of particular phenomena. In competitive theory testing, the analysis strategy is not inductive but is driven largely by competing theoretical propositions. This approach benefits from the ready availability of meta-analytic statistical techniques and computing programs that facilitate theory testing, such as programs designed to estimate path models. Such meta-analyses sometimes proceed by examining main-effect relations but more often by examining potential mediators and moderators of effects that are deduced directly from theoretical frameworks.

In meta-analyses, certainty in competitive theory testing is threatened when theoretically derived variables fail to account adequately for the patterns of observed effects. We have discussed some of these sources of uncertainty with respect to main effects, moderators, and mediators, especially the poor construct validity that emerges when theoretically defined variables are insufficient to account for the variability in effect size outcomes. Additional limits to construct validity arise when the included literature does not sample the full range of the theoretical variables or when the pattern of empirical effects is consistent with multiple theoretical models. In addition, uncertainty is suggested by threats to internal validity, such as insufficient evidence of causality among theoretical constructs.
24.4.1 Mechanisms of Competitive Theory Testing

Competitive theory testing about main-effect relations might occur when one theory favors certain predictors of an outcome, whereas another theory favors other predictors. Illustrating this approach, Thomas Ng and his colleagues' synthesis tested whether career success is better accounted for by human capital theory, which emphasizes employees' experience and knowledge, or by sociopolitical theory, which emphasizes employees' networking ties and visibility in the company or field (2005). Success in terms of salary was best predicted by human capital variables, presumably because they indicate individuals' worth to an organization. In contrast, success in terms of promotion to higher job levels was best predicted by social factors, presumably because of the political nature of promotion decisions. In this analysis, then, different theories appeared to account for different job outcomes.

Illustrating competitive theory testing using moderator analyses, Jean Twenge, Keith Campbell, and Craig Foster evaluated theoretical explanations for the reduction in marital satisfaction that occurs with the advent of children (2003). They considered how well each theory accounted for the pattern of moderating variables in the included studies. In the synthesis, marital satisfaction decreased more for mothers than fathers, for those of higher than lower SES, and for parents of infants than of older children. This moderator pattern was consistent with two of the theories but not with others. Specifically, the moderation could be explained by a role-conflict account, given that high-SES women with infants are especially likely to experience conflict as parenthood causes family roles to be reorganized along traditional lines. Alternatively, the moderation could reflect a restriction-of-freedom account if, as the synthesists assumed, high-SES women with infants experience the greatest changes with the onset of motherhood because their time becomes more restricted. With this failure of the meta-analytic data to arbitrate between the two competing theories, it is unclear how to interpret mothers' SES as it influences marital satisfaction. Given this threat to the conceptual validity of the SES measure, any one account needs to be accorded a relatively low level of certainty as an explanation of the obtained pattern of findings.

Sometimes it is possible to arbitrate between competing theories by using statistical procedures that identify how well each set of theoretical relations fits the obtained meta-analytic findings. This approach was taken in Judith Ouellette and Wendy Wood's synthesis comparing decision-making theories of action, in which behavior is a product of people's explicit intentions to respond, with theories of habitual repetition, in which repeated behavior is cued by stable features of the contexts in which it is routinely performed (1998). These synthesists constructed path models to compare intentions and past behavior as predictors of future behavior. Consistent with decision-making models, behaviors such as getting flu shots, which are unlikely to be repeated with enough regularity to form habits, proved to be primarily a product of people's intentions. However, consistent with a habit model, behaviors such as wearing seat belts, which can be repeated with enough regularity in stable contexts to form habits, were primarily repeated without influence from intentions. This analysis strategy of computing path models in different groupings of studies and then comparing the models is essentially a correlational approach. Given the threat to internal validity in such a design, additional experimental research is needed to test the underlying causal model.
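The logic of such a comparison can be illustrated with a pooled correlation matrix from which the variance in future behavior explained by each theoretically favored predictor set is computed. The matrix below is invented and merely mirrors the structure, not the results, of the synthesis just described.

    import numpy as np

    # Pooled correlations among intention (0), past behavior (1),
    # and future behavior (2); hypothetical values.
    R = np.array([[1.00, 0.45, 0.50],
                  [0.45, 1.00, 0.60],
                  [0.50, 0.60, 1.00]])

    def r_squared(predictors, outcome=2):
        """Variance in the outcome explained by a set of predictors."""
        Rxx = R[np.ix_(predictors, predictors)]
        rxy = R[np.ix_(predictors, [outcome])]
        return (rxy.T @ np.linalg.solve(Rxx, rxy)).item()

    print(f"intentions only:     R^2 = {r_squared([0]):.3f}")
    print(f"past behavior only:  R^2 = {r_squared([1]):.3f}")
    print(f"both predictors:     R^2 = {r_squared([0, 1]):.3f}")

Repeating the computation separately for groupings of studies, for example behaviors performed in stable versus shifting contexts, would show whether intentions or past behavior dominates in each grouping, which is the comparison that distinguishes the competing models.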
An additional illustration of statistical procedures used to test theories competitively comes from Daniel Heller, David Watson, and Remus Ilies's investigation of whether individual differences in life satisfaction reflect the top-down effects of temperament and personality traits that yield chronic affective tendencies or instead reflect bottom-up mechanisms through which particular life events, such as work and marriage, exert their impact (2004). Competitive tests of these accounts estimated meta-analytic relations between personality (neuroticism, conscientiousness), general life satisfaction, and specific satisfaction with work and marriage. A completely top-down model, with personality predicting each of the types of satisfaction, failed to fit the data. Instead, the data suggested some bottom-up input in the form of direct relations among the types of satisfaction themselves. However, the synthesists noted that their use of job and marital satisfaction as proxy variables for actual job and marital outcomes threatens the conceptual validity of their bottom-level measures. As a result, only low certainty can be accorded the conclusions based on the effects of these substitutes for real life events, and a more definitive test of the theory awaits more optimal operationalizations. Additional uncertainty in the conclusions arises from the reliance on cross-sectional correlational studies.
Such designs provide little basis for inferring causality and do not rule out the possibility that life satisfaction and life events influence personality, despite the synthesists' claim that personality is to a large extent innate and unlikely to be influenced by bottom-up events in this manner. Thus, this threat to internal validity further lowers certainty in the synthesis conclusions.

24.4.2 Strategies Given Low Certainty About Competing Theories

The appropriate strategy for following up a competitive theory-testing meta-analysis depends on where uncertainty emerges. As illustrated in Jean Twenge and her colleagues' investigation of the decrease in marital satisfaction with parenthood, the moderating variables predicting reduced marital satisfaction in the synthesis could plausibly be interpreted in terms of multiple theories (2003). Thus, follow-up meta-analytic or primary research is needed to assess directly the moderating and mediating variables uniquely indicative of each theoretical account.

Uncertainty in theory testing also arises from correlational designs that do not provide enough evidence of the direction of the causal relation among theoretical constructs. For example, Ouellette and Wood's meta-analysis used path models to differentiate between habits and intentions as predictors of future behavior (1998). To address the potential threat to validity, Jeffrey Quinn, Anthony Pascoe, David Neal, and Wendy Wood established habits through simple response repetition in an experimental paradigm and were able to demonstrate the causal impact of habitual tendencies, as opposed to intentions, in guiding future action (2009). Considerable theoretical certainty is established by this strategy of conducting experiments to evaluate the causal mechanisms thought to underlie correlational findings from a meta-analysis.

Uncertain meta-analytic conclusions also were produced by the cross-sectional correlational design of Heller, Watson, and Ilies's analysis of bottom-up versus top-down models of life satisfaction (2004). It is interesting to compare this approach with Sonja Lyubomirsky, Laura King, and Ed Diener's related meta-analysis addressing the top-down effects of dispositional happiness on success in the specific life domains of marriage, friendship, and work (2005). Lyubomirsky and her colleagues conducted several analyses, each synthesizing a specific research design. The analysis of cross-sectional designs demonstrated the existence of relations between happy dispositions and success in specific domains. The analysis of longitudinal designs provided clearer evidence of causal direction by demonstrating that happiness precedes success in work life, social relationships, and physical well-being. The analysis of experimental designs potentially could yield even greater certainty about causality, though the findings from this paradigm were mixed, with experimental inductions of happy mood associated with greater sociality and health but not always with more effective problem solving. Lyubomirsky and her colleagues' use of these study designs increased the certainty of their conclusions concerning the top-down influences of dispositions on the life outcomes of sociality and health.

In summary, subsequent primary research or meta-analyses can be devised that use research designs with high internal validity to address uncertainty in the causal direction underlying predicted relations.

24.5 CERTAIN AND UNCERTAIN FINDINGS DIRECT SUBSEQUENT RESEARCH

The scientific contributions of research syntheses with highly certain findings are widely acknowledged. Syntheses that produce highly certain research findings and theoretical accounts contribute to the canon of known scientific facts and theoretical models. These syntheses can be considered foundational in that they provide a base of accepted knowledge from which to launch new empirical investigations and develop new explanatory frameworks.

A less well recognized role for research synthesis is to identify points of uncertainty in empirical findings and theoretical explanations. Synthesis findings contribute to the progression of science when they take stock of what is not yet known across a body of literature. This generative type of synthesis identifies uncertainties that can catalyze subsequent empirical data collection and theory development. Generative syntheses provide guideposts indicating what issues need subsequent investigation. To assume this role, meta-analyses need to identify clearly the sources of uncertainty in the covered literature. Through the many examples in this chapter, we identified particular threats to validity that generate uncertainty in empirical and theoretical accounts. Subsequent research and theoretical development can then target these sources of uncertainty.

In contrast to these positive functions of the identification of uncertainties, syntheses also could produce uncertain findings with little understanding of where the uncertainty emerges. When uncertainty arises from ambiguous or unknown sources, syntheses fail as generative guides to subsequent research and theorizing.
to subsequent research and theorizing. For example, syntheses that are conceptualized extremely broadly and that include studies in a wide variety of paradigms may not have much success in predicting study outcomes because of the many specific considerations that determine the size of effects within particular paradigms. In addition, the wide range of possible theoretical explanations for such diverse effects can make it difficult to test any particular theoretical model for the findings. To produce results that highlight specific sources of uncertainty, smaller, subsidiary meta-analyses examining predictors and mediators specific to each experimental paradigm would be necessary.

The two roles that syntheses can play with respect to subsequent research, providing both a foundation of accepted knowledge and a generative impetus to further investigation, have important implications for the way that research findings are portrayed. If scientific payoff through publication and citation is believed to flow primarily from highly certain empirical findings and theoretical explanations, synthesists may fail to capitalize on the generative role that syntheses can play in highlighting promising avenues for subsequent research. This role is not met through the conventional call for additional research that can be found at the end of many published studies. Instead, generative syntheses systematically assess the points of invalidity in an existing literature and identify the associated uncertainty with which scientists should accept empirical findings and theoretical accounts. We have argued for the importance of generative syntheses throughout this chapter, and we note that such analyses already appear in top journal outlets in psychology. Illustrating such a synthesis is the Oyserman, Coon, and Kemmelmeier analysis of the limited construct validity of individualism and collectivism in cultural psychology (2002).

The belief that syntheses are evaluated primarily for producing highly certain findings might also encourage synthesists inappropriately to portray their work as producing clear outcomes that definitively answer research questions. Conclusions that should be drawn with considerable uncertainty, because of either insufficient empirical documentation or unclear theoretical accounts, could be portrayed instead as highly certain. In the earlier edition of this handbook, we suggested that false certainty is conveyed when synthesists fail to test for homogeneity of effects or when they fail to consider adequately the implications of rejection of homogeneity for the effects they are summarizing (Eagly and Wood 1994). These problems are especially likely to occur in meta-analyses oriented toward identifying empirical relations between variables. False certainty in meta-analyses that competitively test theories is likely to emerge when synthesists fail to acknowledge the limited conceptual validity of their variables or the poor internal validity that might, for example, impede identification of causal mechanisms in the theoretical analysis.

Syntheses that falsely emphasize certainty can discourage subsequent investigation and distort the scientific knowledge base. To the extent that these problems occur, quantitative synthesis fails to fulfill its promise of furthering progress toward the twin goals of ascertaining empirical relations and determining accurate theoretical explanations for these relations.

24.6 CONCLUSION

By emphasizing the varying levels of certainty that the scientific community should accord to the conclusions of syntheses, we encourage the authors and the consumers of meta-analyses to adopt a critical stance in relation to research syntheses. Syntheses should be critically evaluated with respect to specific goals. We differentiated between the goal of establishing particular relations and the goal of evaluating mediators and moderators of those relations. We noted that fulfilling either of these goals would allow meta-analyses to compare the adequacy of theoretical accounts for phenomena, including testing them competitively.

By contributing to foundational knowledge in a discipline and by generating new avenues for subsequent research, syntheses do not represent the final phase of a field's consideration of a hypothesis. Instead, the two goals of syntheses we considered represent potentially helpful aids to sifting through and evaluating evidence of empirical relations and theoretical explanations.

We intend for this chapter to encourage the authors of research syntheses to provide careful analyses of the certainty with which they accept empirical and theoretical conclusions. This chapter also promotes syntheses' generative role by suggesting areas of subsequent empirical research and theory development. Perhaps the perception that journal publication requires clear-cut findings has sometimes encouraged the authors of syntheses to announce their conclusions more confidently than meta-analytic evidence warrants. To counter this perception, we encourage journal editors to foster appropriate exploration of uncertainty and to favor publication of research syntheses that do not obscure or ignore the uncertainty
that should be applied to their conclusions. Careful analysis of the certainty that should be accorded to the set of conclusions produced by a meta-analysis fosters a healthy interaction between research syntheses and future primary research and theory development.

24.7 REFERENCES

Byrnes, James P., David C. Miller, and William D. Schafer. 1999. "Gender Differences in Risk Taking: A Meta-Analysis." Psychological Bulletin 125(3): 367–83.
Cohen, Jacob. 1988. Statistical Power Analysis for the Behavioral Sciences. Hillsdale, N.J.: Lawrence Erlbaum.
Cooper, Harris M., and Robert Rosenthal. 1980. "Statistical Versus Traditional Procedures for Summarizing Research Findings." Psychological Bulletin 87(3): 442–49.
Cutler, Brian L., and Steven D. Penrod. 1988. "Improving the Reliability of Eyewitness Identification: Lineup Construction and Presentation." Journal of Applied Psychology 73(2): 281–90.
Cutler, Brian L., Steven D. Penrod, and Todd K. Martens. 1987. "Improving the Reliability of Eyewitness Identification: Putting Context into Context." Journal of Applied Psychology 72(4): 629–37.
Dauchet, Luc, Philippe Amouyel, and Jean Dallongeville. 2005. "Fruit and Vegetable Consumption and Risk of Stroke: A Meta-Analysis of Cohort Studies." Neurology 65(8): 1193–97.
Durantini, Marta R., Dolores Albarracín, Amy L. Mitchell, Allison N. Earl, and Jeffrey C. Gillette. 2006. "Conceptualizing the Influence of Social Agents of Behavior Change: A Meta-Analysis of the Effectiveness of HIV-Prevention Interventionists for Different Groups." Psychological Bulletin 132(2): 212–48.
Eagly, Alice H., Serena Chen, Shelly Chaiken, and Kelly Shaw-Barnes. 1999. "The Impact of Attitudes on Memory: An Affair to Remember." Psychological Bulletin 125(1): 64–89.
Eagly, Alice H., Patrick Kulesa, Laura A. Brannon, Kelly Shaw, and Sarah Hutson-Comeaux. 2000. "Why Counterattitudinal Messages Are as Memorable as Proattitudinal Messages: The Importance of Active Defense Against Attack." Personality and Social Psychology Bulletin 26(11): 1392–1408.
Eagly, Alice H., and Steven J. Karau. 1991. "Gender and the Emergence of Leaders: A Meta-Analysis." Journal of Personality and Social Psychology 60(5): 685–710.
Eagly, Alice H., and Wendy Wood. 1994. "Using Research Syntheses to Plan Future Research." In The Handbook of Research Synthesis, edited by Harris Cooper and Larry V. Hedges. New York: Russell Sage Foundation.
Greenwald, Anthony G. 1975. "Consequences of Prejudice Against the Null Hypothesis." Psychological Bulletin 82(1): 1–20.
Harris, Alex H. S. 2006. "Does Expressive Writing Reduce Health Care Utilization? A Meta-Analysis of Randomized Trials." Journal of Consulting and Clinical Psychology 74(2): 243–52.
Hedges, Larry V., and Ingram Olkin. 1985. Statistical Methods for Meta-Analysis. Orlando, Fla.: Academic Press.
Hedges, Larry V., and Therese D. Pigott. 2001. "The Power of Statistical Tests in Meta-Analysis." Psychological Methods 6(3): 203–17.
Heller, Daniel, David Watson, and Remus Ilies. 2004. "The Role of Person Versus Situation in Life Satisfaction: A Critical Examination." Psychological Bulletin 130(4): 574–600.
Hofstede, Geert. 1980. Culture's Consequences. Beverly Hills, Calif.: Sage Publications.
Jennions, Michael D., and Anders P. Møller. 2002. "Relationships Fade with Time: A Meta-Analysis of Temporal Trends in Publication in Ecology and Evolution." Proceedings of the Royal Society of London 269: 43–48.
Kenny, David A., Deborah Kashy, and Nils Bolger. 1998. "Data Analysis in Social Psychology." In Handbook of Social Psychology, 4th ed., edited by Daniel Gilbert, Susan Fiske, and Gordon Lindzey. New York: McGraw-Hill.
Kromrey, Jeffrey D., and Gianna Rendina-Gobioff. 2006. "On Knowing What We Do Not Know: An Empirical Comparison of Methods to Detect Publication Bias in Meta-Analysis." Educational and Psychological Measurement 66(3): 357–73.
LaFrance, Marianne, Marvin A. Hecht, and Elizabeth L. Paluck. 2003. "The Contingent Smile: A Meta-Analysis of Sex Differences in Smiling." Psychological Bulletin 129(2): 305–34.
Latham, Gary P., Miriam Erez, and Edwin A. Locke. 1988. "Resolving Scientific Disputes by the Joint Design of Crucial Experiments by the Antagonists: Application to the Erez-Latham Dispute Regarding Participation in Goal Setting." Journal of Applied Psychology 73(4): 753–72.
Lewis, Beth A., Bess H. Marcus, Russell R. Pate, and Andrea L. Dunn. 2002. "Psychosocial Mediators of Physical Activity Behavior Among Adults and Children." American Journal of Preventive Medicine 23(2S): 26–35.
Lipsey, Mark W., and David B. Wilson. 1993. "The Efficacy of Psychological, Educational, and Behavioral Treatment: Confirmation from Meta-Analysis." American Psychologist 48(12): 1181–1209.
Lyubomirsky, Sonja, Laura King, and Ed Diener. 2005. "The Benefits of Frequent Positive Affect: Does Happiness Lead to Success?" Psychological Bulletin 131(6): 803–55.
McGuire, William J. 1968. "Personality and Susceptibility to Social Influence." In Handbook of Personality Theory and Research, edited by E. F. Borgatta and W. W. Lambert. Chicago: Rand McNally.
Ng, Thomas W. H., Lillian T. Eby, Kelly L. Sorensen, and Daniel C. Feldman. 2005. "Predictors of Objective and Subjective Career Success: A Meta-Analysis." Personnel Psychology 58(2): 367–408.
Ouellette, Judith A., and Wendy Wood. 1998. "Habit and Intention in Everyday Life: The Multiple Processes by Which Past Behavior Predicts Future Behavior." Psychological Bulletin 124(1): 54–74.
Oyserman, Daphne, Heather M. Coon, and Markus Kemmelmeier. 2002. "Rethinking Individualism and Collectivism: Evaluation of Theoretical Assumptions and Meta-Analyses." Psychological Bulletin 128(1): 3–72.
Quinn, Jeffrey M., Anthony Pascoe, David Neal, and Wendy Wood. 2009. "Strategies of Self-Control: Habits Are Not Tempting." Manuscript under editorial review.
Rhodes, Nancy D., and Wendy Wood. 1992. "Self-Esteem and Intelligence Affect Influenceability: The Mediating Role of Message Reception." Psychological Bulletin 111(1): 156–71.
Richard, F. D., Charles F. Bond Jr., and Juli J. Stokes-Zoota. 2003. "One Hundred Years of Social Psychology Quantitatively Described." Review of General Psychology 7(4): 331–63.
Ritter, Barbara A., and Janice D. Yoder. 2004. "Gender Differences in Leader Emergence Persist Even for Dominant Women: An Updated Confirmation of Role Congruity Theory." Psychology of Women Quarterly 28(3): 187–93.
Shadish, William R., Thomas D. Cook, and Donald T. Campbell. 2002. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston, Mass.: Houghton Mifflin.
Shapiro, Peter N., and Steven D. Penrod. 1986. "Meta-Analysis of Facial Identification Studies." Psychological Bulletin 100(2): 139–56.
Sharpe, Donald, and Lucille Rossiter. 2002. "Siblings of Children with a Chronic Illness: A Meta-Analysis." Journal of Pediatric Psychology 27(8): 699–710.
Sun, Can-Lan, Jian-Min Yuan, Woon-Puay Koh, and Mimi C. Yu. 2006. "Green Tea, Black Tea and Colorectal Cancer Risk: A Meta-Analysis of Epidemiologic Studies." Carcinogenesis 27(7): 1301–09.
Twenge, Jean M. 2001. "Changes in Women's Assertiveness in Response to Status and Roles: A Cross-Temporal Meta-Analysis, 1931–1993." Journal of Personality and Social Psychology 81(1): 133–45.
Twenge, Jean M., W. Keith Campbell, and Craig A. Foster. 2003. "Parenthood and Marital Satisfaction: A Meta-Analytic Review." Journal of Marriage and the Family 65(3): 574–83.
Van Ijzendoorn, Marinus H., Femmie Juffer, and Caroline W. K. Poelhuis. 2005. "Adoption and Cognitive Development: A Meta-Analytic Comparison of Adopted and Nonadopted Children's IQ and School Performance." Psychological Bulletin 131(2): 301–16.
Wallace, David S., René M. Paulson, Charles G. Lord, and Charles F. Bond Jr. 2005. "Which Behaviors Do Attitudes Predict? Meta-Analyzing the Effects of Social Pressure and Perceived Difficulty." Review of General Psychology 9(3): 214–27.
Weisz, John R., Amanda J. Doss, and Kristin M. Hawley. 2005. "Youth Psychotherapy Outcome Research: A Review and Critique of the Evidence Base." Annual Review of Psychology 56: 337–63.
Wells, Brooke E., and Jean M. Twenge. 2005. "Changes in Young People's Sexual Behavior and Attitudes, 1943–1999: A Cross-Temporal Meta-Analysis." Review of General Psychology 9(3): 249–61.
Wilson, David B., and Mark W. Lipsey. 2001. "The Role of Method in Treatment Effectiveness Research: Evidence from Meta-Analysis." Psychological Methods 6(4): 413–29.
Wood, Wendy, and Jeffrey M. Quinn. 2003. "Forewarned and Forearmed? Two Meta-Analytic Syntheses of Forewarnings of Influence Appeals." Psychological Bulletin 129(1): 119–38.
25
RESEARCH SYNTHESIS AND PUBLIC POLICY

DAVID S. CORDRAY
Vanderbilt University

PAUL MORPHY
Vanderbilt University

CONTENTS
25.1 Introduction 474
25.2 Public Policy, Policy Makers, and the Policy Context 474
25.2.1 Who Are the Policy Makers? 475
25.2.2 Competing Agendas and Perspectives 476
25.3 Synthesis in Policy Arenas 476
25.3.1 Some Historical Trends 477
25.4 Potential Biases 478
25.4.1 The Environmental Protection Agency and Secondhand Smoke 478
25.4.2 National Institute of Education, Desegregation, and Achievement 479
25.4.3 National Reading Panel and Phonics 479
25.5 Evidence-Based Practices: Fuel for Policy-Oriented Meta-Analyses 480
25.5.1 Organizational Foundations for Evidence-Based Practices 480
25.5.1.1 Cochrane Collaboration 480
25.5.1.2 Campbell Collaboration 481
25.5.1.3 IES's What Works Clearinghouse 481
25.5.1.4 Organizations Around the World 481
25.5.2 Additional Government Policies and Actions 481
25.6 General Types of Policy Syntheses 482
25.6.1 Policy-Driven Research Syntheses 482
25.6.2 Investigator-Initiated Policy Syntheses 483
25.7 Minimizing Bias and Maximizing Utility 483
25.7.1 Principles for Achieving Transparency 483
25.7.2 Question Specification 484
25.7.2.1 Optimizing Utility 484
25.7.3 Searches 485
25.7.4 Screening Studies 485


25.7.5 Standards of Evidence 486
25.7.5.1 Level of Evidence Approach 486
25.7.5.2 Segmentation by Credibility or Strength of Evidence 487
25.7.5.3 Best Evidence Syntheses 487
25.7.5.4 Meta-Regression Adjustment Approach 487
25.7.5.5 Combined Strategies 488
25.7.6 Moderator Variables 488
25.7.7 Reporting and Communicating Results 488
25.8 Closing Remarks and Summary 489
25.9 Acknowledgments 489
25.10 Notes 490
25.11 References 490

Raising the drinking age has a direct effect on reducing alcohol-related traffic accidents
among youth affected by the laws, on average, across the states.
U.S. General Accounting Office (1987)1

More time spent doing homework is positively related to achievement, but only for mid-
dle and high school students.
Harris Cooper, Jorgianne Robinson, and Erika Patall (2006),
Harris Cooper and Jeffrey Valentine (2001)

Radiofrequency denervation can relieve pain from neck joints, but may not relieve pain
originating from lumbar discs, and its impact on low-back pain is uncertain.
Leena Niemisto, Eija Kalso, Antti Malmivaara,
Seppo Seitsalo, and Heikki Hurri (2003)

25.1 INTRODUCTION

These three statements are based on some form of research syntheses. Each was motivated by different interests but all share a purpose: they were conducted to inform policy makers and practitioners about what works. Rather than relying on personal opinion, a consensus of experts, or available research, each statement is based on the results of a systematic, quantitatively based synthesis. The purpose of this chapter is to describe the past, present, and future roles of research synthesis in public policy arenas like the ones represented in our three examples. Because meta-analysis is the chief research synthesis method for quantitatively integrating prior evidence and is an essential focus of this handbook, we do not dwell on the technical aspects of meta-analysis. Instead, we focus on contemporary features of the policy context and the potential roles of research synthesis, especially meta-analysis, as a tool for novice, journeyman, and expert policy analysts. In doing so, we briefly describe what is generally meant by the phrase public policy, highlighting its breadth and who might be involved. The discussion concludes with a characterization of how research synthesis practices and results can be influenced by the political processes associated with public policy making. Stated bluntly, politics can affect the way that analysts plan, execute, and report the findings of their studies. The end products represent a mixture of evidence and political ideology, thereby minimizing the value of what is supposed to be an orderly and relatively neutral view of evidence on a particular policy topic.

25.2 PUBLIC POLICY, POLICY MAKERS, AND THE POLICY CONTEXT

It is relatively easy to recognize when statements, opinions, and actions could be construed to represent instances of public policy action. However, the phrase public policy is amorphous. Strangely enough, it is rarely defined, even in books devoted to the topic. As such, we take a common sense approach to its definition by turning to the Merriam-
Webster dictionary for guidance. Looking at the dictionary definitions of each term separately, we find that the term public is defined as "of, relating to, or affecting the people as a whole" and the term policy is defined as "a definite course or method of action selected to guide and determine present and future decisions." A secondary definition indicates that policy entails "wisdom in the management of affairs." Putting these terms together, along with explicit and implicit meanings behind their definitions, public policy refers to a deliberate course of action that affects the people as a whole, now and in the future.

Public policies come in many shapes, sizes, and forms. They can range from broad policies such as national medical insurance to constrained or small-scale policies such as a city ordinance to curb littering. Further, many broad policies unleash a cascade of smaller "p" policies. For example, provisions in the 2001 reauthorization of Title I of the Elementary and Secondary Education Act (known popularly as No Child Left Behind) call for states to use scientifically based educational programs and practices. In turn, this law required additional smaller policy decisions such as defining what qualifies as scientific evidence.

The origins of policies can result from partisan sea changes or bipartisan collaboration. However, more often than not, they are the result of incremental changes and adjustments in existing public policies. These adjustments are likely to be the result of interactions among an elaborate network of policy actors. In a common scenario, policy makers (for example, members of the legislature) are likely to be approached by other members of the policy shaping communities (for example, lobbyists and special interest groups) about the merits or demerits of impending courses of action. Those seeking to influence a primary policy maker may use empirical evidence in supporting their positions. This is often an ongoing process, with input from multiple sources. At some point in this process, differences of opinion become evident, and the policy-relevant question becomes who is right about what works. For example, before reauthorization hearings, the credibility of the evidence underlying the positions of proponents and opponents became key to the fate of the Special Supplemental Food Program for Women, Infants, and Children (or WIC). These were the policy questions that framed one of GAO's first policy-related syntheses (GAO 1984).

25.2.1 Who Are the Policy Makers?

The foregoing discussion provides a classic view of policy making and implies an aristocratic view of the policy maker, with a single or few actors making decisions about the fate of programs (for example, the president, members of congress). Other views are more egalitarian about the composition of the policy community. For example, Lee Cronbach notes: "In a politically lively situation . . . a program becomes the province of a policy-shaping community, not of a lone decision maker or tight-knit group" (1982, 6, emphasis added).

In addition, the checks and balances built into democratic governing structures in the United States and elsewhere introduce a dizzying interplay of policy makers and actors within these policy shaping communities. Officials within the executive, legislative, and judicial sectors of federal, state, and local governments enact, implement, monitor, and revise policies. These actions stem from laws, regulations, executive orders, and judicial opinions, to name a few. However, as we will see, the array of policy actors is not limited to officials in these three branches of government. The private sector, including businesses, foundations, associations, and the public at large, often play key roles in policy decisions. In turn, these unofficial policy actors can be essential to the specification of research questions within meta-analytic studies, affecting what is and is not examined, how, and why.

The definition of a policy-shaping community for Cronbach is left to the particular policy circumstance. However, officials at the Agency for Healthcare Research and Quality (AHRQ, formerly the Agency for Health Care Policy and Research, or AHCPR) opened the doors widely in defining the policy-shaping community associated with its twelve evidence-based practice centers. According to AHRQ, these centers were charged with developing

evidence-based reports and technology assessments on topics relevant to clinical, social/behavioral, economic, and other health care organization and delivery issues, specifically those that are common, expensive, and/or significant for the Medicare and Medicaid populations. With this program, AHRQ became a science partner with private and public organizations in their efforts to improve the quality, effectiveness, and appropriateness of health care by synthesizing and facilitating the transition of evidence-based research findings. Topics are nominated by non-federal partners such as professional societies, health plans, insurers, employers, and patient groups. (2006, 1)

It is significant that the AHRQ statement appears to redefine the conventional notion of who are policy makers. The quote strongly emphasizes partnerships among public and private organizations and citizens. Affirming
this broader representation of policy makers, David Sackett, William Rosenberg, Muir Gray, Brian Haynes, and Scott Richardson refer to evidence-based medicine as a "hot topic for clinicians, public health practitioners, purchasers, planners, and the public" (1996, 71). The current interest in evidence-based practice shatters the old notion of policy making as a top-down enterprise by broadly defining potential users of research and syntheses. Stated differently, policy-oriented research syntheses can involve public policy issues that are intended for consumption by an array of public- and private-sector interests.

25.2.2 Competing Agendas and Perspectives

Public policy entails such fundamental issues as how to allocate resources, how to protect the public from natural and man-made harm, how to promote health, education, and welfare, and how to equalize opportunities. Whereas Cronbach's image of the policy shaping community appears polite and orderly, the actual process is likely to be less rational. In arriving at answers to these questions, the collection of actors within policy shaping communities (legislators, judges, lobbyists, and policy analysts) can, and often do, bring to the table a host of conflicting views, ideologies, and agendas. As a result of these differences, public policy actions often represent negotiated settlements across these multiple perspectives, but unilateral, mandated policies can also be an outcome of the process. Admittedly, neither is the most rational means of optimizing public policy, but public policy is at the heart of democracy.

As much as we may favor democracy, inherent in the idea of collective action is the realization that conflicting views, ideologies, and agendas bring with them the opportunities for the introduction of biases. Meta-analysts are not immune to the influence of biases introduced by their own predispositions or those of the organization within which they work (Wolf 1990). We return to this issue after an overview of research synthesis within the policy context, in the next section.

25.3 SYNTHESIS IN POLICY ARENAS

The statistical roots of meta-analysis date back to the early 1900s, when Karl Pearson attempted to aggregate the results of statistical studies on the effectiveness of a vaccine against typhoid (1904). Given that the vaccine was to be given to British soldiers at risk of typhoid fever, this could be regarded as the first use of meta-analytic techniques in a public policy context. Routine use of meta-analytic techniques, in general, did not appear until after classic papers by Gene Glass and Richard Light and Paul Smith were published (Glass 1976; Light and Smith 1974). Some policy-oriented analysts did not waste much time adopting and adapting Gene Glass's methods of meta-analysis to policy-oriented topics. As such, various forms of quantitative research synthesis have been used within the policy context for nearly three decades (see Cordray and Fischer 1995; Hunt 1997; Wachter and Straf 1990).

A number of books are particularly enlightening about the emergence of meta-analysis and research synthesis in policy areas. The first was a joint venture between the National Research Council's Committee on National Statistics (CNSTAT) and the Russell Sage Foundation (RSF). The volume, titled The Future of Meta-Analysis and edited by Kenneth Wachter and Miron Straf, is the result of a workshop to examine the strengths and weaknesses of meta-analysis, with particular attention to its application to public policy areas (1990). The workshop was convened by CNSTAT when it became apparent that the translation of meta-analytic methods to policy areas may require additional vigilance over the integrity of all phases of the research process.

The second book, How Science Takes Stock: The Story of Meta-Analysis, also sponsored by the Russell Sage Foundation, was written by Morton Hunt. In telling the early story of meta-analysis, Hunt chronicled numerous examples of how meta-analysis was used in examining the effects of policy topics in education, medicine, delinquency, and public health (1997). The book provides some rare and interesting behind-the-scenes accounts of these early applications. Many of these accounts are derived from candid interviews with authors of meta-analytic studies sponsored by the foundation and summarized in Meta-Analysis for Explanation (Cook et al. 1992).

Richard Light and David Pillemer provided a general overview of the science of synthesizing research (1984). However, their introduction set the stage for the use of these techniques in policy areas. The approach to the topic also is readily seen in GAO's early edition of their report on the Evaluation Synthesis methodology (see Cordray 1990; GAO 1992). Other books that highlight the use of research synthesis in policy-like contexts include Harris Cooper (1989), John Hunter and Frank Schmidt (1990), and John Hunter, Frank Schmidt, and Gregory Jackson (1982). Harris Cooper and Larry Hedges, whose work the Russell Sage Foundation also supported as part of its
efforts to infuse meta-analysis into the policy community and beyond, provided a technical and practical approach to research synthesis (1994). In addition, as described later in some detail, the Cochrane Collaboration represents a major driving force behind contemporary interest in research synthesis in the public sector (www.cochrane.org). Not only have many systematic reviews been conducted under the auspices of the Cochrane Collaboration, its organizational structure, procedures, and policies have served as an important prototype for other organizations such as the Campbell Collaboration and IES's What Works Clearinghouse (www.campbellcollaboration.org; www.ed.gov/ies/whatworks). Without meaning to overstate the case, it could be argued that the cumulative effect of these varied resources has significantly facilitated the use of quantitative methods in policy making endeavors. The next section takes a broad look at the role of syntheses in such efforts.

25.3.1 Some Historical Trends

How prevalent are policy-oriented quantitative syntheses such as meta-analyses?2 Although there are no official registries of policy-oriented quantitative syntheses, we attempted to examine their popularity by examining data bases on published and unpublished literature, searching the World Wide Web, and looking at specialized archives such as the Congressional Record. As seen in this section, policy-oriented research syntheses (more specifically, meta-analyses) are not easily identified by conventional search methods. For example, a search of thirty-four electronic data bases available through the Vanderbilt University libraries identified fewer than fifty entries when the search terms were meta-analysis and public policy. Other searches produced similar results.

Figure 25.1 shows three trend lines for references to policy, meta-analysis, and the joint occurrence of policy and meta-analysis in the Web of Science database for 1980 to 2006.3 Given the relative youth of research synthesis and, in particular, meta-analysis, it is not surprising to see that before 1990, fewer than seventy-five meta-analytic studies were captured in this bibliometric archive.4 After 1990, the growth in number of references was dramatic; by the year 2006, more than 3,000 references to meta-analysis were identified. A parallel growth is seen in the number of references to studies, syntheses, and other publications pertaining to policy; as with meta-analysis, the peak year for references to policy is 2006. In that year alone, a total of more than 12,000 references were found.

[Figure 25.1. Comparison of Publications. Three trend lines of Web of Science citations (0 to 14,000) by year, 1980 to 2006: policy only, meta-analysis only, and meta-analysis + policy. Source: authors' compilation.]

Despite the growth in published studies of meta-analysis, in general, and studies directed at policy issues, the explicit presence of meta-analysis within policy studies is strikingly low. We found no references to policy and meta-analysis in the Web of Science searches before 1990; by 2006, the joint use of these keywords increased to only seventy-one references. We are aware that restricting the search to the term meta-analysis per se probably underrepresents the true presence of meta-analytic-type syntheses in policy areas. On the other hand, the omission of either term suggests that the potential policy analytic role of meta-analysis has not yet taken a firm hold in the hearts and minds of analysts.

Other evidence of the infancy of meta-analysis in the macro-policy context is evident from the infrequent mention of meta-analysis per se in congressional hearings. A LexisNexis search of the Congressional Record of hearings from 1976 to 2006 showed a modest presence of the term in hearings. At its peak in 1998, it was mentioned in about sixty hearings. Given that there are, on average, about 1,500 hearings per year, meta-analysis was mentioned at most in about 4 percent of the hearings.

Although there has been interest in the use of meta-analysis and other forms of quantitative synthesis (like GAO's evaluation synthesis) in policy areas, its presence is not dominant, as indexed in readily available resources such as the Web of Science and the Congressional Record.
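The mechanics behind tallies of this sort are simple enough to sketch. The following is a minimal illustration, not the procedure actually used for figure 25.1: it assumes a hypothetical tab-delimited export of bibliographic records whose file name and column names (year, title, keywords) are purely illustrative, and it counts, per year, the records matching policy only, meta-analysis only, or both.

    # Minimal sketch of a yearly keyword tally, as might underlie
    # trend lines like those in figure 25.1. The input file and its
    # column names are illustrative assumptions, not the export
    # format of any particular bibliographic database.
    import csv
    from collections import Counter

    policy_only, meta_only, both = Counter(), Counter(), Counter()

    with open("bibliographic_export.tsv", newline="", encoding="utf-8") as f:
        for rec in csv.DictReader(f, delimiter="\t"):
            year = int(rec["year"])
            text = (rec["title"] + " " + rec["keywords"]).lower()
            has_policy = "policy" in text
            has_meta = "meta-analysis" in text
            if has_policy and has_meta:
                both[year] += 1        # joint occurrence of the keywords
            elif has_policy:
                policy_only[year] += 1
            elif has_meta:
                meta_only[year] += 1

    # One row per year: policy-only, meta-analysis-only, and joint counts
    for year in sorted(set(policy_only) | set(meta_only) | set(both)):
        print(year, policy_only[year], meta_only[year], both[year])

Run against a real export, a tally like this captures only what the indexing terms themselves capture, which is exactly the limitation noted above.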
As previously noted, this absence could be attributed to the relative youth of meta-analysis, but it also reflects a lack of policy identity on the part of meta-analysts. Contemporary meta-analysts may see their work contributing to the knowledge base as the primary goal; as a secondary goal, it is available to others who may want to use the results in the policy process. In later sections, we will see that there is a great deal of work being done with research synthesis methods in policy areas. We get the impression that meta-analysis is on the brink of becoming a major part of the evidence base underlying important questions about which programs and policies work (or not).

25.4 POTENTIAL BIASES

Although minimizing the influence of bias in all phases of the research process is a common goal for all investigators, Wachter and Straf and Frederic Wolf reminded us of the need to minimize biases in framing the questions, selecting studies for inclusion, making decisions regarding analysis, and reporting the results (Wachter and Straf 1990; Wolf 1990). David Cordray identified specific biases that can enter into these phases of the research process (1994). Deliberate efforts to minimize bias are especially important when social science methods are used to answer public policy questions. In fact, as described later, the importance of minimizing several forms of bias has resulted in the development of detailed procedures and policies for the conduct of policy-oriented syntheses by major organizations that promote the use of syntheses to inform policy makers and practitioners about what works, such as the Cochrane Collaboration. In this section, we present an extended discussion of an extraordinary example of how competing agendas and their associated biases can play out when meta-analysis is conducted on a highly volatile public policy topic. The specific case involves the Environmental Protection Agency's (EPA) analysis of the effects of secondhand smoke on health outcomes. Other examples reinforce the idea that politics and ideologies can affect synthesis practices.

25.4.1 The Environmental Protection Agency and Secondhand Smoke

In 1992, the Environmental Protection Agency issued a report that causally linked exposure to environmental tobacco smoke (ETS, or secondhand smoke) with the increased risk of lung cancer. The EPA concluded pointedly that ETS is a "Group A human carcinogen" that causes the cancer deaths of 3,000 Americans annually (1992). Using the Radon Gas and Indoor Air Quality Research Act of 1986 as justification for its study, EPA analysts and expert consultants summarized available studies through September of 1991. To estimate the risk of ETS on health outcomes, the EPA synthesized reports from twenty-seven case-control studies and four prospective cohort studies conducted across eight countries. These studies examined the effects of spousal smoking on lung cancer rates for nonsmoking females married to smoking spouses. Combining the results of epidemiological studies resulted in a relative risk (RR) ratio of 1.19. The 90 percent confidence interval for this estimate did not include 1.0, so EPA declared the RR as statistically significant (a 90 percent two-sided interval that excludes 1.0 is equivalent to rejecting the null hypothesis at the .10 level two-tailed, or at .05 one-tailed). In combination with their analysis of bioplausibility, EPA argued that the significant RR ratio allowed them to assert that secondhand smoke was a cause of adverse health outcomes. On release, this report had an immediate effect on policy. Actions included introduction of the Ban on Smoking in Federal Buildings Act and the introduction of the Smoke-Free Environment Act (HR 3434), eliminating smoking in public, nonresidential buildings.

The EPA ETS study was the subject of at least five congressional hearings, received substantial criticism, and resulted in an extended legal battle. Among the extended list of objections, critics first insisted that EPA altered its criteria for statistical significance, using a 90 percent confidence interval, when an earlier statistical analysis using the conventional 0.05 (two-tailed) criterion failed to reject the null hypothesis. Here, opponents insisted that the EPA "bent the rules of analysis to fit its own ends" (Gori 1994b) or stated that they performed an "unorthodox reanalysis" (Coggins 1994). Second, critics questioned the adequacy of EPA's assessment and management of confounding variables (Coggins 1994; Gravelle and Zimmerman 1994b; Redhead and Rowberg 1998). One analyst even suggested that control for confounds was impossible, given the nonexperimental nature of the primary studies (Gori 1994b). It is significant that critics accused the EPA of cherry-picking data (studies) to support a ban on smoking in the work place; that is, the EPA ignored or only partially reported disaffirming findings (Coggins 1994; Gori 1994a; Gravelle and Zimmerman 1994a). The EPA report failed to mention the absence of findings for work place risk or cancer risk, and the study results pertain only to exposure in the home. In an effort to sort out the merits of these criticisms, Congress asked the Congressional Research Service to examine the EPA report; they agreed with prior
criticisms and raised numerous additional problems with EPA's ETS study (1995).

After several rounds of criticisms and rebuttals, EPA found itself in an extended legal battle with an important member of the policy-shaping community: the tobacco industry (see Farland, Bayard, and Jinot 1994; Jinot and Bayard 1994). Ultimately, a federal district judge, Judge William Osteen, invalidated the study because of procedural and technical flaws (1998). On the procedural side, Osteen concluded in his ruling that EPA came to its conclusion about the danger of secondhand smoke long before, in 1989, they collected and synthesized the available empirical evidence. Further, in his ruling, Judge Osteen concluded that EPA "did not follow the law [the Radon Research Act] which requires them to seek the advice of a representative committee during the research" (89). He explicitly noted that because the tobacco industry was not represented, the assessment "would very possibly not have been conducted in the same manner nor reached the same conclusions" (89). On the technical side, the judge relied on published criticisms, using classic social science notions, that EPA disregarded information and made findings on selective information. EPA "deviated from its own Risk Assessment Guidelines by failing to disclose important findings and reasoning" and "left significant questions without answers" (88). Finally, changing the significance testing criteria from .05 to .10 was deemed inappropriate because it gave the impression that EPA was tailoring their analysis (adopting a more liberal standard) to support their 1989 position on the cancerous effects of secondhand smoke.

The Osteen ruling includes several points that highlight the EPA's flouting of scientific conventions and general wisdom. Large among them were adjusting significance a posteriori from .05 to .10 and excluding the tobacco industry from the development of their review. As a policy-directed meta-analysis, these two decisions were damaging, first, as an affront to the division of exploratory and confirmatory analysis, and, second, by restricting the protective peer review process to those with whom one already agrees. Proper peer review alone would have corrected both problems. It is not surprising, therefore, that the OMB would institute policy guidelines addressing these issues in government-funded research (Office of Management and Budget 2004). Variously, the new policy demanded consideration beyond what the data show to how the data might be made use of, balancing of oversight committees, and not least "transparency and openness, avoidance of real or perceived conflicts of interest, a workable process for public comment and involvement," along with adherence to predefined procedures (12).

In this example, EPA had a stake in the outcome of the synthesis. As documented in Judge Osteen's opinion, important procedural and technical decisions were influenced by EPA's policy agenda. For our purposes, the EPA-ETS example shows that meta-analytic studies can serve as an integral part of the policy-making process. However, it also shows that it can be subverted in the policy arena. The lesson to be drawn from this example is that checks and balances are needed to create a firewall between the science of research synthesis and the policy process.

25.4.2 National Institute of Education, Desegregation, and Achievement

In less dramatic examples, the effect of an analyst's perspective on research synthesis (and meta-analytic) results also is evident in the case of syntheses of the effectiveness of desegregation on student achievement. For example, the (then) National Institute of Education became worried about the radically different conclusions that experts in desegregation came to after synthesizing the literature. Because of these disparities, in 1982 the National Institute of Education (NIE) established a panel of six desegregation experts to examine if meta-analytic methods could be used to resolve prior differences among experts. Although the conclusions showed less variability, the role of prior perspective was still evident. Jeffrey Schneider provided an overview of the NIE project (1990). Harris Cooper's research on the NIE panel provided direct evidence of the influence of prior perspectives on meta-analytic practices and results (1986). That is, the meta-analytic results and conclusions presented by each expert closely paralleled the individual positions that each had offered in prior research and writing.

25.4.3 National Reading Panel and Phonics

In another example, the National Reading Panel (NRP) was charged with determining what practices work in classroom reading instruction. One of the six areas examined was phonics instruction. Using meta-analysis methods, the researchers found that systematic phonics instruction was more effective than other instructional methods (National Reading Panel 2000). Gregory Camilli, Sadako Vargas, and Michele Yurecko, however, took issue with this finding (2003). They replicated and extended the NRP
meta-analysis, finding that the NRP study overlooked other methods of instruction that were as or more effective than systematic phonics instruction. Here, the NRP's narrow focus on a specific mode of instruction appears to have resulted in national policy about phonics that downplayed the importance of other instructional practices. Harris Cooper rightfully noted that the work of Camilli and others was guided by their own set of biases in defining the scope of their research (2005). However, there is an extensive literature on the so-called reading wars (see Chall 1983, 1992, 2000) supporting the appropriateness of using the broader definitional framework used by Camilli and others (2003). In other words, even a quick synthesis of the literature and the debates it contains would have alerted members of the National Reading Panel that there are different perspectives (and stakeholders) on what counts as reading and instructional practices. A broader set of inclusion rules might have avoided some of the criticism that the NRP received on its report.

These examples, to varying degrees, suggest that meta-analysis methods, when pressed into public service, may not always serve the public well. The next two sections focus on how the effects of politics, personal perspectives, and methodological biases can be minimized. We focus on three major organizations that have been instrumental in devising procedures to maximize the quality of research syntheses, but before doing so, we describe the role of the evidence-based practice movement as a dominant force behind the use of research synthesis in policy areas.

25.5 EVIDENCE-BASED PRACTICES: FUEL FOR POLICY-ORIENTED META-ANALYSES

Interest in policy-oriented meta-analyses has been fueled by two recent and interrelated developments. The first represents a broad-based desire on the part of professionals (initially in medicine and now in other policy areas such as education and justice), federal and state legislators, and agency executives (in both the public and private sectors) to make decisions about policies, programs, products, and practices based on evidence of their effectiveness. Although this perspective has been referred to under varying labels (scientifically based practices, evidence-based medicine, evidence-based policy), the core idea is that programs supported by federal funds should be based on credible, empirical evidence. The second development is related to the first in that it defines what we mean by credible, empirical evidence. Following the lead from medicine, there has been a profound increase in the preference for the use of randomized control trials (RCTs) to assess the effects of what works (see Whitehurst 2003). This is a natural outgrowth of the interest in evidence-based practices (EBP) because RCTs provide the most trustworthy basis for estimating the causal effects of programs, policies, and practices (Boruch 1997; Boruch and Mosteller 2001). For the purposes of understanding the role of meta-analysis in policy arenas, we focus on the role of the EBP movement.

25.5.1 Organizational Foundations for Evidence-Based Practices

Evidence-based practice has become a popular refrain in the policy world (Cottingham, Maynard, and Stagner 2005). David Sackett and his colleagues traced the idea in medicine back to the mid-nineteenth century (1996). In support of this interest, a network of public and private organizations has appeared on the policy scene to assist in identifying relevant research, screening it for quality, synthesizing results across studies, disseminating results, and translating research evidence into practice guidelines. The organizations we highlight here are dedicated to promoting the production of unbiased syntheses of the effects of programs (interventions), practices (medical and otherwise), policies, and products. A premium is placed on the integrity of the synthesis process. They have established procedures that make the processes of selecting studies, evaluating their quality, and summarizing the results as transparent to the reader as is possible.

25.5.1.1 Cochrane Collaboration At the forefront of identifying practices that work was the Cochrane Collaboration (www.cochrane.org). Established in 1993, the Cochrane Collaboration, a nonprofit, worldwide organization named after epidemiologist Archie Cochrane, produces and disseminates systematic reviews of evidence from randomized trials on the effects of healthcare interventions (www.cochrane.org/docs/newcomersguide.htm). The productivity of Cochrane investigators (a largely volunteer workforce) has been nothing short of spectacular; as of this writing, 2,893 Cochrane reviews have been completed on an array of health-care topics. Because the Cochrane Collaboration is well established, previous reviews are regularly updated, providing consumers with the most recent evidence on what works (or does not work) in hundreds of health-care subspecialties. At the time of writing, the Cochrane Central Registry of Controlled Trials contains an astonishing 479,462 entries. Beyond its stunning level of production, the Cochrane Collaboration serves as
a model for the structure and operation of many organizations that followed in its footsteps.

25.5.1.2 Campbell Collaboration Named after one of the most influential figures in social science research, Donald T. Campbell, the Campbell Collaboration (C2) was founded in 2000 (www.campbellcollaboration.org).5 As with the Cochrane Collaboration, it is composed of a largely voluntary, worldwide network of researchers who direct their efforts at identifying effective social and educational interventions. In light of concerns about the influence of personal or organizational agendas, a well-structured synthesis process has been adopted. Synthesis protocols are critiqued by experts to minimize oversights in question formation, searches, and quality control. The Campbell Collaboration's raison d'être is to produce research syntheses where the methodology is transparent, that minimize the influence of conflicts of interest, and that are based on the best evidence (that is, its preference for and reliance on evidence from RCTs and high quality quasi-experiments). It shares these goals with its sibling organization, the Cochrane Collaboration. Both organizations have been the inspiration for the structure and functioning of numerous public agencies, such as the Institute of Education Sciences' What Works Clearinghouse, established in 2002 (www.ed.gov/ies/whatworks), and private organizations dedicated to promoting evidence-based practices, such as the Coalition for Evidence-Based Policy (www.evidencebasedprograms.org).

25.5.1.3 IES's What Works Clearinghouse Established in 2002, the What Works Clearinghouse (WWC) is a major organizational initiative supported by the Institute of Education Sciences (IES) within the U.S. Department of Education (www.whatworks.ed.gov). Its major goal is to synthesize what is known about topic areas that are nationally important to education.6 Studies associated with these areas are systematically gathered and subjected to a comprehensive three-stage critiquing process. Again, the focus of the WWC is on the effects of interventions: programs, policies, practices, and products. Studies are screened for their relevance to the focal topic, quality of outcome measures, and adequacy of the data reported. Studies that meet these initial screens are assessed for the strength of the evidence underlying the study's findings. There are three classifications for strength of the evidence: meets evidence standards, meets evidence standards with reservations, and does not meet evidence standards. The standards are clearly spelled out for each category and, similar to the Campbell Collaboration guidelines, evidence based on well-executed RCTs is given the highest ratings. In addition, results based on high quality quasi-experiments (for example, regression-discontinuity designs) and well-executed single-subject (case) designs also are viewed as meeting evidential standards. At this stage, studies that meet standards and meet standards with reservations are summarized, effect sizes are calculated, and an index of intervention improvement (the improvement index, which expresses the mean effect size as the expected change in percentile rank for an average comparison-group student) is calculated. Intervention reports are prepared that document, in great detail, all inclusion and exclusion decisions, ratings for standards of evidence, and the data used in the calculation of effect sizes; references to all documents considered in the synthesis are provided (sorted by whether they met or did not meet the standards of evidence). The third rating examines the strength of the evidence across studies, for a given intervention. Considerations for these ratings entail a combination of design quality and the degree of consistency of results across studies. To date, the WWC has produced thirty-one intervention reports on several different topic areas. The depth and clarity of these reports are remarkable.

25.5.1.4 Organizations Around the World The Cochrane Collaboration has a network of centers like the Canadian Cochrane Network and Centre (http://cochrane.mcmaster.ca) located across the globe. Not surprisingly, in addition to the Cochrane Collaboration, there are numerous public and private organizations with long-standing commitments to evidence-based practices and the use of systematic reviews to identify what works around the world. Edward Mullen has compiled a listing of web resources that identify organizations that foster evidence-based policies and practices in, for example, Australia, Canada, Latin America, Sweden, the United Kingdom, the United States, and New Zealand (2004). Some of these organizations were founded at about the same time as the Cochrane Collaboration. The EPPI-Centre at the University of London, to take one example, was founded in 1990.

25.5.2 Additional Government Policies and Actions

During the past decade, the U.S. Congress and executive branch departments and agencies have set the stage for greatly enhanced use of meta-analytic methods in policy areas. In particular, there has been a promulgation of laws (for example, the No Child Left Behind legislation calling for educational programs based on scientifically based research) and policies requiring the use of practices that have been scientifically shown to be effective. Agencies such as the Agency for Healthcare Research and
Quality, SAMHSA's National Registry of Evidence-based Programs and Practices (NREPP), the joint venture of the Agency for Healthcare Research and Quality (AHRQ), the American Medical Association (AMA), and the American Public Health Association's (APHA) National Guideline Clearinghouse, and the Center for Disease Control's Community Guides have devoted significant resources to identifying evidence-based practices.

Interest in improving the efficiency, effectiveness, and cost-effectiveness of programs supported by subnational entities also is evident. For example, the Washington State legislature has appropriated funds to support analyses of the effectiveness and cost-effectiveness of programs, policies, and practices. To this end, the Washington State Institute for Public Policy has issued several reports summarizing the results of their synthesis of a range of programs that operate within the state of Washington (for example, Aos et al. 2004). The analysis is exemplary for the technical care with which Steve Aos and his colleagues conducted their analysis of both the cost and the effects of programs and also for the clarity of the presentation. They provide a model for how complex and technically sophisticated policy-oriented syntheses can be reported to meet the needs of a variety of potential consumers.

The foregoing shows that a broadly represented policy shaping community, including members of Congress, governmental agencies, courts, universities, and professional associations, has been at work promoting and facilitating the use of RCTs and methods for identifying evidence-based practices that work. We would be remiss if we did not acknowledge the important role that private foundations have played in the promotion of meta-analysis as a tool for aiding public policy analysis. The Russell Sage Foundation has made a considerable commitment to understanding how to optimize meta-analytic methods in the policy context. The Smith-Richardson and the Hewlett Foundations have provided funds to investigators involved with the Campbell Collaboration (see Cottingham, Maynard, and Stagner 2005). The Coalition for Evidence-Based Policy has been supported by the Hull Family Foundation, the Rockefeller Foundation, and the William T. Grant Foundation.

25.6 GENERAL TYPES OF POLICY SYNTHESES

The predominant questions within policy-oriented research syntheses revolve around questions of the effects of programs, policies, practices, and products. In the policy context, at least two types of research syntheses are directed at these types of questions. One of these entails questions from identifiable members of the policy-making community. Another addresses policy-relevant questions without explicit input from an identifiable member of the policy-making community. Each type implies a different approach to the formulation of the questions to be addressed. We refer to the former as policy-driven because it entails translating a request by policy makers into researchable questions that can be answered, at least in principle, with available research synthesis techniques. We refer to the latter as investigator-initiated. Such syntheses address topics that are potentially policy-relevant, ranging from simple main effect questions like What works? to more comprehensive questions of What works, for whom, under what circumstances, how, and why?7 Here, the analyst has considerable degrees of freedom in specifying the nature and scope of the investigation. Of course, investigator-initiated syntheses must make a series of assumptions about the informational needs (now and in the future) of members of the policy community. We describe the question formulation process of each type of policy synthesis in turn.

25.6.1 Policy-Driven Research Syntheses

Policy-driven research syntheses, those entailing a direct request from policy makers, are relatively rare. These requests can be broad or narrow. For example, the mandate issued to the Washington State Institute for Public Policy by the Washington legislature directed them to identify "evidence-based programs that can lower crime and give Washington taxpayers a good return on their money" (2005, 1). Here, analysts were required to estimate the effects of programs, their cost, and their cost-effectiveness. Other policy-driven requests are more narrowly focused on questions about specific programs, policies, or practices that are under active policy consideration. GAO's evaluation synthesis method was designed explicitly for this purpose (see Cordray 1990; GAO 1992).8 Despite their rarity, the available examples of policy-driven studies undertaken in direct response to questions raised by specific policy makers are instructive in illustrating the kinds of issues that typify policy debates.

As noted earlier, questions of who is right about what works arise and can be profitably addressed by research synthesis methods. Here, GAO has undertaken several policy-driven syntheses. In addition to its synthesis of the WIC program, the credibility of evidence was focal to their analysis of debates on federal and state initiatives to
By design, GAO structures its syntheses around the issues defining all sides of the policy debate. GAO's approach to its synthesis of the effects of WIC was directed at the policy debate about its effectiveness versus counterclaims of methodological inadequacies. In the end, both sides of the debate were correct. Some credible evidence supported the contention that WIC was effective, and some that methodological inadequacies were associated with many previous evaluations of WIC (GAO 1984). GAO's contribution to the debate was that it addressed the policy concerns directly and was able to provide the policy makers with a balanced analysis of what is known about the effects of WIC and what is not, either because studies had not been conducted or because they were too flawed to be relied on.

Unlike the question formulation process in most types of research, where the investigator has latitude in determining what will be examined and how, in policy-driven syntheses the questions are predetermined by, and often negotiated with, actual policy makers. As seen in previous examples, these types of syntheses often involve a desire on the part of the policy maker to know more than answers to questions of what works. Additional questions seek differentiated answers about program effects for different demographic subgroups, effects on individuals and higher levels of aggregation (for example, the marketplace), and effects across different intervention models or program components (for example, what works best).

25.6.2 Investigator-Initiated Policy Syntheses

Investigator-initiated syntheses can be regarded as potentially relevant to policy questions because they exist largely as the result of the efforts of organizations described in the last section. That is, many of these potentially relevant syntheses are readily available in electronic archives, such as those by Campbell and Cochrane collaborators, and ready to be used as policy issues arise for a variety of policy makers. Whereas the policy-driven syntheses are reactive to the needs of specific policy makers, those labeled potentially relevant can be viewed as prospective or proactive, attempting to answer policy questions before they emerge in the policy context. They are considered potentially useful because they are often completed without an explicit connection to a specific policy issue or debate. As such, their utility depends on the intuitions of the investigator about what is important and their skills formulating the questions to be addressed (see Cottingham, Maynard, and Stagner 2005).

25.7 MINIMIZING BIAS AND MAXIMIZING UTILITY

Because the policy-driven form of synthesis is rare and generally depends idiosyncratically on the needs of those requesting the synthesis, the issues addressed in this section are most applicable to investigator-initiated syntheses. Although these types of syntheses are often obliged to follow the standards and practices developed by one of the major synthesis organizations, such as the Campbell Collaboration, the investigator still has some degrees of freedom in specifying the scope of their synthesis (Cottingham, Maynard, and Stagner 2005).

Earlier we provided examples of some of the ways policy-oriented syntheses can be adversely influenced by political pressures, ideologies, and wishful thinking of analysts. We suggested the need for a firewall to insulate as much as possible the science of synthesizing evidence from these political forces. It is not possible to provide an exact set of blueprints for constructing such a firewall. On the other hand, it is possible to highlight aspects of the synthesis process that are vulnerable to biases. The most vulnerable aspects discussed here include formulating questions, searching literature, performing analyses, and reporting results.

25.7.1 Principles for Achieving Transparency
The policies and practices promulgated and followed by the Cochrane Collaboration and the Campbell Collaboration serve as the building blocks for the firewall needed to minimize the influence of politics on policy syntheses. Both organizations place premiums on conducting reviews and syntheses that are as transparent as possible throughout the synthesis process. To achieve this transparency, both organizations have established processes that allow for other experts in the field to evaluate the scope and quality of the study. These peer reviews are conducted at the time the study is proposed and before it is accepted as a Cochrane or Campbell Collaboration product.

To ensure that relevant issues are considered before previous studies are synthesized, Cochrane and Campbell investigators are asked to submit the proposed protocol they will use to execute their synthesis. The items to be included in the protocol include statements about the need for the synthesis, a synopsis of any substantive debates in the literature (or policy communities), and a detailed plan for study selection, retrieval, data extraction, and analysis. The proposal also requests synthesists to state in writing any real or perceived conflicts of interest pertaining to the substance of the work. Finalized protocol proposals are published within the archives. The final report for each synthesis is also examined to ensure that the study plan was followed and that analyses are fair and balanced.

25.7.2 Question Specification

Questions about public policies can take on a number of forms and require the use of a variety of social science methods. Although meta-analytic methods have been used to answer questions like what the prevalence of a problem is (see Lehman and Cordray 1993), the recent interest in evidence-based practices is squarely focused on identifying programs (interventions), policies, practices, and products that work. To answer these questions, several aspects of the synthesis need to be specified, and several key decisions affect the nature and scope of the synthesis. Chief among these decisions is the specification of what we mean by the "what" of the "what works" question. Several alternative formulations of "what" will affect the synthesis and its utility to the intended audience.

A range of advice has been offered in defining the object (the "what") of policy-oriented syntheses. Phoebe Cottingham, Rebecca Maynard, and Matt Stagner held that "policy-makers and practitioners often view programs and policies differently" (2005, 285). For policy makers, a program may represent a funding stream, with multiple components being implemented in an effort to achieve a specific goal, such as improving achievement in U.S. public schools. To be responsive and useful, a synthesis directed at this type of policy maker would require that all programs and other policy actions be included in the funding stream. Aos and his colleagues (2004) structured their synthesis across a number of policy goals, identifying individual programs directed at the broad goals.

On the other hand, practitioners such as teachers and principals may define a program by the precise shape of its content (Cottingham, Maynard, and Stagner 2005). In an effort to generate and use evidence to guide public policy, Phoebe Cottingham and her colleagues recommended that syntheses be structured around well-defined synthesis topics more specifically focused on particular interventions (2005). In light of the emphasis on evidence-based practices in legislation like No Child Left Behind, the difference between policy makers and practitioners in views of programs is not as stark as Cottingham and her colleagues initially portrayed (2005). Rather than focusing on broad policy goals, policy makers in Congress sided with the needs of practitioners.

25.7.2.1 Optimizing Utility The different emphases of policy makers and practitioners suggest that the utility of a given synthesis could be jeopardized for one of these two distinct audiences, depending on whether the definition of the policy object was broadly or narrowly defined. Several practices have been used to optimize the utility of a given synthesis across potential audiences. For example, in their study of the effects of Comprehensive School Reform (CSR) on student achievement, Geoffrey Borman and his colleagues constructed the meta-analysis so that they could estimate the overall effects of CSR as a policy and the separate effects of the twenty-nine CSR program models that had been implemented in schools (2003). The Institute of Education Sciences (IES) What Works Clearinghouse develops two types of reports to maximize the utility of the work for multiple audiences. Intervention Reports are targeted to specific programs (for example, Too Good 4 Drugs) and products (publishers' reading programs like Read Well). Topic Reports are designed to examine effects of multiple programs, practices, and products on a focal area in education. This combination of products meets the needs of multiple audiences. Borman and his colleagues' analysis of the effects of different comprehensive school reform models and the overall effects across models is another example of how the effects can be presented to meet multiple information needs of users (2003).
Of course, examining effects across interventions is not simply a matter of aggregating the results from separate intervention reports. Between-study (for example, quality of the research design, types of outcomes, and implementation fidelity) and between-intervention (for example, types of clients served, costs, and intervention duration) factors need to be accounted for in the coding protocol and analyses. Sandra Wilson and Mark Lipsey provided an example of this form of synthesis (2006). Mark Lipsey's analysis of juvenile delinquency intervention models shows the differential effects of conceptually distinct types of intervention models, such as behavioral and affective (1992). These types of syntheses need to be conceptualized as such early in the synthesis process. The coding protocols for such syntheses are much more complex than those for studies that focus on estimating simple average effects based on the unit of assignment to conditions, for example, WWC Intervention Reports (see Wilson and Lipsey 2006; Lipsey 1992). This approach is often referred to as an intent-to-treat analysis, retaining all participants in the analysis regardless of whether they complied with their treatment assignment (for example, drop-outs, cross-overs, or incomplete receipt of the intended treatment dose).
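To make the distinction concrete, a minimal sketch with invented data follows; the intent-to-treat contrast is formed on the assigned arms regardless of compliance, whereas a per-protocol contrast conditions on actual receipt and forfeits the protection of randomization. All names and values below are hypothetical.

    # Intent-to-treat sketch with hypothetical data: every participant is
    # analyzed in the arm to which he or she was randomized, whether or
    # not the treatment was actually received.
    import numpy as np

    assigned = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # 1 = treatment arm
    received = np.array([1, 1, 0, 0, 0, 0, 1, 0])   # dropouts and crossovers
    outcome  = np.array([12., 14., 9., 11., 8., 10., 13., 7.])

    # intent-to-treat estimate: group by assignment, ignore compliance
    itt = outcome[assigned == 1].mean() - outcome[assigned == 0].mean()

    # a per-protocol contrast, by comparison, conditions on receipt and
    # thereby breaks the comparability that randomization created
    pp = outcome[received == 1].mean() - outcome[received == 0].mean()

    print(f"ITT estimate: {itt:.2f}; per-protocol estimate: {pp:.2f}")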
25.7.3 Searches

EPA's meta-analysis on the effects of secondhand smoke on health status did not include every relevant study (1992). Representatives of the tobacco industry were quick to point out this omission. Had the results of those studies been included, the aggregate results would have shown even weaker effects of secondhand smoke. The exclusion of other interventions (beyond systematic phonics instruction) in the meta-analysis conducted for the National Reading Panel (2000) was spotted by Camilli and others (2003). Further, their reanalysis of the meta-analytic database found additional relevant studies that had not been included in the original NRP syntheses. It also showed different results and implied different policy conclusions than those offered by the NRP. Both of these examples highlight the critical importance of a comprehensive search, retrieval, coding, and analysis process. The comprehensiveness of the search process is as important to policy-oriented syntheses as traditional ones. Is there anything unique about the search process in policy-oriented syntheses? An examination of the search strategies in such studies suggests a few issues that deserve attention.

Disputes about the credibility of evidence underlying a program's effects need to be carefully examined. These can arise from differences in perspectives about the definition of evidence, from selective use of the literature by opponents and proponents of the policy or program, and from differences in which and how outcomes are measured (GAO 1984, 1986, 1987). Thus the first consideration in specifying the parameters of a search is a thorough understanding of the policy debates and the evidence used by parties in those debates.

A broader basis for the search is likely to be needed, relative to a more traditional synthesis of the academic literature base. Lipsey and Wilson provided a broad-based, policy-relevant list of databases and electronic sources (2001, appendix A). Their list reminds us that evaluation reports do not always become part of the published literature. Borman and his colleagues used two databases developed by a research firm (American Institutes for Research, or AIR) and a regional education lab supported by the U.S. Department of Education (Northwest Regional Education Laboratory, or NWREL) to identify districts that implemented and evaluated Comprehensive School Reform models (2003). GAO relied on agency field offices in its search for local evaluations of the WIC program (1984). Early syntheses are another albeit obvious resource. The Campbell Collaboration now houses SPECTR, a repository of experiments in education, justice, and other social areas.

Contacting experts and program developers and consulting program websites and agency officials to ensure the comprehensiveness of lists of prior studies is becoming common practice. The foregoing is not to imply that all studies must be included in the synthesis but instead that all relevant studies must be considered within the synthesis. Those that fail to meet quality standards can be identified and legitimately excluded from further consideration. Agreement on the exclusion of studies, however, is not widespread.

25.7.4 Screening Studies
As previously described, the WWC uses a well-structured set of procedures to screen studies on their relevance to synthesis goals. Screening studies for adequate population coverage, the intervention (for example, Wilson and Lipsey 2006), outcomes, geography, and time frame is particularly important when the policy questions entail changes to existing laws or practices connected to the governance structure of a particular country. Cottingham and her colleagues noted that the welfare system in the United Kingdom is different enough from its counterpart in the United States that search strategies need to be delimited to studies from a given country (2005). These considerations depend on the policy topic. For example, studies on the effects of specific medical treatments, such as secondhand smoke and radiofrequency denervation, may not require such delimitations.

Screening for policy relevance is the first part of the study selection task. The final set of studies to be included in a synthesis will be those that pass screens on their relevance to the policy area and on quality of evidence standards.

25.7.5 Standards of Evidence

Within the policy context, the credibility of evidence underlying the studies included in a synthesis is fundamental. We saw that concerns about the inadequate quality of the epidemiological studies, mainly case-control studies, were, in part, the basis for discrediting the meta-analysis by EPA on the effects of secondhand smoke (EPA 1992). Concerns about the quality of evidence by opponents of the WIC program were pivotal in instigating Senator Helms's request for the synthesis by GAO (1984). In both these cases, the questions turned around whether the research designs were adequate to support causal claims of the effects of the program (WIC) or purported causal agent (secondhand smoke). Concern about the quality of evidence underlying claims of scientifically based practices was important enough to the framers of the reauthorization of ESEA Title I (No Child Left Behind) to compel them to repeat their request more than 100 times throughout the legislation. Interest in evidence-based practices has prompted a rash of meta-analyses that redo prior syntheses in an effort to isolate effects based on randomized experiments rather than all available literature.

The major organizations conducting policy-oriented research syntheses (the Cochrane Collaboration, the Campbell Collaboration, and the What Works Clearinghouse within the Institute of Education Sciences) follow the premise that evidence from randomized controlled trials should be the gold standard for judging whether an intervention works. An intent-to-treat estimate from well-executed RCTs is regarded as the most compelling evidence of the direction and the magnitude of an intervention's effect. These estimates provide a statistically sound basis for meta-analysis and other forms of quantitative synthesis (for example, cost-effectiveness analysis).

On the other hand, requiring that interventions be evaluated with RCTs is a high standard that can affect the utility of policy-oriented syntheses. To the extent that earlier studies fail to meet this standard, the policy community has to revert to decision making based on nonempirical considerations. This prospect has resulted in a vigorous discussion about how to accommodate the trade-off between wanting to rely on high standards of evidence and wanting to contribute to the policy discussion in a useful and timely fashion. Three general approaches have been used. The first entails classifying studies based on the degree to which they meet technical standards. We refer to this as the level of evidence approach. The second approach can be thought of as a variation of the first and was coined best-evidence synthesis by Robert Slavin (1986, 2002). The third uses statistical models (for example, meta-regression) to control for the influence of study quality. This approach is well entrenched in the short methodological history of meta-analysis. We refer to this as the meta-regression adjustment approach.

25.7.5.1 Level of Evidence Approach The main idea behind the level of evidence approach is that some designs produce estimates that are more or less trustworthy. As such, the synthesis should take these different levels of credibility into account. We identify two forms of this approach. First is the purist synthesis, which is based only on the results from randomized control trials (RCTs). A variation segments or disaggregates the results into more than the two categories implied in the purist case (that is, meets evidence standards or not).

Several syntheses undertaken within the Campbell Collaboration guidelines exemplify the purist approach (Petrosino, Turpin-Petrosino, and Buehler 2003; Nye, Turner, and Schwartz 2003; Visher, Winterfield, and Coggeshall 2006). Not all syntheses conducted under the auspices of the Campbell Collaboration hold as firmly to this purist notion (for example, Wilson and Lipsey 2006). The synthesis by Anthony Petrosino and his colleagues examined the effects of Scared Straight and other juvenile awareness programs for preventing juvenile delinquency (2003). They uncovered 487 studies, of which thirty were evaluations and eleven potentially randomized trials. After additional screenings, they included seven studies in their meta-analysis. Here, the consequence of holding high standards is a reduced number of available studies. This can become a severe problem. Visher and her colleagues found only eight RCTs, but only after expanding the scope of the synthesis (2006). Nye and his colleagues, in their synthesis of parent involvement on achievement, had somewhat more success in finding RCTs, in that nineteen met criteria (2003). This synthesis, though focused on intent-to-treat estimates, departed from the purist point of view by incorporating a series of interesting post hoc and moderator analyses.
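The analytic step that follows such a screen is straightforward; the cost is statistical. A minimal sketch, using invented effect sizes and variances for a handful of surviving RCTs, pools them with conventional inverse-variance weights and makes visible how wide the resulting confidence interval can be with so few studies:

    # Pooling a small set of RCT effect sizes with inverse-variance
    # weights. The d values and variances are hypothetical, not figures
    # from any published synthesis.
    import numpy as np

    d = np.array([0.30, 0.12, -0.05, 0.22, 0.40, 0.08, 0.15])  # effect sizes
    v = np.array([0.04, 0.09, 0.06, 0.05, 0.10, 0.07, 0.08])   # variances

    w = 1.0 / v                           # inverse-variance weights
    d_bar = np.sum(w * d) / np.sum(w)     # pooled (fixed-effect) estimate
    se = np.sqrt(1.0 / np.sum(w))         # standard error of the pooled estimate
    lo, hi = d_bar - 1.96 * se, d_bar + 1.96 * se

    print(f"pooled d = {d_bar:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")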
25.7.5.2 Segmentation by Credibility or Strength of Evidence Rather than creating a dichotomy like acceptable or unacceptable, the technical adequacy of a study can be characterized in a more differentiated fashion, allowing for degrees of acceptability. As described earlier, IES and the What Works Clearinghouse have developed three categories of evidence: meets standards, meets standards with reservations, and does not meet standards. The intervention reports classify each study according to this scheme and report on the effects of studies in the first two categories. To date, the WWC synthesists have completed reports for thirty-one specific interventions in six policy areas. Of these, eighteen were based on at least one study that met evidence standards. The evidence was based on a well-executed randomized control trial, a quality quasi-experiment, or a quality single-subject research design. In thirteen reports, the highest level of design quality was meets standards with reservations. Often, an intervention involved a mixture of studies that met standards or met standards with reservations. Only ten reports were based on at least one study that met the highest standards and did not rest on any other studies that met standards but with reservations. It should be apparent that this form of segmentation preserves the goal of identifying interventions that are based on the most trustworthy evidence (RCTs, high quality quasi-experiments, and well-executed single-subject or case studies), but also provides policy makers and practitioners with an overview of what is known from studies with less credibility. Although not operating under the guidelines of any organization, Borman and his colleagues used a segmentation scheme that could be plainly understood by a variety of user groups (2003). They classified program models into four groups: strongest evidence, highly promising, promising, and in greatest need for additional research. With proper caveats, the segmentation of interventions by variation in the strength of evidence on its effectiveness gives the users something to base their decisions on, rather than having to rely on expert or personal opinion in picking an intervention or to stay the course.

25.7.5.3 Best-Evidence Syntheses Robert Slavin describes best-evidence synthesis as an alternative to meta-analysis and traditional reviews (1986, 2002). It uses two of the major features of meta-analysis (quantification of results as effect sizes and systematic study selection) but argues that the synthesis should be based on the best available evidence. If RCTs are not available, the next best evidence would be used. Over time, as results from RCTs become available, the effects of an intervention would be reexamined in light of this new and more compelling evidence base. Although Slavin's ideas had not yet appeared in the literature at the time GAO conducted its research synthesis of the WIC program, analysts used a form of best-evidence synthesis in selecting the subsets of studies that were used to derive their conclusions for each question posed by Senator Helms. GAO analysts and their consultants (David Cordray and Richard Light) rated studies on ten dimensions. Studies deemed adequate on four of these dimensions were used in the synthesis. None of the available studies were rated a 10. There were too few studies (for example, six for the main outcome on birth weight) to segment them into good, better, and best categories.

25.7.5.4 Meta-Regression Adjustment Approach One of the most common forms of synthesis entails the use of regression models, hence the term meta-regression. This represents a general analytic approach to account for between-study differences in methodology, measurement, participants, and attributes of the intervention under consideration. Rather than segmenting the studies into levels of evidence, design differences (and more) are examined empirically. In their initial analysis, Borman and his colleagues included all studies, regardless of the method, from pre-post only through randomized and varying types of quasi-experiments (2003). Their approach also examined the effects of other methods variables, reform attributes (parental governance), cost, and other nonmethods factors that could affect the results, such as the degree of developer support available in each study.
488 TYING RESEARCH SYNTHESIS TO SUBSTANTIVE ISSUES

As noted, not all syntheses conducted under the auspices of the Campbell Collaboration are restricted to RCTs. Wilson and Lipsey included studies with control groups, both random and nonrandom (2006). In their study, the nonrandom studies needed to show evidence of initial equivalence on demographic variables. They found eighty-nine eligible reports and included seventy-three in their synthesis. Their analyses controlled for four methods factors. In a unique approach to identifying potential moderator variables, they examined the influence of program and participant variables on the outcome (effect sizes), controlling for the influence of between-study methods artifacts. Moderator variables that survived this screen were included in their final models. They reported no differences in effects between studies based on randomized designs and those based on well-controlled, nonrandom designs. Had they found differences between these two types of designs, additional controls would have been needed (for example, additional stratifications or sensitivity analyses).
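To make the mechanics concrete, the sketch below runs a minimal weighted meta-regression in Python. The effect sizes, variances, and the randomized-design dummy are invented for illustration; a fuller analysis would of course include the larger set of methods, participant, and intervention covariates discussed above.

    # Minimal meta-regression sketch (hypothetical data): effect sizes are
    # regressed on a study-level covariate, here a dummy for randomized
    # design, using inverse-variance weights. The coefficient on the dummy
    # estimates how much design quality shifts the observed effects.
    import numpy as np

    d   = np.array([0.45, 0.30, 0.15, 0.25, 0.50, 0.10])  # effect sizes
    v   = np.array([0.02, 0.03, 0.05, 0.04, 0.06, 0.03])  # sampling variances
    rct = np.array([1,    1,    0,    0,    1,    0   ])  # 1 = randomized

    X = np.column_stack([np.ones_like(d), rct])  # intercept + design dummy
    w = 1.0 / v

    # weighted least squares via the normal equations: (X'WX) beta = X'Wd
    XtW = X.T * w
    beta = np.linalg.solve(XtW @ X, XtW @ d)

    print(f"adjusted mean effect (nonrandomized): {beta[0]:.3f}")
    print(f"shift associated with randomization:  {beta[1]:.3f}")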
25.7.5.5 Combined Strategies Using two or more of the three approaches outlined above to control or adjust for differential design quality does not appear to be as pervasive in the recent synthesis literature as is warranted. However, as noted, one good example of how to meet the information needs of multiple perspectives on a given policy or program is provided by Borman and his colleagues (2003). They first examined the overall effects of the comprehensive school reform models as a broad-based program. Using a meta-regression approach, they included all studies, regardless of the research design, and segmented the models according to the overall strength of the evidence upon which they rested. They offered extended analyses of those models based on the best-evidence approach. Fortunately, unlike the 1984 GAO analysis of WIC, several models had been evaluated with the best types of research designs. The usefulness of the results was enhanced by other variables being included in the meta-regression. These other policy-oriented variables provided the reader with an idea of whether the features of the program made a difference. We use the term idea to highlight the fact that these results are inherently correlational unless key features of the treatment are randomly assigned to persons within conditions.
25.7.6 Moderator Variables

Within the evidence-based practices movement, the emphasis is on answering the causal question, what works? Reliance on RCTs seems to have diverted attention from answering an important broader set of policy questions about for whom, under what circumstances, how, and why. These are given less priority because they are hard to answer within the paradigm focused on the use of RCTs. Potential moderator variables like implementation fidelity, attributes of the participants, and context factors are not randomly assigned to studies, and their influence is not easy to estimate because of specification errors. Including moderator variables ought to be encouraged because they are potentially valuable to policy makers, researchers, program developers, and other practitioners. However included, it is important that readers are provided a thorough discussion of the strengths and weaknesses of the results presented. With that caveat in mind, we present a few examples that show the potential value of analyses of moderator variables to a range of policy makers. Nye and others showed that length of parent involvement interventions was not related to effects but that the type of parent involvement was (2003). Here, providing incentives to parents made a difference, as did parent education and training. Their analyses also revealed effects for reading and math achievement, but not for science. In another example, Harris Cooper, Jorgianne Robinson, and Erika Patall showed that part of the confusion that pervaded the literature about the effects of homework on achievement could be resolved by differentiating effects by grade and developmental stage (2006). Cooper and his colleagues' analysis is a very good example of how meta-analyses not exclusively focused on main effects, such as intent-to-treat estimates, can be used to generate important hypotheses for future research.

Given the breadth of policy issues, it is difficult to say much about the types of moderator variables that should or could be included in policy-oriented syntheses. However, based on the kinds of issues raised by policy makers and practitioners, a few seem as if they would be generically policy relevant. Our short list of potentially relevant moderator variables includes implementation fidelity; duration, frequency, and intensity of interventions; setting within which the intervention was tested (demonstration school versus regular school); who tested the intervention (developer or third party); intervention (program) cost; and risk categories for participants.
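A minimal sketch of one common form of moderator analysis, the subgroup comparison, follows. The data and the moderator (developer-led versus third-party evaluation, one item from the list above) are hypothetical.

    # Simple moderator (subgroup) analysis: studies are split by a coded
    # feature and the subgroup means are compared with the usual
    # Q-between statistic (chi-square with one degree of freedom for
    # two groups).
    import math
    import numpy as np

    d = np.array([0.35, 0.42, 0.28, 0.10, 0.05, 0.15])  # effect sizes
    v = np.array([0.03, 0.04, 0.05, 0.03, 0.04, 0.05])  # variances
    developer_led = np.array([1, 1, 1, 0, 0, 0])        # moderator code

    def pooled(dd, vv):
        w = 1.0 / vv
        return np.sum(w * dd) / np.sum(w), 1.0 / np.sum(w)  # mean, variance

    m1, v1 = pooled(d[developer_led == 1], v[developer_led == 1])
    m0, v0 = pooled(d[developer_led == 0], v[developer_led == 0])

    # with two groups, Q-between is the squared standardized difference;
    # its upper tail under chi-square(1) equals erfc(sqrt(Q/2))
    q_between = (m1 - m0) ** 2 / (v1 + v0)
    p = math.erfc(math.sqrt(q_between / 2))

    print(f"developer-led mean d = {m1:.2f}, third-party mean d = {m0:.2f}")
    print(f"Q-between = {q_between:.2f}, p = {p:.3f}")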
25.7.7 Reporting and Communicating Results

Effect sizes have been the workhorse of syntheses. They have definable statistical qualities. They have a certain degree of intuitive appeal. It is relatively easy to understand odds ratios. The Glass-Hedges effect size calculation is not complex and can be easily interpreted within normative frameworks that depend on sensible conventions (Cohen 1988). On the other hand, their intuitive appeal is not universal. The translation of an effect size into more meaningful indicators has received some well-deserved attention. For example, the improvement index used by the What Works Clearinghouse tells the reader how much difference an intervention is expected to make relative to the average of the counterfactual group. Rather than being expressed as an effect size, the index is expressed as a percentage of improvement that ranges between -50 and 50. Whether this index is meaningful to policy makers and other users is not yet clear.
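For readers who want the calculations behind these two devices, here is a brief sketch with invented summary statistics. The standardized mean difference follows the usual Glass/Hedges computation (with the small-sample correction), and the improvement index is computed as the percentile standing of the average treated member in the control distribution minus 50, which is what bounds it between -50 and 50.

    # Standardized mean difference and improvement index, hypothetical
    # treatment and control summaries.
    from math import sqrt
    from statistics import NormalDist

    m_t, s_t, n_t = 105.0, 15.0, 60   # treatment mean, SD, n
    m_c, s_c, n_c = 100.0, 14.0, 60   # control mean, SD, n

    s_pooled = sqrt(((n_t - 1) * s_t**2 + (n_c - 1) * s_c**2) / (n_t + n_c - 2))
    d = (m_t - m_c) / s_pooled
    g = d * (1 - 3 / (4 * (n_t + n_c) - 9))   # Hedges small-sample correction

    # percentile rank of the average treated member in the control
    # distribution, minus 50: ranges from -50 to +50
    improvement_index = (NormalDist().cdf(g) - 0.5) * 100

    print(f"d = {d:.3f}, g = {g:.3f}, improvement index = {improvement_index:.1f}")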
In another effort to translate effect sizes into meaningful units, Wilson and Lipsey calibrated the meaning of an effect size relative to typical levels of aggressive behavior (2006). Using survey data from the Centers for Disease Control and Prevention as a benchmark, they were able to derive a prevalence rate for population members as if they were given the intervention. They found that whereas 15 percent of adolescents would exhibit unwanted behavior if untreated, given an effect size of 0.21 the expected population rate would be 8 percent if the intervention were implemented.
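The logic of that kind of translation can be sketched with a simple normal model: take the untreated prevalence as a tail area, shift the distribution by the effect size, and read off the new tail area. The snippet below is illustrative only; Wilson and Lipsey's 15 percent and 8 percent figures rest on their own CDC calibration, which this simplified model does not reproduce exactly.

    # Translating an effect size into a change in prevalence under an
    # assumed normal model.
    from statistics import NormalDist

    z = NormalDist()

    base_rate = 0.15     # share exhibiting the behavior if untreated
    effect_size = 0.21   # standardized mean difference for the program

    threshold = z.inv_cdf(1 - base_rate)          # cutoff implied by 15 percent
    treated_rate = 1 - z.cdf(threshold + effect_size)

    print(f"expected prevalence if treated: {treated_rate:.1%}")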
Several other approaches offered by specialists in meta-analysis are available. Cottingham and her colleagues (2005) made a strong case for using natural units wherever possible. That is, they advised against automatically converting results into effect sizes. William Shadish and Keith Haddock provided guidance on aggregating evidence in natural units (1994). A second approach to improving communication of policy-relevant results is through the use of visual presentations (see Light, Singer, and Willett 1994). Funnel and stem-and-leaf plots, figures displaying case-by-case representations of effect sizes and 95 percent confidence intervals, and ordinary plots of the linear, interactive, and other nonlinear effects of moderators can go a long way toward making the results of meta-analysis comprehensible to a wide range of policy makers and practitioners.
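Such displays are straightforward to produce. A minimal sketch of one of them, a case-by-case (forest) plot of effect sizes with 95 percent confidence intervals, is given below with invented study data; it assumes the matplotlib plotting library is available.

    # Minimal forest-plot sketch: one row per study, a point for each
    # effect size, and a horizontal line for its 95 percent CI.
    import numpy as np
    import matplotlib.pyplot as plt

    studies = ["Study A", "Study B", "Study C", "Study D", "Study E"]
    d  = np.array([0.42, 0.15, 0.30, -0.05, 0.22])   # hypothetical effects
    se = np.array([0.10, 0.15, 0.12, 0.18, 0.09])    # standard errors

    y = np.arange(len(studies))[::-1]                # first study at the top
    fig, ax = plt.subplots(figsize=(5, 3))
    ax.errorbar(d, y, xerr=1.96 * se, fmt="s", color="black", capsize=3)
    ax.axvline(0.0, linestyle="--", linewidth=1)     # line of no effect
    ax.set_yticks(y)
    ax.set_yticklabels(studies)
    ax.set_xlabel("Effect size (d) with 95% CI")
    fig.tight_layout()
    plt.show()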
25.8 CONCLUSION

Extensive financial and intellectual resources have been directed toward identifying policies, programs, and practices based on quality evidence. Given the relative newness of interest in evidence-based practices (EBP), it is not surprising that policy-oriented syntheses are only now beginning to be made public. It would appear that the EBP movement has afforded meta-analysis a permanent seat at the policy table.

The current low profile of meta-analysis in traditional resources such as bibliometric databases could be simply because of the relative newness of these forms of inquiry. After all, it was only a little more than fifteen years ago that major efforts to synthesize RCTs in medicine were initiated; the Cochrane Collaboration and the EPPI-Centre at the University of London, to take two examples, were established in the early 1990s.

More important than the volume of policy-oriented syntheses currently available is the development of guidelines and procedures to make them as fair and honorable as possible. Given that politics and personal agendas can influence the quality of policy-oriented syntheses, efforts to minimize their influence have been mounted by major organizations (Campbell Collaboration, Cochrane Collaboration, and WWC) in the form of well-articulated policies and practices. The synthesis methods developed by these organizations are an important step toward creating the much-needed firewall between the science of the syntheses and the political pressures inherent in the policy context.

This chapter offers three essential messages. First is that the manner in which research syntheses are conceptualized, conducted, and reported depends on what we mean by the phrase policy makers and practitioners. In discussions of the potential role of meta-analysis or, more broadly, research synthesis, there is a tendency to not differentiate these two groups. It is as if we are referring to a homogeneous collection of individuals (that is, policymakersandpractitioners). Second is that the goal for all research syntheses should be to minimize the influence of personal and methodological biases. Although we should minimize the influence of bias, analysts also need to maximize the utility of their product by sharply framing their questions in light of contemporary policy issues or debates (that is, optimizing relevance) and crafting their synthesis methodology so that answers can be provided in meaningful and accurate ways. Finally, though it is unlikely that any single policy-oriented research synthesis can satisfy the information needs of all members of a policy shaping community, there is a set of methods that can enhance the utility of a research synthesis. Approaches to enhancing the utility of syntheses across diverse members of the policy shaping community include the segmentation (or differentiation) of results by differences in the quality of evidence available, the use of a set of analytic approaches that include intent-to-treat estimates and follow-up analyses of policy-relevant moderator variables, and the use of policy-meaningful indices of effects (such as natural units of measurement) and common language descriptions of results.

25.9 ACKNOWLEDGMENTS

Paul Morphy's contribution to this chapter was supported by grant number R305B0110 (Experimental Education Research Training, or ExpERT; David S. Cordray, Training Director) from the U.S. Department of Education, Institute of Education Sciences (IES). The opinions expressed in this chapter are our own and do not reflect those of the funding agency.
25.10 NOTES

1. The U.S. Government Accountability Office (GAO, formerly the U.S. General Accounting Office).

2. We recognize that a variety of narrative syntheses of evidence related to policy-oriented topics have been undertaken throughout contemporary history. In this section we focus on the prevalence of quantitative syntheses in policy areas.

3. The Web of Science searches used the following parameters: meta-analysis, (metaanalysis OR meta-analysis); policy, ((policy* OR policies) OR (legislat* OR legal)); meta-analysis and policy, (metaanalysis OR meta-analysis) AND (policy* OR policies). Broader specifications for meta-analysis-like topics that could appear in the literature (for example, quantitative synthesis and research synthesis) did not increase the number of instances that we found. Including the term synthesis brought into the listing such topics as chemical synthesis and was too ill-targeted.

4. The Web of Science may have changed its methods of data capture between early and later years. The figures (for example, journal coverage) from 1990 forward are likely to be more accurate.

5. The Campbell Collaboration was founded under the able leadership of Robert F. Boruch, University of Pennsylvania.

6. An anonymous reviewer noted that some of the policy organizations that we discuss later in the chapter use a limited set of the array of procedures that customarily define a meta-analysis. For example, the What Works Clearinghouse, supported by the Institute of Education Sciences (IES), simply reports the separate effect sizes from studies of educational programs and products that they identify, screen, and rate for methodological quality. These practices are due, in part, to the simple fact that they have found a relatively small number of studies (within program and product categories) that are judged to be adequate to support a causal inference of effectiveness. This does not mean that when more evidence is available the full repertoire of meta-analytic methods will not be employed.

7. In primary research studies the analyses often address questions beyond whether an intervention works. In policy syntheses the goal is to identify quality studies (for example, RCTs) and, if possible, aggregate the effect estimates that are extracted. For example, at this point in its development, the What Works Clearinghouse simply reports effect estimates without any further analysis of between-study differences or subgroup analyses.

8. By way of some background, the majority leader or committees of the U.S. Congress frequently ask the U.S. Government Accountability Office to examine what is known (and not known) about the effects or cost-effectiveness of government programs, policies, and practices. As a research arm of Congress, GAO is organizationally independent of the administrative branch programs it evaluates.

25.11 REFERENCES

Aos, Steve, Roxanne Lieb, Jim Mayfield, Marna Miller, and Annie Pennucci. 2004. Benefits and Costs of Prevention and Early Intervention Programs for Youth. Olympia: Washington State Institute for Public Policy.

Borman, Geoffrey D., Gina M. Hewes, Laura T. Overman, and Shelly Brown. 2003. "Comprehensive School Reform and Achievement: A Meta-Analysis." Review of Educational Research 73(2): 125–230.

Boruch, Robert F. 1997. Randomized Experiments for Planning and Evaluation: A Practical Guide. Thousand Oaks, Calif.: Sage Publications.

Boruch, Robert F., and Frederick Mosteller, eds. 2001. Evidence Matters. Washington, D.C.: Brookings Institution Press.

Camilli, Gregory, Sadako Vargas, and Michele Yurecko. 2003. "Teaching Children to Read: The Fragile Link Between Science and Federal Education Policy." Education Policy Analysis Archives 11(15). http://epaa.asu.edu/epaa/v11n15/

Chall, Jeanne S. 1983. Learning to Read: The Great Debate, 2nd ed. New York: McGraw-Hill.

Chall, Jeanne S. 1992. "The New Reading Debates: Evidence from Science, Art, and Ideology." The Teachers College Record 94(2): 315–28.

Chall, Jeanne S. 2000. The Academic Achievement Challenge: What Really Works in the Classroom. New York: Guilford Press.

Coggins, Christopher. 1994. Statement of the Tobacco Institute. Testimony before the U.S. Congress. Senate Committee on Environment and Public Works. Subcommittee on Clean Air and Nuclear Regulation, 103rd Cong., 2nd sess. May 11, 1994.

Cohen, Jacob. 1988. Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, N.J.: Lawrence Erlbaum.
Cook, Thomas D., Harris Cooper, David S. Cordray, Heidi Hartmann, Larry V. Hedges, Richard J. Light, Thomas A. Louis, and Frederick Mosteller. 1992. Meta-Analysis for Explanation: A Casebook. New York: Russell Sage Foundation.

Cooper, Harris M. 1986. "On the Social Psychology of Using Research Reviews: The Case of Desegregation and Black Achievement." In The Social Psychology of Education, edited by Richard Feldman. Cambridge: Cambridge University Press.

Cooper, Harris M. 1989. Integrative Research: A Guide for Literature Reviews, 2nd ed. Newbury Park, Calif.: Sage Publications.

Cooper, Harris M. 2005. "Reading Between the Lines: Observations on the Report of the National Reading Panel and Its Critics." Phi Delta Kappan 86(6): 456–61.

Cooper, Harris M., and Larry V. Hedges, eds. 1994. The Handbook of Research Synthesis. New York: Russell Sage Foundation.

Cooper, Harris M., and Jeffrey C. Valentine. 2001. "Using Research to Answer Practical Questions About Homework." Educational Psychologist 36(3): 143–54.

Cooper, Harris M., Jorgianne C. Robinson, and Erika A. Patall. 2006. "Does Homework Improve Academic Achievement? A Synthesis of Research 1987–2003." Review of Educational Research 76(1): 1–62.

Cordray, David S. 1990. "The Future of Meta-Analysis: An Assessment from the Policy Perspective." In The Future of Meta-Analysis, edited by Kenneth W. Wachter and Miron L. Straf. New York: Russell Sage Foundation.

Cordray, David S. 1994. "Strengthening Causal Interpretation of Non-Experimental Data: The Role of Meta-Analysis." New Directions in Program Evaluation 60: 59–96.

Cordray, David S., and Robert L. Fischer. 1995. "Evaluation Synthesis." In Handbook of Practical Program Evaluation, edited by Joseph Wholey, Harry Hatry, and Katherine Newcomer. San Francisco, Calif.: Jossey-Bass.

Cottingham, Phoebe, Rebecca Maynard, and Matt Stagner. 2005. "Generating and Using Evidence to Guide Public Policy and Practice: Lessons from the Campbell Test-Bed Project." Journal of Experimental Criminology 1(3): 279–94.

Cronbach, Lee J. 1982. Designing Evaluations of Educational and Social Programs. San Francisco, Calif.: Jossey-Bass.

Environmental Protection Agency. 1986. Risk Assessment Guidelines. 51 Fed. Reg. 33992.

Environmental Protection Agency. 1992. Respiratory Health Effects of Passive Smoking: Lung Cancer and Other Disorders. Washington, D.C.: United States Environmental Protection Agency.

Farland, William, Steven Bayard, and Jennifer Jinot. 1994. "Environmental Tobacco Smoke: A Public Health Conspiracy? A Dissenting View." Journal of Clinical Epidemiology 47(4): 335–37.

Glass, Gene V. 1976. "Primary, Secondary, and Meta-Analysis of Research." Educational Researcher 5(10): 3–8.

Gori, Gio Batta. 1994a. "Science, Policy, and Ethics." The Journal of Clinical Epidemiology 47(4): 325–34.

Gori, Gio Batta. 1994b. "Reply to the Preceding Dissents." The Journal of Clinical Epidemiology 47: 351–53.

Gravelle, Jane G., and Dennis Zimmerman. 1994a. Evaluation of ETS Exposure Studies. Testimony before the U.S. Congress. Senate Committee on Environment and Public Works. Subcommittee on Clean Air and Nuclear Regulation, 103rd Cong., 2nd sess. May 11, 1994.

Gravelle, Jane G., and Dennis Zimmerman. 1994b. Cigarette Taxes to Fund Health Care Reform: An Economic Analysis. Report prepared for the U.S. Congress. Senate Committee on Environment and Public Works. Subcommittee on Clean Air and Nuclear Regulation, 103rd Cong., 2nd sess. March 8, 1994.

Hunt, Morton. 1997. How Science Takes Stock. New York: Russell Sage Foundation.

Hunter, John, and Frank Schmidt. 1990. Methods of Meta-Analysis: Correcting Error and Bias in Research Findings. Newbury Park, Calif.: Sage Publications.

Hunter, John, Frank Schmidt, and Gregory Jackson. 1982. Meta-Analysis: Cumulating Research Findings Across Studies. Beverly Hills, Calif.: Sage Publications.

Jinot, Jennifer, and Steven Bayard. 1994. "Respiratory Health Effects of Passive Smoking: EPA's Weight of Evidence Analysis." The Journal of Clinical Epidemiology 47(4): 339–49.

Landenberger, Nana A., and Mark W. Lipsey. 2005. "The Positive Effects of Cognitive-Behavioral Programs for Offenders: A Meta-Analysis of Factors Associated with Effective Treatment." Journal of Experimental Criminology 1(1): 451–76.

Lehman, Anthony, and David S. Cordray. 1993. "Prevalence of Alcohol, Drug, and Mental Health Disorders Among the Homeless: A Quantitative Synthesis." Contemporary Drug Problems 1993(Fall): 355–83.

Lipsey, Mark W. 1992. "Juvenile Delinquency Treatment: A Meta-Analytic Inquiry into the Variability of Effects." In Meta-Analysis for Explanation: A Casebook, edited by Thomas D. Cook, Harris M. Cooper, David S. Cordray, Heidi Hartmann, Larry V. Hedges, Richard J. Light, Thomas A. Louis, and Frederick Mosteller. New York: Russell Sage Foundation.
Lipsey, Mark W., and David Wilson. 2001. Practical Meta-Analysis. Newbury Park, Calif.: Sage Publications.

Light, Richard J., and David Pillemer. 1984. Summing Up: The Science of Reviewing Research. Cambridge, Mass.: Harvard University Press.

Light, Richard J., and Paul Smith. 1971. "Accumulating Evidence: Procedures for Resolving Contradictions Among Different Studies." Harvard Educational Review 41(4): 429–71.

Light, Richard J., Judith Singer, and John Willett. 1994. "The Visual Presentation and Interpretation of Meta-Analyses." In The Handbook of Research Synthesis, edited by Harris M. Cooper and Larry V. Hedges. New York: Russell Sage Foundation.

Mullen, Edward J. 2004. "Evidence-Based Policy and Practice." Columbia University in the City of New York. http://www.columbia.edu/cu/musher/Internet%20Links%20for%20Bibliography.htm

National Reading Panel. 2000. Report of the National Reading Panel: Teaching Children to Read: An Evidence-Based Assessment of the Scientific Research Literature on Reading and Its Implications for Reading Instruction. Washington, D.C.: National Institute of Child Health and Human Development.

Niemisto, Leena, Eija Kalso, Antti Malmivaara, Seppo Seitsalo, and Heikki Hurri. 2003. "Radio Frequency Denervation for Neck and Back Pain." Cochrane Database of Systematic Reviews 2003(1): CD004058.

Nye, Chad, Herb Turner, and Jamie Schwartz. 2003. "Approaches to Parent Involvement for Improving the Academic Performance of Elementary School-Age Children." In Campbell Collaboration Reviews of Intervention and Policy Evaluations (C2-RIPE). Philadelphia, Pa.: Campbell Collaboration.

Office of Management and Budget. 2004. "Final Information Quality Bulletin for Peer Review." http://www.whitehouse.gov/omb/inforeg/peer2004/peer_bulletin.pdf

Osteen, William L. 1998. Opinion. In Flue-Cured Tobacco Cooperative Stabilization Corporation et ux. v. United States Environmental Protection Agency. 4 F. Supp. 2d 435; 1998 U.S. Dist. LEXIS 10986; 1998 OSHD (CCH) P31,614; 28 ELR 21445. United States District Court for the Middle District of North Carolina.

Pearson, Karl. 1904. "Antityphoid Inoculation." British Medical Journal 2(2294): 1243–246.

Petrosino, Anthony, Carolyn Turpin-Petrosino, and John Buehler. 2003. "Scared Straight and Other Juvenile Awareness Programs for Preventing Juvenile Delinquency: Campbell Review Update." In Campbell Collaboration Reviews of Intervention and Policy Evaluations (C2-RIPE), November 2003. Philadelphia, Pa.: Campbell Collaboration.

Redhead, C. Stephen, and Richard E. Rowberg. 1998. CRS Report for Congress: Environmental Tobacco Smoke and Lung Cancer Risk. Washington, D.C.: Congressional Research Service.

Sackett, David L., William M. C. Rosenberg, J. A. Muir Gray, R. Brian Haynes, and W. Scott Richardson. 1996. "Evidence-Based Medicine: What It Is and What It Isn't." British Medical Journal 312(7023): 71–72.

Schneider, Jeffrey M. 1990. "Research, Meta-Analysis, and Desegregation Policy." In The Future of Meta-Analysis, edited by Kenneth W. Wachter and Miron L. Straf. New York: Russell Sage Foundation.

Shadish, William R., and C. Keith Haddock. 1994. "Combining Estimates of Effect Size." In The Handbook of Research Synthesis, edited by Harris M. Cooper and Larry V. Hedges. New York: Russell Sage Foundation.

Slavin, Robert E. 1986. "Best-Evidence Synthesis: An Alternative to Meta-Analysis and Traditional Reviews." Educational Researcher 15(9): 5–11.

Slavin, Robert E. 2002. "Evidence-Based Education Policies: Transforming Educational Practice and Research." Educational Researcher 31(7): 15–21.

U.S. General Accounting Office. 1984. WIC Evaluations Provide Some Favorable but No Conclusive Evidence on Effects Expected for the Special Supplemental Program for Women, Infants, and Children. GAO/PEMD-84-4. Washington, D.C.: U.S. Government Printing Office.

U.S. General Accounting Office. 1986. Housing Allowances: An Assessment of Program Participation and Effects. GAO/PEMD-86-5. Washington, D.C.: U.S. Government Printing Office.

U.S. General Accounting Office. 1987. Drinking Age Laws: An Evaluation Synthesis of Their Impact on Highway Safety. GAO/PEMD-87-10. Washington, D.C.: U.S. Government Printing Office.

U.S. General Accounting Office. 1990. Prospective Evaluation Synthesis. Washington, D.C.: U.S. Government Printing Office.

U.S. General Accounting Office. 1992. The Evaluation Synthesis. GAO/PEMD-10.1.2. Washington, D.C.: U.S. Government Printing Office.

Visher, Christy A., Laura Winterfield, and Mark B. Coggeshall. 2006. "Systematic Review of Non-Custodial Employment Programs: Impact on Recidivism Rates of Ex-Offenders." In Campbell Collaboration Reviews of Intervention and Policy Evaluations (C2-RIPE), February 2006. Philadelphia, Pa.: Campbell Collaboration.

Wachter, Kenneth W., and Miron L. Straf, eds. 1990. The Future of Meta-Analysis. New York: Russell Sage Foundation.

Whitehurst, Grover. 2003. "The Institute of Education Sciences: New Wine, New Bottles." Paper presented at the Annual Meeting of the American Education Research Association, Chicago (April 21–25, 2003).
Wilson, Sandra J., and Mark W. Lipsey. 2006. "The Effects of School-Based Social Information Processing Interventions on Aggressive Behavior (Part 1): Universal Programs." In Campbell Collaboration Reviews of Intervention and Policy Evaluations (C2-RIPE), March 2006. Philadelphia, Pa.: Campbell Collaboration.

Wolf, Fredric M. 1990. "Methodological Observations on Bias." In The Future of Meta-Analysis, edited by Kenneth W. Wachter and Miron L. Straf. New York: Russell Sage Foundation.
26

VISUAL AND NARRATIVE INTERPRETATION

GEOFFREY D. BORMAN
University of Wisconsin, Madison

JEFFREY A. GRIGG
University of Wisconsin, Madison

CONTENTS
26.1 Introduction 498
26.2 Displaying Effect Sizes 498
26.2.1 Current Practices 498
26.2.2 Presenting Data in Tables 500
26.2.3 Using Graphs 501
26.2.3.1 Assessing Publication Bias 501
26.2.3.2 Representing Effect-Size Distributions 502
26.2.3.3 Representing Systematic Variation in Effect Size 503
26.2.3.4 The Forest Plot 505
26.2.4 Visual Display Summary 510
26.3 Integrating Meta-Analytic and Narrative Approaches 511
26.3.1 Strengths and Weaknesses 511
26.3.2 Reasons for Combining Meta-Analytic and Narrative Methods 511
26.3.2.1 Describing Results 512
26.3.2.2 Providing Context 513
26.3.2.3 Interpreting the Magnitude of Effect Sizes 515
26.3.3 Integrating Approaches Summary 516
26.4 Conclusion 516
26.5 Notes 517
26.6 References 517

26.1 INTRODUCTION

Meta-analyses in the social sciences and education often provide policy makers, practitioners, and researchers with critical new information that summarizes the central quantitative findings from a particular research literature. However, meta-analytic articles and reports are also rather technical pieces of work that can leave many readers without a clear sense of the results and their implications. In this respect, both visual and narrative interpretations of research syntheses are important considerations for presenting and describing complex results in more informative and accessible ways.

Typically, readers of meta-analyses in the social sciences and education are offered two types of summary tables to help them make sense of the results. First, a table presents the overall summary effect size for the entire meta-analysis. This table may also include mean effect sizes for subgroups of studies defined by important substantive or methodological characteristics of the included studies (for instance, whether the study employed an experimental or nonexperimental design). Reports and articles often present a second table that includes a lengthy catalog of all studies included in the meta-analysis, along with such information as their publication dates, the authors' names, sample sizes, effect sizes, and, perhaps, differences among the studies along key substantive or methodological dimensions such as the ages of participants or the nature of the control group employed. Usually the material in these sorts of tables is not arrayed in a particularly informative order but instead simply sorted chronologically by publication year or alphabetically by the authors' last names. As we show in this chapter, however, the situation is improving.

Such tables can be useful to describe the individual data elements (the outcomes gleaned from each of the studies included in the meta-analysis) and can help present information about the overall effects observed and how those effects might vary by interesting methodological and substantive dimensions. However, we assert that data from meta-analyses can be presented in ways that facilitate understanding of the outcomes and that communicate more information than those conventionally employed in the literature. Rather than limiting presentation and interpretation of meta-analytic data to simple tables and descriptive text, we argue for greater consideration of figures and displays.

This chapter also examines how the careful integration of quantitative and narrative approaches to research synthesis can enhance the interpretation of meta-analyses. We outline three key circumstances under which one should consider applying narrative techniques to quantitative reviews. First, some of the earliest critics of meta-analysis suggested that the methodology applied a mechanistic approach to research synthesis that sacrificed most of the information contributed by the included studies and glossed over salient methodological and substantive differences (Eysenck 1978; Slavin 1984). We believe that many contemporary applications of meta-analysis have overcome these past criticisms through more rigorous analyses of effect-size variability and modeling of the substantive and methodological moderators that might explain variability in the observed outcomes. However, we argue that integration of narrative forms of synthesis can help further clarify and describe a variegated research literature. Second, there may be evidence that some literature, though of general interest and importance, is not amenable to meta-analysis. The typical meta-analysis would simply exclude this literature and focus only on the results that can be summarized using quantitative techniques. We believe, though, that such a situation may call for blending meta-analytic and narrative synthesis so that a more comprehensive body of research can be considered. Finally, beyond descriptive data and significance tests of the effect sizes, we contend that researchers should also use narrative techniques for interpreting effect sizes, primarily in terms of understanding the practical or policy importance of their magnitudes. In these ways and others, research syntheses that offer the best of both meta-analysis and narrative review offer great promise.

26.2 DISPLAYING EFFECT SIZES

26.2.1 Current Practices

To address the question of how the results of meta-analyses are commonly reported in the social sciences and education, we examined all meta-analyses published in the 2000 through 2005 issues of Psychological Bulletin and Review of Educational Research. We hoped to determine both how the presentation of findings may have changed since the last edition of this handbook as well as how the presentation of findings may differ between these two journals. Table 26.1 displays the results of this examination and compares them to the previous findings reported by Richard Light, Judith Singer, and John Willett (1994). The first and second columns of the table report the number of meta-analyses published in a given journal and year, columns 3 and 4 summarize the use of tables in the articles, columns 5 through 8 report the types of graphs

Table 26.1 Types of Tabular and Graphic Summaries Used in All Meta-Analyses

                                          Table                Figure
                               Meta-      Raw      Summary    Any     Effect-  Funnel  Other   Represents   Represents
Volume                         Analyses   Effect   Effect     Graph   Size     Plot    Graphs  Uncertainty  Summary
                               (n)        Size     Size               Dist.                                 Effect

Psychological Bulletin
  1985–1991                    74         41       57         14      8        1       8       NA           NA
  Percentage of Total                     55%      77%        19%     11%      1%      11%
Psychological Bulletin
  2000                         11         7        10         6       2        1       5       2            1
  2001                         5          3        4
  2002                         11         9        11         7       6                3       1
  2003                         14         10       11         9       6        2       1       1
  2004                         9          9        8          5       3        4       2       1            1
  2005                         10         5        9          4       2        1       4       2            1
  2000–2005                    60         43       53         31      19       8       15      7            3
  Percentage of Total                     72%      88%        52%     32%      13%     25%     12%          5%
Review of Educational Research
  2000                         2          2        2          1
  2001                         4          2        4          2       2                1
  2002                         2          2        1          2       2                1
  2003                         3          3        3          2       2
  2004                         4          3        4          3       3        1       1                    1
  2005                         5          4        4          3       1        1       1
  2000–2005                    20         16       18         13      10       2       4       0            1
  Percentage of Total                     80%      90%        65%     50%      10%     20%     0%           5%

source: Authors' compilation (Light, Singer, and Willett 1994).

The first and second columns of the table report the number of meta-analyses published in a given journal and year, columns 3 and 4 summarize the use of tables in the articles, columns 5 through 8 report the types of graphs used in the articles, and columns 9 and 10 indicate how the graph was used.

Our examination revealed a number of relevant observations. As one might expect from two respected journals focusing on publishing research syntheses in the social sciences and education, meta-analyses remain prominent in Psychological Bulletin and appear consistently in Review of Educational Research. Compared to the results from Psychological Bulletin articles published from 1985 through 1991, the more recently published articles from 2000 through 2005 have presented data through the use of a somewhat different assortment of figures. Further, in comparing the 2000 through 2005 articles published in Psychological Bulletin and Review of Educational Research, we see modest differences between the two journals.

Using tables to present summary findings remains the most common method for presenting results, and the 2000 through 2005 totals show that this is becoming a nearly universal practice. Offering readers a comprehensive table that tallies the effect sizes gleaned from each of the studies included in the meta-analysis has also become a common practice. These tables are presented alphabetically by author name more often than not. Some, though, use other organization schemes, including sorting by the year in which the included studies were published, by the magnitude of the effect size reported, by methodological features of the studies, or by a key substantive variable.

As with tables, the use of graphs and figures has increased since the late 1980s and early 1990s. Only 19 percent (fourteen of seventy-four) of the Psychological Bulletin articles from 1985 through 1991 used graphs of any kind to present findings, versus 52 percent of Psychological Bulletin articles from 2000 through 2005. At 65 percent (thirteen of twenty), the proportion of articles including graphs or figures from the Review of Educational Research, though from a smaller sample, was even higher.

As the results in table 26.1 suggest, graphs have been used for a number of purposes: representing the distribution of effect sizes across the examined studies by using a stem-and-leaf plot or a histogram; showing the relationship between effect size and sample size, often with a funnel plot; or displaying the relationship between effect size and a substantive or methodological characteristic of the examined studies, customarily with a line graph.¹ These types of displays are used more frequently in recent than in past issues of Psychological Bulletin, as reviewed by Richard Light and his colleagues (1994), and are reported with considerable prevalence in both of the journals.

Graphs and figures can also be used to represent the degree of uncertainty of the reported effect sizes, generally with 95 percent confidence intervals, or to represent an overall summary of the effects reported. As the results in the ninth column of table 26.1 suggest, some articles used graphics to represent uncertainty in estimates, though this practice was noted only in the articles published in Psychological Bulletin. We were somewhat surprised to observe that few articles in either journal took advantage of the opportunity to visually represent the summary effect-size estimate.

From the data reported in table 26.1, we suggest that graphic displays of some meta-analytic data have become more common and consistent in the social sciences, but that authors do not routinely take full advantage of the potential offered by graphic representation. Tables and graphics can be used to represent the findings in powerful ways and to test the assumptions of a study, as we hope to illustrate in the following sections.

26.2.2 Presenting Data in Tables

We concur with Light and his colleagues on the value of presenting raw data to readers and of ordering the data in a substantively informative fashion (1994). Our survey of the published meta-analyses in Psychological Bulletin and Review of Educational Research suggested that the statement that "very few [tables] are ordered with the intention of conveying a substantive message" may not be as accurate as it once was, at least in the field of psychology (Light, Singer, and Willett 1994, 443). Of the sixty published meta-analyses we examined in Psychological Bulletin, twenty-nine included tables with effect sizes listed alphabetically by the first author's last name (three articles listed the effect sizes chronologically, by year of publication). An alphabetic sort can make it easier for readers to find a specific study included in the synthesis, or to identify ones that were omitted. Twenty of the articles went beyond this typical strategy to include tables sorted by a substantive or methodological characteristic. In fact, three articles in Psychological Bulletin included tables ordered by two measures. For example, Elizabeth Gershoff, in a meta-analysis of corporal punishment and child behavior, first divided the included studies into those that measured the outcome in childhood and those that measured the outcome in adulthood (2002). Within each of these categories, the studies were grouped by child construct, and only then listed alphabetically by author surname. Providing this structure to the accounting of the studies included in the review gives the reader a sense of the research terrain and reveals the analyst's orientation to the data. To easily seek out a particular study, one need only consult the list of references. To communicate multiple classifications to readers, analysts may also consider coding the included studies for their meaningful characteristics and presenting the codes as columnar data in the table.

Similar to our discovery that published articles in Psychological Bulletin were more likely than those in Review of Educational Research to include graphs that represented uncertainty, our examination of the two journals suggested that the tables in Review of Educational Research were ordered more conventionally, that is, alphabetically by author last name or chronologically by publication year. Thirteen of the articles sorted the studies by the authors' names and one article sorted the studies by year. Only three, however, used any other kind of sorting mechanism, such as the effect size or the substantive characteristics of the studies. As meta-analyses become increasingly ambitious and include longer tables or appendices of included studies, it would behoove authors to organize these data in meaningful and more accessible ways.

Effective tables provide two core functions in meta-analytic studies. First, they account for the raw data from the original studies examined by the authors, and offer the potential to communicate substantive differences between the studies in addition to the effect size gleaned from each one. Such presentations help readers recognize the origins of each data point presented in the meta-analysis, provide an account of the coding assumptions made by the analyst, and can facilitate replications of the synthesis in the future. Well-crafted tables can also describe overall effects and can document uncertainty through presentation of the standard errors and the 95 percent confidence intervals of the effect sizes. Tables that provide such information are fairly well represented in published research syntheses, and are far more prevalent than the graphic displays, which we turn to now.

26.2.3 Using Graphs

In The Commercial and Political Atlas, William Playfair asserted that from a chart "as much information may be obtained in five minutes as would require whole days to imprint on the memory, in a lasting manner, by a table of figures" (1801/2005, xii). We believe that Playfair's claims that graphics can serve as powerful tools for the transmission and retention of information are as true today as they were two centuries ago. Graphics are used for a number of purposes within the context of meta-analysis, including testing assumptions and serving as diagnostic tools, representing the distribution of effect sizes, and illustrating sources of variation in effect size data. Graphs can also represent summary effect sizes and their uncertainty, although this purpose appears underutilized. Different types of graphs appear to be better suited for particular purposes, but some display types emergent in the social sciences and education literature simultaneously attempt to achieve a combination of goals.

26.2.3.1 Assessing Publication Bias. Graphs can be used to diagnose whether an analysis complies with the assumptions on which it is based, including, most notably, that the effect sizes are normally distributed and are not subject to publication or eligibility bias.

If published reports do not provide enough information to derive an effect size and are omitted from a synthesis, or if unpublished reports are not adequately represented, then the meta-analytic results may be biased. Funnel plots of the effect sizes of the included studies are frequently used to evaluate subjectively whether the effect sizes from the included studies are, as Light and Pillemer put it, "well behaved" (1984).

By plotting the effect sizes of the studies against the sample size, standard error, or precision of the included studies, one can determine the relationship between the effect-size estimates and the presumed underlying parameter. If bias is not present, the plot takes the form of a funnel centered on the mean or summary effect size. If bias is present, then the plot is distorted, either by assuming an asymmetrical pattern or by having a void. Some statistical tests for detecting funnel plot asymmetry are available, namely Matthias Egger's linear regression test and Colin Begg's rank correlation test, but these have low statistical power and are rarely used in syntheses (Egger et al. 1997; Begg and Mazumdar 1994). Another similar statistical method, the trim-and-fill procedure, can be used as a sensitivity analysis to correct for model-identified bias. As illustrated in Christopher Moyer, James Rounds, and James Hannum's study, this method can generate a new summary effect size with a confidence interval to reflect possibly missing studies (2004; for a discussion of methods for detecting publication bias, see chapter 23, this volume).
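Readers who wish to construct such a display themselves need only the per-study effect sizes and sample sizes (or standard errors). The following is a minimal sketch in Python, using the numpy and matplotlib libraries; all of the values are simulated rather than drawn from any actual synthesis, and the dashed vertical line marks the inverse-variance weighted mean, standing in for the summary-effect marker discussed below.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    n = rng.integers(20, 400, size=80)           # hypothetical per-study sample sizes
    se = 2.0 / np.sqrt(n)                        # rough SE of a standardized difference
    d = rng.normal(0.3, se)                      # observed effects scattered around 0.3

    mean_d = np.average(d, weights=1.0 / se**2)  # inverse-variance weighted mean

    plt.scatter(d, n, s=12)
    plt.axvline(mean_d, linestyle="--")          # summary-effect line
    plt.xlabel("Effect size")
    plt.ylabel("Sample size")
    plt.show()

With unbiased, homogeneous data the points should spread widely at small sample sizes and narrow toward the weighted mean as sample size grows, producing the funnel shape described above.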

Dolores Albarracín and her colleagues conducted a meta-analysis of 194 studies of HIV-prevention interventions and included a funnel plot that we reproduce as figure 26.1 (2005). First we consider the orientation of the graph's axes. Funnel plots can be arranged in a number of ways, but in this case the sample size is on the Y-axis with the largest value at the top, though some authors choose to rank the sample sizes in reverse order, allowing for larger studies (those with the greatest weight and presumably more reliable estimates) to be closest to the effect size values represented on the x-axis. Other authors choose to put the sample size on the x-axis, resulting in a horizontal funnel, though a vertical funnel is conventionally preferred. In any case, the scale (particularly when sample size is reported instead of the standard error of the estimate) sometimes does not gracefully include the largest studies, as is the case here. This problem can be addressed by omitting the observations that do not fit, as these authors did, or by representing a break in the scale, but one should consider Jonathan Sterne and Matthias Egger's suggestion that the standard error has optimal properties as the scale representing the size of the study (2001). Albarracín and her colleagues also provided an indication of the overall summary effect size, in the form of a vertical line marked M (2005). This simple addition to the figure clearly presents the summary effect size, yet was not common in the articles we reviewed. The plot shown in figure 26.1 suggests that, despite some evidence of a positive skew, these data are indeed "well behaved."

Figure 26.1 Funnel Plot of Effect Size Against Sample Size
source: Albarracín et al. 2005.
note: Funnel plot. Two effects with extremely large sample sizes were excluded to make the shape of the plot more apparent. These large sample groups had average effect sizes.

If the subjective evaluation of the shape of a funnel plot does not seem rigorous enough, a normal quantile plot may also be used to represent potential bias (Wang and Bushman 1998). Tarcan Kumkale and Dolores Albarracín conducted a meta-analysis of the long-term change of persuasion (a sleeper effect) of individuals who receive information accompanied by a cue that the source may not be credible (a discounting cue) (2004). Their analysis included a number of exemplary figures, including the normal quantile plot we reproduce here as figure 26.2. The observed standard effect-size estimate is plotted along one axis against the expected values of the effect sizes had they been drawn from a normal distribution. The deviation from the normal distribution is represented as the distance of the point from the straight diagonal line, which is surrounded by two lines that represent the 95 percent confidence interval. If publication bias is present, the points representing the effect-size estimates will deviate from the normal distribution, will not follow a straight line, and will fall outside of the confidence interval (Wang and Bushman 1998). Although the figure might be improved by presenting both axes on the same range and scale, figure 26.2 suggests that the meta-analysis successfully captured the existing research on their targeted construct in a relatively unbiased fashion.

Figure 26.2 Normal Quantile Plot Used to Evaluate Publication Bias
source: Kumkale and Albarracín 2004.
note: Normal quantile plot of the effect sizes representing the magnitude of change in persuasion in discounting-cue conditions from the immediate to the delayed posttest.
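The construction of such a plot is equally mechanical. The sketch below, again in Python, relies on scipy's probplot to supply the expected normal order statistics for a set of simulated standardized effect estimates; the 95 percent confidence envelope described above would require additional computation and is omitted here for brevity.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(2)
    z = rng.normal(0.2, 1.0, size=60)   # hypothetical standardized effect estimates

    # probplot returns the expected normal order statistics (osm), the ordered
    # data (osr), and a least-squares reference line through the points
    (osm, osr), (slope, intercept, r) = stats.probplot(z, dist="norm")

    plt.plot(osm, osr, "o", markersize=4)
    plt.plot(osm, slope * osm + intercept)   # the straight diagonal reference line
    plt.xlabel("Expected normal value")
    plt.ylabel("Standardized effect size estimate")
    plt.show()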

26.2.3.2 Representing Effect-Size Distributions. In addition to testifying to the presence or absence of publication bias, graphs can efficiently represent the distribution of effect sizes, particularly the central tendency, spread, and shape of the effects. The principal tools to achieve this are stem-and-leaf plots and box or schematic plots. In a stem-and-leaf plot, the first decimal place of the effect size serves as the stem and the second decimal place appears as the leaf. In a box plot or schematic plot, the median effect size is the middle of the box, the upper and lower quartiles are the ends of the box, the maximum and minimum effect sizes are the whiskers, and outliers are represented by single data points placed outside the whiskers (for more on these two modes of representation, see Tukey 1977, 1988).

Our examination of recent meta-analyses revealed that stem-and-leaf plots have become a popular method for presenting data. Twenty-nine percent of the articles in Psychological Bulletin and 30 percent of the articles in Review of Educational Research used stem-and-leaf displays. A few of these plots used more than one set of leaves on a given stem to show how the distributions vary by notable characteristics of the studies. Linnea Ehri and her colleagues, for instance, conducted a meta-analysis of thirty-eight phonics reading instruction experiments and presented the pool of estimates comprising the analysis as three side-by-side stem-and-leaf plots (2001), which we present as figure 26.3. The three separate leaves off the central stem in this figure correspond to differences among the studies in terms of the timing of the posttest after the phonics intervention. By virtue of the method of presentation, it is indeed apparent that most of the instruction experiments lasted less than one year and were associated with heterogeneous effects, that longer exposure to instruction was associated with more consistently positive effects, and that the follow-up estimates provide "evidence of lasting effects" (Ehri et al. 2001, 414). The authors then examined potential moderating variables to account for the heterogeneity in short-term effects.

Figure 26.3 Stem-and-Leaf Plot of Effect Sizes of Systematic Phonics Instruction on Reading by Timing of Posttest
source: Ehri et al. 2001. Reprinted by permission of SAGE Publications.
note: Stem-and-leaf plot showing the distribution of mean effect sizes of systematic phonics instruction on reading measured at the end of instruction and following a delay. The first leaf column covers the end of instruction (or the end of year 1 when instruction lasted longer), the second covers the end of instruction lasting between two and four years, and the third covers follow-up tests administered four to twelve months after instruction ended.

Richard Light and his colleagues had urged meta-analysts to consider using stem-and-leaf and schematic plots to present the distribution of effect sizes (1994). To some extent, this advice has been followed, as stem-and-leaf plots have become more prevalent in recent published syntheses. Schematic plots, on the other hand, are still quite uncommon; we observed only one in the eighty articles we examined.
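Because a stem-and-leaf display is purely textual, it can be produced with a few lines of code. The routine below is a bare-bones Python sketch of the convention just described (first decimal place as stem, second as leaf); the input effect sizes are invented for illustration and are not the Ehri data.

    from collections import defaultdict

    def stem_and_leaf(effects):
        stems = defaultdict(list)
        for d in effects:
            stem = int(d * 10)                    # sign and first decimal place
            leaf = abs(int(round(d * 100))) % 10  # second decimal place
            stems[stem].append(leaf)
        for stem in sorted(stems, reverse=True):  # largest stems on top
            leaves = "".join(str(leaf) for leaf in sorted(stems[stem]))
            print(f"{stem / 10:5.1f} | {leaves}")

    stem_and_leaf([0.71, 0.62, 0.60, 0.55, 0.38, 0.33, 0.21, 0.04, -0.12, -0.33])

Side-by-side leaves of the kind Ehri and her colleagues used would simply repeat the leaf column once per subgroup of studies.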

26.2.3.3 Representing Systematic Variation in Effect Size. An emerging and effective practice is to use graphs to illustrate how effect sizes vary according to substantive issues or methodological features of the studies constituting the meta-analysis. These findings are characteristically presented as summary tables, but can also be effectively presented graphically. Parallel schematic plots, bar graphs, and scatterplots can be used. The choice of format depends to a great extent on the type of variable that is the source of variation in the outcome.

To present the influence of a categorical variable, a parallel schematic plot (also known as a box and whisker plot) or a histogram can be used. A meta-analysis by Eric Turkheimer and Mary Waldron of the research literature on nonshared environment, which offers one possible explanation for how siblings can have remarkably different outcomes from one another, presented the effect sizes from the primary studies in a number of parallel schematic plots, including one organized by the source of the data: observation, the parent, the child, or an aggregate of measures (2000). The graph, which we present here as figure 26.4, reveals both the variation in effect sizes by reporter and, by virtue of a single line connecting the median points, the difference in medians across types. This figure shows that correlations using observer reports are high and variable, that parental estimates are lower, that estimates based on child reports were higher, and that studies using reports of a child's environment from more than one source revealed correlations that were low but the least variable of all.

Figure 26.4 Parallel Schematic Plot of Median Effect Size by Environmental Reporter
source: Turkheimer and Waldron 2000.
note: Median effect size by environmental reporter. AGG indicates aggregate of more than one reporter. (See figure 26.5 caption for explanation of box and whisker plots.)

Bar charts or histograms can also be effectively used to illustrate systematic sources of variation, as in figure 26.5. Sally Dickerson and Margaret Kemeny, in an examination of cortisol responses to stressors, presented their findings in a diagram, which is organized simultaneously by two measures: the type of outcome that was measured and the time lag between the stressor and the outcome measurement (2004). The erosion of the effect over time is immediately evident, as are the varying effect sizes graphed within each of the three categories of poststressor lag. Across all three categories of poststressor lag, when tasks combine both social-evaluative threat and uncontrollability, the cortisol level is considerably higher than when either or neither condition is present. The error bars and the asterisks indicating, respectively, uncertainty and the statistical significance of the findings offer an added degree of sophistication and clarity to the figure with the addition of a few simple visual elements.

Figure 26.5 Mean Cortisol Effect Size for Performance Tasks with Both Social-Evaluative Threat (SET) and Uncontrollability (UC), Either SET or UC, and Neither by Timing of Poststressor Measurement
source: Dickerson and Kemeny 2004. Note the effective use of shading as well as the error bars and p-value indicators.
note: Mean (±SEM) cortisol effect size (d) for performance tasks with both social-evaluative threat (SET) and uncontrollability (UC), performance tasks with either SET or UC, and performance tasks without either component or passive tasks (Neither) during intervals 0–20, 21–40, and 41–60 min poststressor. *p < .05. ***p < .001.

The preceding examples consider categorical substantive variables, but the influence of continuous moderator variables can also be represented graphically, as in figure 26.6. In a meta-analysis of racial comparisons of self-esteem, Bernadette Gray-Little and Adam Hafdahl presented this scatterplot of the effect size against the average age in years of the participants in the studies (2000). The varying sizes of the points show the relative weights of the studies, and the linear regression line reveals the trend between effect size and age. This graph efficiently communicates to the reader that the majority of studies (and especially those granted the most weight in the analysis) included children between the ages of ten and eighteen, that both positive and negative effects were observed, and that more positive effect sizes were reported in studies with older participants.

Figure 26.6 Scatterplot of Effect Size by Average Participant Age in Study
source: Gray-Little and Hafdahl 2000.
note: Scatterplot of Hedges's d against average participant age in the study. Effect size (ES) weights are restricted to 700 or less. Symbol area is proportional to ES weight. Dotted horizontal line represents no effect (d = 0.0). Solid line represents weighted linear regression equation for all independent ESs with available data (k = 257).

Some more recently developed graphical techniques endeavor to accomplish a number of purposes at once. Two notable formats are the radial plot and the forest plot, which has sometimes been referred to as a confidence interval plot (Galbraith 1988a, 1988b, 1994).² The radial plot is a highly technical figure that can sometimes pose a challenge to the reader, but its virtue and application to meta-analysis is that it explicitly compares estimates that have different precisions. Two essential pieces of information are needed to create a radial plot: each estimate's precision, which is calculated as the inverse or reciprocal of its standard error (xᵢ = 1/σᵢ), and the standardized effect estimate (yᵢ = zᵢ/σᵢ). These data are plotted against one another to show how the standardized deviated estimates vary according to their precision, with the more precise estimates placed further from the origin. The radial element is introduced to recover the original zᵢ estimates by extrapolating a line from the origin through the (xᵢ, yᵢ) point to the circular scale (Galbraith 1988a). Most earlier applications use a radial plot to summarize odds-ratio results from clinical trials (see Galbraith 1988b, 1994; Whitehead and Whitehead 1991; Schwarzer, Antes, and Schumacher 2002).³

Rex Galbraith applied a radial plot to meta-analytical data, which we show here as figure 26.7, and which is in many ways a remarkable figure (1994). First note that the studies included in the meta-analysis are visually coded in two ways: by the numbers identifying the individual studies at the top and by the substantive characteristics (drug versus diet and primary versus secondary) represented by the shape and coloring of the icons. This coding scheme, though shown in the context of a radial plot, would be beneficial in any form of visual representation. The data are presented in the middle of the figure with the inverse of the standard error as the x-axis and the standardized estimate as the y-axis; a two-unit change in the standardized estimate is equivalent to the 95 percent confidence interval. The data are odds ratios for fatal coronary disease; negative estimates or ratios less than one are associated with reduced mortality. Two studies, identified as eleven and twenty-one, report significant reductions in mortality, but their precision is on the lower end of the distribution. One can see that though most of the diet studies report reduced mortality, none are especially precise, and that the drug studies farther to the right are more equivocal. One can recover the original odds ratio estimates from the circular scale to the right, but the strength of this method is to represent the relative weights and values of the estimates and to distinguish the amount of variation in the sample.

Figure 26.7 Radial Plot of Fatal Coronary Heart Disease Odds Ratios
source: Galbraith 1994. Reprinted with permission from the Journal of the American Statistical Association. Copyright 1994. All rights reserved.
note: A radial plot of fatal coronary heart disease odds ratios (treated/control) for 22 cholesterol-lowering trials using zᵢ and σᵢ given by equation (14). Data from Ravnskov (1992, table 1).

The radial plot is a powerful way to represent data and is favored among some statisticians, but for the uninitiated it is not as immediately evident in the same way that other graphic representations can be (see, for example, Whitehead and Whitehead 1991). Furthermore, though the radial plot effectively represents variation in estimates and their relation to their standard errors, this emphasis seems to come at the expense of clearly representing a summary measure; for example, the absolute efficacy of the cholesterol-lowering drugs is not evident in figure 26.7. Another form of representation, known generally as the forest plot, combines images with text to present meta-analytic information in a more accessible form, though we grant that the variation in standard errors is not presented as precisely as in a radial plot.⁴
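The quantities behind a radial plot are simple to compute even if the full display, with its circular scale, takes more work. The following sketch plots simulated log odds ratio estimates using the definitions above (precision xᵢ = 1/σᵢ against the standardized estimate yᵢ = zᵢ/σᵢ), with horizontal lines at plus and minus two marking estimates that differ reliably from zero; the circular scale itself is omitted.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(4)
    se = rng.uniform(0.1, 0.5, size=22)  # hypothetical SEs of log odds ratios
    z = rng.normal(-0.2, se)             # hypothetical log odds ratio estimates

    x = 1.0 / se                         # precision, x_i = 1/sigma_i
    y = z / se                           # standardized estimate, y_i = z_i/sigma_i

    plt.scatter(x, y, s=15)
    plt.axhline(2.0, linestyle=":")      # points beyond +/-2 differ from zero
    plt.axhline(-2.0, linestyle=":")
    plt.xlabel("Precision (1/se)")
    plt.ylabel("Standardized estimate")
    plt.show()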


26.2.3.4 The Forest Plot. If sheer frequency of use is a suitable measure of acceptance, then the display known as the forest plot is the new standard for the presentation of meta-analytical findings. The large volume of existing forest plots is related in large part to its widespread adoption in medical research, particularly as represented in the Cochrane Collaboration's Cochrane Library, which has published thousands of syntheses of medical trials that employ forest plots.⁵ Although the origin of the term forest plot is uncertain (Lewis and Clarke 2001), J. A. Freiman and colleagues were the first to display the point estimates of studies with horizontal lines to represent the confidence interval (1978). Lewis and Ellis introduced the summary effect measure at the bottom of the plot for use in a meta-analysis (1982; Lewis and Clarke 2001).

The forest plot has not yet been widely adopted in the social sciences, however. The Campbell Collaboration, the emerging social sciences counterpart to the Cochrane Collaboration, frequently includes syntheses with forest plots, but its collection does not universally use the practice.⁶ Notably, not one of the meta-analyses published from 2000 to 2005 in either Psychological Bulletin or Review of Educational Research included a forest plot. Although it was not called a forest plot at the time, Light and his colleagues (1994) offered such a display by Salim Yusuf and his colleagues (1985) as an exemplary method to present the findings from a meta-analysis, and encouraged its application in the social sciences. We hope to reinforce that suggestion.

A forest plot combines text with graphics to display both the point estimates with confidence intervals as well as an estimate of the overall summary effect size. As a result, it simultaneously represents uncertainty and the summary effect and indicates the extent to which each study contributes to the overall result. The forest plot follows a number of conventions. The point estimate of each study is represented as a square or a circle, with its size often varying in proportion to the weight the individual study is accorded in the analysis. At the bottom of the figure, a diamond represents the overall summary effect size. The center of the diamond shows the point estimate of the summary effect size, and its width represents the limits of its 95 percent confidence interval. A vertical line extending through the figure can represent the point estimate of the summary effect size or the point of a null effect. In either case, one can compare how the effect sizes across the studies in the meta-analysis are distributed about the point of reference. A plotted effect size is statistically different from the overall or null effect if the confidence interval representing it does not intersect the vertical line in the figure.
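Although the figures that follow were produced with specialized software, the conventions just described can be sketched with general-purpose plotting tools. The Python fragment below draws weight-sized squares, 95 percent confidence interval bars, a vertical line at the summary effect, and a summary marker for a set of invented studies; it is a minimal illustration, not a substitute for dedicated meta-analysis software.

    import numpy as np
    import matplotlib.pyplot as plt

    names = [f"Study {i + 1}" for i in range(8)]     # hypothetical studies
    d = np.array([0.42, 0.10, -0.05, 0.31, 0.22, 0.55, -0.12, 0.18])
    se = np.array([0.20, 0.12, 0.25, 0.15, 0.10, 0.30, 0.18, 0.08])

    w = 1.0 / se**2                                  # inverse-variance weights
    summary = np.sum(w * d) / np.sum(w)
    summary_se = 1.0 / np.sqrt(np.sum(w))

    ys = np.arange(len(d), 0, -1)                    # rows from top to bottom
    plt.errorbar(d, ys, xerr=1.96 * se, fmt="none")  # 95 percent CI bars
    plt.scatter(d, ys, s=w, marker="s")              # squares sized by weight
    plt.errorbar(summary, 0, xerr=1.96 * summary_se, fmt="D")  # summary row
    plt.axvline(summary, linestyle="--")             # line at the summary effect
    plt.yticks(list(ys) + [0], names + ["Combined"])
    plt.xlabel("Effect size")
    plt.show()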


Figures 26.8 and 26.9 show how the data from the teacher expectancy meta-analysis used in this volume (see appendix) can be displayed using a forest plot. Figure 26.8 shows a forest plot we created using the user-written metagraph command in Stata.⁷ Note that the figure includes the name of each study, a representation of the point estimate, its 95 percent confidence interval, and a dotted vertical line at the overall combined mean effect estimate. In a single figure, one can present the summary effect size, its uncertainty, and its statistical significance, the distribution of effects across the included studies, the relative contribution of each study to the overall effect estimate, and the degree of uncertainty for each of the individual effect estimates from the studies. The size of the squares represents the weight of each effect estimate, which tends to counteract the visual attention commanded by the estimates with large confidence intervals. In this case, the inverse variance weights are used.

Figure 26.8 Forest Plot of Effect Sizes Derived from Eighteen Studies of Teacher Expectancy Effects on Pupil IQ
source: Authors' compilation.
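The inverse-variance weighting mentioned above reduces to a few lines of arithmetic. The sketch below computes fixed-effect weights, the weighted mean, and its confidence interval; the effect sizes are invented, not the teacher-expectancy values themselves.

    import numpy as np

    d = np.array([0.30, -0.07, 0.12, 0.54, -0.02])  # hypothetical study effect sizes
    se = np.array([0.14, 0.09, 0.15, 0.30, 0.10])   # their standard errors

    w = 1.0 / se**2                                 # inverse-variance weights
    d_bar = np.sum(w * d) / np.sum(w)               # fixed-effect summary estimate
    se_bar = 1.0 / np.sqrt(np.sum(w))               # standard error of the summary
    lo, hi = d_bar - 1.96 * se_bar, d_bar + 1.96 * se_bar
    print(f"summary d = {d_bar:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")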


The data in figure 26.8 are sorted as originally presented (Raudenbush and Bryk 1985). The order, in and of itself, does not offer much interpretative value and certainly does not suggest a systematic relationship with an underlying construct. However, these data offer an important moderator variable: the estimated weeks of teacher-student contact prior to expectancy induction. Figure 26.9, which takes advantage of this additional substantive information, was produced using Comprehensive Meta-Analysis Version 2.⁸

Figure 26.9 Forest Plot of Effect Sizes Derived from Eighteen Studies of Teacher Expectancy Effects on Pupil IQ Sorted by Weeks of Teacher-Student Contact Prior to Expectancy Induction
source: Authors' compilation using data from Raudenbush and Bryk 1985.
note: Prior weeks are estimated weeks of teacher-student contact prior to expectancy induction.

Figure 26.9 gains from the forest plot's potential to serve as both a figure and a table. One can elect to present a number of statistics in the middle columns. In this case, we have presented the value of the moderator variable, the standardized effect size, the standard error, and the p-value of the estimate. In both figures, the left-hand tip of the diamond representing the summary effect crosses the zero point, indicating that the overall effect is not statistically different from zero. More important, the ordering of these data illustrates the relationship between the teachers' previous experience with their students and the standardized mean differences between the treatment and control groups. That is, when teachers had more previous experience with students, the researchers' efforts to influence the teachers' expectations tended to be less effective. This single plot illustrates the degree of uncertainty for each study, the contribution of each study to the overall estimate, the overall estimate itself, and the relationship between an important moderator variable and the effect size.

As another example of the use of the forest plot, we use the meta-analytic data from Peter Cohen on the correlations between student ratings of the instructor and student achievement (1981, 1983), which we show as figure 26.10. This figure was also produced using Comprehensive Meta-Analysis Version 2. In this case, the moderator variable is categorical, so we group the data according to the method used to manage pre-intervention differences on student ability, either covariate adjustment or random assignment. The method of ability control was a source of systematic variation across the primary studies and can be clearly illustrated in this graphic representation.
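Grouping by a categorical moderator of this kind amounts to computing the same inverse-variance summary within each subset of studies. A minimal sketch with invented values follows; note that in practice correlations are usually converted to Fisher's z before pooling, a step skipped here to keep the grouping logic visible.

    import numpy as np

    # hypothetical correlations and standard errors by moderator category
    groups = {
        "covariate adjustment": (np.array([0.56, 0.23, 0.49, 0.33]),
                                 np.array([0.22, 0.28, 0.18, 0.16])),
        "random assignment":    (np.array([0.68, 0.58, 0.40]),
                                 np.array([0.30, 0.21, 0.15])),
    }

    for name, (r, se) in groups.items():
        w = 1.0 / se**2                      # inverse-variance weights within group
        mean = np.sum(w * r) / np.sum(w)     # subgroup summary
        se_m = 1.0 / np.sqrt(np.sum(w))
        lo, hi = mean - 1.96 * se_m, mean + 1.96 * se_m
        print(f"{name}: mean = {mean:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")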


Figure 26.10 Forest Plot of Validity Studies Correlating Instructor Ratings and Student Achievement Grouped by Ability Control
source: Authors' compilation using data from Cohen 1983.
note: Cohen (1983); studies with sample sizes less than ten were excluded.

The fourteen covariate-adjustment studies are grouped at the top of the display and followed by a row that presents the mean effect estimate and confidence interval for this subgroup of studies. The random assignment studies, in turn, are grouped with their own subgroup estimates. At the bottom of the figure, the overall effect size and its confidence interval are shown. In addition to portraying the particular outcomes of each study included in the meta-analysis, the figure demonstrates that the effect sizes for both subgroups of studies and for the overall summary effect estimate are positive and statistically different from zero. Also, one can see that the effect size for the random-assignment studies is somewhat larger than the estimate for the studies using covariate adjustment. The 95 percent confidence intervals for the effect sizes for random-assignment and covariate-adjustment studies overlap, however, suggesting that the difference in ability control is substantively, but not statistically, significant.

The same principles that apply to ordering studies in tables can also be fruitfully applied to ordering them in figures, namely forest plots. Depending on the data, one can enhance the potential meaning conveyed by the figure by ordering the studies according to a single underlying scale or by grouping them by a pertinent substantive or methodological characteristic. In addition to the moderating variables we discussed earlier, one could similarly order studies by number of participants, dose of intervention, or other pertinent variables.
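The reordering itself is a one-line operation once the study-level arrays are assembled; the fragment below, with invented values, sorts every array by the moderator so that the rows of a forest plot follow the substantive scale.

    import numpy as np

    weeks = np.array([2, 0, 17, 1, 5])             # hypothetical moderator values
    d = np.array([0.41, 0.90, -0.10, 0.33, 0.05])  # hypothetical effect sizes
    se = np.array([0.12, 0.37, 0.10, 0.14, 0.17])  # hypothetical standard errors

    order = np.argsort(weeks)                      # row order for the plot
    weeks, d, se = weeks[order], d[order], se[order]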

These illustrations highlight the utility of the forest plot format. The format, though, does have its limitations. Other forms of representation, such as the funnel plot or the radial plot, address publication bias assumptions or effect size variations especially well. Also, a different form of representation may be more appropriate for an especially large meta-analysis, particularly to portray the overall variation in effect sizes across the studies. For example, Geoffrey Borman and his colleagues used a stem-and-leaf plot to show the distribution of 1,111 effect sizes gleaned from the comprehensive school reform literature (2003). A forest plot with that many observations would clearly not be feasible. One could, however, divide the numerous observations along key substantive or methodological lines and present them in a single summary figure that displays the outcomes for the constituting components. Subsequent figures could elaborate on each of the parts that comprise the whole. Clarity and accuracy are paramount, and an especially large dataset may call for other measures, such as multiple plots, a stem-and-leaf plot, or a box plot, to be most instructive to the reader.

26.2.4 Visual Display Summary

In the context of a meta-analysis, figures can serve diagnostic purposes, such as assessing publication bias. They can convey information about the individual attributes of studies that provide the data points of the analysis, such as their standard errors. They can provide information regarding the central tendency of the effect-size estimate, its statistical significance, and its variability. They can show how differences in effect size are related to systematic differences across the studies. They can display powerful combinations of meta-analytic information. Used thoughtfully, displays can translate complex data from research syntheses to help researchers, policy makers, practitioners, and lay audiences alike see and understand the core outcomes from the work. Indeed, with the growing demand for quantitative evidence to inform social, educational, and health policy, efficient and effective presentation of meta-analytic data is likely to become increasingly important. As Playfair suggested:

Men of high rank, or active business, can only pay attention to general outlines; nor is attention to particulars of use, any further than as they give a general information; it is hoped that, with the assistance of these Charts, such information will be got, without the fatigue and trouble of studying the particulars of which it is composed. (1801/2005, xiv–xv)

More succinctly, Richard Light and his colleagues noted that simple graphic displays of effect sizes in a meta-analysis can impose order onto what at first may seem like "a morass of data" (1994, 451). They also made some sound recommendations for producing high-quality and informative figures. We echo many of their points and add some suggestions of our own.

The Spirit of Good Graphics Can Even Be Captured in a Table of Raw Effect Sizes, Providing They Are Presented in a Purposeful Way. The majority of published meta-analyses we examined presented catalogues of raw effect-size data or summary effect-size information. A comprehensive table that presents the meta-analytic data gleaned from each of the studies included in the meta-analysis is a useful feature that can facilitate potential replication and reveal the attributes of the individual studies that lie behind the overall summary statistics. Although the prominence of these tables in recently published studies is well deserved, we contend that they are not always used to their full potential. Rather than simply presenting the results from the included studies in alphabetical order by author name, tables can often be ordered more meaningfully and instructively, such as by the effect sizes or by the value of a key substantive or methodological moderator variable. Such reordering often can convey much more useful information than conventional alphabetical or chronological listings. These general principles of useful ordering of effect size data can also be profitably applied to forest plots, given their similarity to tables.

Graphic Displays Should Be Used to Emphasize the Big Picture. Authors making effective use of graphics should start by making clear the main take-home message of the meta-analysis. This can be accomplished by displaying the summary effect size and its variability, or by demonstrating the importance a moderator plays in shaping the results across the studies. Generally, stem-and-leaf plots and schematic plots are especially effective for presenting the big picture. Both figures provide an intuitive display of the distribution of a large number of effect sizes and they facilitate identification of outliers. Once the larger picture is established, the reader can pursue the more detailed questions. Forest plots with summary effect-size indicators can depict the big picture and convey many of the smaller details.

Graphics Are Especially Valuable for Communicating Findings from a Meta-Analysis When the Precision or Confidence of Each Study's Outcome Is Depicted. A number of the formats we have reviewed address the issue of the error inherent in any given study. Whether using a standard error scale in a funnel plot, a precision scale in a radial plot, or a confidence interval bar in a forest plot, recent developments in the representation of meta-analytic findings expressly present both the estimate offered by a single study as well as its error.

Graphic Displays Should Encourage the Eye to Compare Different Pieces of Information. When important moderators influence the outcomes across the studies in the meta-analysis, when distributions of effect sizes found in distinct subgroups of studies appear quite different, and when different analyses of the data reveal variable outcomes, graphic displays can effectively portray these distinctions. The side-by-side graphical representations of two or more subgroups of studies or from several analyses of the same sample of studies can make highly technical issues more intuitive and clear to the reader.

The Selection of Graphical Representation Should Be Informed by One's Purpose and Data. Just as the type of variable in question may inform the decision to use a line graph or a bar graph, the graphs we have reviewed in this chapter perform differently under different conditions. For example, whether the effect is expressed in terms of odds or as a correlation may influence the form of representation one chooses. Funnel plots aptly represent the distribution of effects and are well served when called on to address questions of publication bias. Stem-and-leaf and schematic plots, either alone or in multiple parallel plots, portray the overall results of a meta-analysis and compare key subgroups of studies efficiently. They are particularly useful when considering an especially large group of studies. A forest plot can also illustrate the uncertainty of each effect estimate, the contribution of each study to the overall estimate, the overall estimate itself, the relationship between a relevant variable and the effect size, and some of the information presented in the often-used table of primary studies.

We hope that by presenting this review of the strengths and limitations of a comprehensive set of graphic approaches to portraying meta-analytic data we will have helped readers make informed decisions about how best to communicate their results (for further reading on the principles and history of statistical graphics, see Bertin 1983; Cleveland 1994; Harris 1999; Tufte 1990, 1997, 2001, 2006; Wainer 1984, 1996; Wainer and Velleman 2001).

26.3 INTEGRATING META-ANALYTIC AND NARRATIVE APPROACHES

26.3.1 Strengths and Weaknesses

Among the key strengths of meta-analysis are parsimony, precision, objectivity, and replicability. Using a well-defined process for selecting the primary studies and an accepted methodology for generating summary statistics, a large number of empirical studies in a given substantive area can effectively be reduced to a single estimate of the overall magnitude of an effect or relationship. Two reviewers using the same meta-analytic procedures should develop the same statistical summary. If they use different techniques, the descriptions of the methods should make the differences explicit. When two reviews differ in their results and conclusions, the reader will understand why. Narrative reviews, on the other hand, can describe the results from both quantitative and qualitative studies in a given area, easily discuss the various caveats of the individual studies, and trace the overall evolution of the empirical and theoretical work in the area. What narrative reviews lack in terms of parsimony, precision, and replicability, they can provide in terms of rich and careful description of the various methodological and substantive differences found across a research literature.

26.3.2 Reasons for Combining Meta-Analytic and Narrative Methods

As Harris Cooper and his colleagues suggested, an effective research synthesis should include both analytic and synthetic components (2000). It should make sense of the literature analytically by identifying theoretical, methodological, and practical inconsistencies found across the primary studies and should describe the evolution of the empirical literature and help the reader understand how the current research synthesis advances that evolution. These analytical contributions may take the form of narrative descriptions, but they might also suggest empirical assessments of how such inconsistencies in the literature contribute to differences in the observed effect sizes. The synthetic component of the research synthesis draws together all evidence in the domain under investigation to provide parsimonious and precise answers to questions about the overall effects found across the included studies and about how the various methodological, theoretical, and practical differences from the studies moderate the observed outcomes.

qualitative and quantitative forms of information would and describing meta-analytic findings. Rather than re-
ultimately maximize knowledge (1982). Thomas Cook stricting their synthesis to a somewhat homogeneous col-
and Laura Leviton took a similar stance: lection of the best evidence, as Slavin advised (1986), the
researchers applied narrative description and meta-ana-
Science needs to know its stubborn, dependable, general lytic integration to the larger literature. In this case, the
facts, and it also needs data-based, contingent puzzles that authors illustrated how the analyst can consider all of the
push ahead theory. Our impression is that meta-analysts evidence and, using both statistical and narrative exposi-
stress the former over the latter, and that many qualitative tion, identify those studies that provide the best answer to
reviewers stress the latter more than the former (or at least
the broader synthetic question of the overall effect of
more than meta-analysts do). Of course, neither the meta-
analyst nor the qualitative reviewer needs to make either summer school, as well as respond to the analytical chal-
prioritization. Each can do both and any one reviewer can lenge of identifying the features of summer programs
consciously use both qualitative and quantitative techniques most consistently associated with improved student out-
in the same review. Indeed, s/he should. (1980, 468) comes.
Though both methods are good examples of existing
Beyond this rhetoric, there are clear examples in the frameworks for integrating meta-analytic and narrative
literature of researchers who have since developed the reviews, we contend that there has been little direct guid-
mixed-method approaches these earlier authors advo- ance to help researchers understand the rationales and
cated. Robert Slavin, for instance, has suggested an alter- methods for achieving this integration. In the sections
native form of research synthesis that he calls best-evi- that follow, we describe three central reasons why the
dence synthesis (1986). This method uses the systematic analyst may choose such an approach to research synthe-
inclusion criteria and effect-size computations typical of sis and provide examples of how one may accomplish it.
meta-analyses, but discusses the findings of critical stud- 26.3.2.1 Describing Results The first notable prob-
ies in a form more typical of narrative reviews. Slavin has lem one might encounter when attempting to synthesize
applied this method to the synthesis of a number of edu- a body of evidence is that not all studies are necessarily
cational practices and interventions, including, most re- amenable to traditional forms of meta-analysis. If one
cently, the language of instruction for English language encounters an important study, or set of studies, that do
learners (Slavin and Cheung 2005). He proposed the not provide the details necessary to compute an effect
method largely in response to the problem of mixing ap- size, what is the suggested course of action? What about
ples and oranges some critics of meta-analysis had voiced. studies that provide descriptive statistics or qualitative
The best evidence approach avoids aggregating findings description and no effect-size data? Typically, such stud-
of varying phenomena and methodology and focuses on ies are omitted and may remain unreported by the re-
only the best evidence to address the synthetic questions. searcher, regardless of their importance and potential
Further, narrative exposition details the results of the in- contributions to understanding the phenomena under re-
cluded studies so that the reader gains a better apprecia- view. We maintain, however, that these studies might
tion for the context and meaning of the synthetic findings. offer important information that can and should be in-
This method also seems especially relevant when dis- cluded in research syntheses, of which a meta-analysis is
cussing the relative effect sizes of various interventions a chief element.
that might be of interest to policy makers and practitio- Narrative description of some research that is not ame-
ners. Clear description of the programmatic methods be- nable to meta-analysis can help the reader understand the
hind the summary measures can enable the practices that broader literature in an area in a way that supplements or
produced positive results in previous trials to be repli- surpasses the overall mean effect size. First, in an area
cated and the proper context to supporting implementa- with limited research evidence, discussion of these studies
tion of the intervention to be configured. and presentation of some synthetic information regarding,
A second example comes in a monograph describing for instance, the observed direction of effects, can offer
the meta-analytic results from a synthesis of various types the reader some preliminary and general ideas about the
of summer school programs to promote improved student past findings. It does not offer the same precision of a
outcomes, especially achievement (Cooper et al. 2000). weighted effect size, but in combination with other evi-
In this monograph, Cooper and his colleagues illustrated dence may be valuable and informative. For instance, the
the use of a combination of novel methods for conducting Cooper and colleagues synthesis of the summer school
VISUAL AND NARRATIVE INTERPRETATION 513

For instance, the Cooper and colleagues' synthesis of the summer school literature included information gleaned from thirty-nine reports that did not include enough statistical data to calculate an effect size (2000). The authors, however, productively included this research by generating information concerning the general direction of the achievement effects observed in the primary studies (in this case, all positive, mostly positive, even, mostly negative, and all negative). This directional information reinforced the meta-analytic finding that summer school positively influences student achievement. Although such vote-counting procedures can be misleading on their own (Hedges and Olkin 1980), by combining careful narrative description with vote-count estimates of effect size, the authors were able to assemble enough analytical and synthetic details to provide evidence about overall effects and the potential moderators of those effects. The narrative description of these results can also identify the limitations of the previous research base and suggest new directions researchers might take to improve the rigor of future studies, so that traditional meta-analytic procedures can be applied to new work in the area.
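To make the vote-counting idea concrete, the sketch below tallies the direction of effects in a set of primary studies and applies a simple two-sided sign test. It is a minimal illustration with invented data, not a reproduction of the Cooper et al. (2000) analysis; the coded directions are hypothetical.

```python
from collections import Counter
from math import comb

# Hypothetical directions coded from reports lacking effect-size data:
# +1 = favored treatment, -1 = favored control, 0 = no clear difference.
directions = [+1, +1, +1, 0, +1, -1, +1, +1, 0, +1, +1, -1, +1, +1]

tally = Counter(directions)
print(f"positive: {tally[+1]}, negative: {tally[-1]}, even: {tally[0]}")

# Two-sided sign test on the non-tied studies: under the null hypothesis
# of no effect, positive and negative directions are equally likely.
pos, neg = tally[+1], tally[-1]
n = pos + neg
k = max(pos, neg)
p_two_sided = min(1.0, 2 * sum(comb(n, i) for i in range(k, n + 1)) / 2**n)
print(f"sign test: {pos} of {n} directional results positive, p = {p_two_sided:.3f}")
```

As the chapter cautions, such counts ignore study size and effect magnitude, so they are best reported alongside, not instead of, effect-size evidence.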
Detailed narrative descriptions of the specific interventions or phenomena under consideration can also facilitate future replications of the synthesized effects by both practitioners and researchers. Describing exemplary practices used in particular studies can help explain differences in the effects observed and can open the "black box" to provide practitioners and policy makers with clear suggestions of how they might improve related policies and programs.

26.3.2.2 Providing Context

In some circumstances, authors are faced with the challenge of making sense of the results gleaned from studies included in a given substantive area that differ in terms of many attributes, including: publication type (journal article, dissertation, external evaluation, or internal evaluation); methods (pretest-posttest comparison, experimental comparison, or nonequivalent control-group design); samples (at-risk, average, or gifted students); actual characteristics of the phenomena or interventions (academic remediation or enrichment program); and indicators of the outcomes of the phenomena or interventions (teacher-constructed assessments or standardized tests). Differences such as these are commonly found in the social sciences and in education, and pose methodological challenges for the analyst. Some choose to pool findings from numerous available primary studies with little regard to the varying methodologies and relevance to the issue at hand. Others, using highly restrictive criteria for the inclusion of primary studies, simply choose to omit large portions of the literature. How can the analyst contend with these differences but also take advantage of this natural variation across studies to describe how methodological and substantive differences may shape the outcomes?

On one hand, Gene Glass maintained that "it is an empirical question whether relatively poorly designed studies give results significantly at variance with those of the best designed studies" (1976, 4). On the other, Slavin argued that "far more information is extracted from a large literature by clearly describing the best evidence on a topic than by using limited journal space to describe statistical analyses of the entire methodologically and substantively diverse literature" (1986, 7). Should the researcher combine studies that used varying methods and are characterized by varying substantive characteristics, or should one focus only on what is considered the best evidence?

In truth, the answer probably depends most on the meta-analyst's research questions. If one wants to know the effect of an intervention for a specific period and based on data from only the most rigorous experimental studies, the analyst should choose the small number of studies that fit that description. However, if one were interested in studying how much the effect varied depending on treatment duration, and in assessing the difference in effect estimates from experimental and quasi-experimental studies, then the analyst can learn considerably more by systematically investigating the variance in effect magnitudes across a group of studies that differ along these dimensions.

By integrating narrative and quantitative techniques for research synthesis, we believe that authors can clearly outline and discuss the complexities of a literature, and describe how the known differences across the studies relate to variation in the effect estimates. In this way, authors can productively use the variability across studies to qualify the conditions under which effects are larger or smaller, and, through synthetic analyses of moderators of effect size, can quantify precisely how substantial the differences in the outcome are.

Two recent applications of this approach are noteworthy. First, in their meta-analysis of the achievement effects of summer school programs for K–12 students, Harris Cooper and his colleagues assumed that the summer programs to be included in their synthesis would vary, and that the research methods for studying these summer programs would as well (2000). Indeed, they hypothesized about the potential differences in effect estimates
associated with the numerous identifiable (and unidentifiable) variations noted in the summer school literature: "We could continue to posit predictions. Most of the multitude of permutations and interacting influences, however, have no or few empirical testings described in the research literature" (Cooper et al. 2000, 17). How they dealt with this randomness, the numerous moderators of effect sizes, and the general complexities of this literature are instructive.

First, they contrasted underused (at least in the social sciences) random-effects analyses of their data with more conventional fixed-effects analyses. Second, they used an infrequently used technique for adjusting effect sizes to remove the covariance between methodological and substantive characteristics. Third, they showed how an analyst may take advantage of subgroups of studies that contain substantively interesting variations within them. For instance, in looking at a subgroup of studies that measured both math and reading outcomes, the authors were able to draw stronger inferences about differing impacts of summer school programs on math and reading achievement because most, if not all, other variations were held constant (for a discussion of the advantages of within-study moderators, see chapter 24, this volume). Finally, the authors clearly outlined and discussed the complexities of this literature, and described how the known differences across the studies affected summer school effect estimates, thereby enriching the quantitative results with narrative description. In this way, the authors productively used the variability across studies to qualify the conditions under which summer programs are more successful and to quantify precisely how much more successful they are. This combination of conventional meta-analytic procedures, relatively novel integrative procedures, and narrative description provides a strong practical example of how a complex literature should be integrated and reported.
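To illustrate the fixed- versus random-effects contrast mentioned above, the sketch below computes an inverse-variance fixed-effect mean and a DerSimonian-Laird random-effects mean for a handful of effect sizes. It is a minimal sketch with invented data and is not the Cooper et al. (2000) analysis.

```python
# Hypothetical standardized mean differences and their sampling variances.
d = [0.35, 0.10, 0.42, 0.05, 0.28]
v = [0.02, 0.05, 0.03, 0.01, 0.04]
k = len(d)

# Fixed-effect model: weight each study by the inverse of its variance.
w = [1 / vi for vi in v]
fixed_mean = sum(wi * di for wi, di in zip(w, d)) / sum(w)

# DerSimonian-Laird estimate of the between-study variance (tau^2).
Q = sum(wi * (di - fixed_mean) ** 2 for wi, di in zip(w, d))
C = sum(w) - sum(wi**2 for wi in w) / sum(w)
tau2 = max(0.0, (Q - (k - 1)) / C)

# Random-effects model: add tau^2 to each study's variance before weighting.
w_star = [1 / (vi + tau2) for vi in v]
re_mean = sum(wi * di for wi, di in zip(w_star, d)) / sum(w_star)

print(f"fixed-effect mean d = {fixed_mean:.3f}")
print(f"tau^2 = {tau2:.3f}, random-effects mean d = {re_mean:.3f}")
```

When tau-squared is near zero the two means converge; when the studies truly differ, the random-effects mean and its wider interval better reflect that heterogeneity.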
As Glass suggested, by empirically examining a diverse range of studies, we may assess how and to what extent methodological differences across the primary studies are associated with differences in the observed outcomes (1976). When outcomes are robust across studies of varying methodologies, one can be more confident in the conclusions. On the other hand, if studies differ in terms of both rigor and results, then one can focus on the subset of more rigorous studies when formulating conclusions. This analysis of the consequences of methodological variations for the estimation of the effects, which is unique to the enterprise of meta-analysis, allows methodologists and consumers of the research literature to recognize the biases in the literature and to understand empirically both their frequency and magnitude.

The Borman et al. meta-analysis of research on the achievement effects of twenty-nine of the most widely disseminated comprehensive school reform (CSR) models is a noteworthy example of this approach (2003). The authors began by assessing how and to what extent various methodological differences (for example, experimental or nonexperimental designs, and internal or third-party evaluations of the reform programs) shaped understandings of the overall achievement effects of CSR. Specifically, the preliminary analysis empirically identified and quantified the methodological biases in the literature and, in general, characterized the overall quality of the research evidence.

After characterizing the overall CSR research base through a combination of narrative description and synthetic analysis, Borman and his colleagues focused on only the subgroup of studies that provided the best evidence for evaluating the effectiveness of each of the twenty-nine CSR models: experimental and matched comparison group studies that were conducted by third-party evaluators who were independent from the CSR program developers (2003). The authors determined which studies provided the best evidence, not by a priori judgments or by other potentially subjective criteria, but by empirical analyses and narrative description of the CSR literature's methodological biases. The final analyses relied on only the best evidence, but the additional analyses and descriptions were also instructive because the methodological biases were clearly identified, quantified, and discussed.

These two examples suggest how analysts can not only contend with a highly variegated research literature, but also effectively model and describe its differences to help advance the field. Cooper et al. used a combination of narrative description and a series of synthetic analyses to help describe how practitioners can make the most of summer school programs for students (2000). Borman et al. productively modeled and described the contextual and methodological variability within the school reform literature to better understand the conditions under which the policy was most productive, and to evaluate empirically and discuss the methodological biases within the research base (2003). Both syntheses helped move their respective fields forward in ways that would not have been possible had the researchers simply engaged in a synthesis of a relatively homogeneous set of studies meeting specific a priori criteria.
26.3.2.3 Interpreting the Magnitude of Effect Sizes

Beyond testing the statistical significance of empirical findings, it has become increasingly common practice to report the magnitude of an effect or the strength of a relationship using effect sizes. Indeed, the most recent American Psychological Association publication manual lists the failure to report effect sizes as a defect in reporting research (American Psychological Association 2001). Relative to reporting only the statistical significance of research findings, reporting the magnitude of effects and relationships is certainly an advance. However, how should one interpret and describe the practical significance of effect sizes? Typically, effect-size estimates are interpreted in two ways. One is to rely on commonly accepted benchmarks that differentiate small, medium, and large effects. A second method is to explicitly compare the reported effect size to those reported in earlier but similar studies.

The most established benchmarks are those presented by Jacob Cohen, where an effect size, d, of 0.20 is interpreted as a small effect, 0.50 equates to a medium effect, and effects larger than 0.80 are regarded as large effects (1988). Cohen provided these effect size values to serve as "operational definitions of the quantitative adjectives small, medium, and large" (12). However, even Cohen has expressed reservations about the general application of these benchmarks:

These proposed conventions were set forth throughout with much diffidence, qualifications, and invitations not to employ them if possible. The values chosen had no more reliable a basis than my own intuition. (1988, 532)

Despite these provisos, Lipsey's ambitious summary of the distribution of mean effect sizes from 102 meta-analyses of psychological, educational, and behavioral treatments provided consistent empirical support for Cohen's "self-confessed intuitions" (1990). Lipsey's findings suggested that the bottom third of this distribution ranged from d = 0.00 to d = 0.32, the middle third ranged from d = 0.33 to d = 0.55, and the top third ranged from d = 0.56 to d = 1.20. The median effect sizes for the low, middle, and top third of the distribution were d = 0.15, d = 0.45, and d = 0.90, respectively, which might also be used as general standards.
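As a quick illustration, the function below labels a standardized mean difference against both Cohen's conventions and Lipsey's empirical terciles as reported above. It is a minimal sketch; the cutoffs simply transcribe the benchmarks quoted in the text, and the "below small" label is our own shorthand.

```python
def label_effect(d: float) -> dict:
    """Label an effect size d against Cohen's and Lipsey's benchmarks."""
    d = abs(d)
    # Cohen (1988): 0.20 small, 0.50 medium, 0.80 large.
    if d < 0.20:
        cohen = "below small"
    elif d < 0.50:
        cohen = "small"
    elif d < 0.80:
        cohen = "medium"
    else:
        cohen = "large"
    # Lipsey (1990): terciles of mean effects from 102 meta-analyses.
    if d <= 0.32:
        lipsey = "bottom third (median 0.15)"
    elif d <= 0.55:
        lipsey = "middle third (median 0.45)"
    else:
        lipsey = "top third (median 0.90)"
    return {"cohen": cohen, "lipsey": lipsey}

print(label_effect(0.20))  # {'cohen': 'small', 'lipsey': 'bottom third (median 0.15)'}
```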
Although the Jacob Cohen and Mark Lipsey benchmarks provide relatively consistent advice to researchers to help them describe in more practical terms the magnitude of an effect size, we believe that more nuanced descriptions are possible (Cohen 1988; Lipsey 1990). One alternative would be to compare an effect size to earlier reported effect sizes from studies involving similar samples, comparable outcomes, and related phenomena. For instance, a researcher might study the impact of an innovative classroom-based intervention for struggling readers that, when contrasted to a no-treatment control condition, has an effect size of d = 0.20, a small effect size based on Cohen's benchmarks. The savvy reader, however, may be particularly interested in how this intervention's effect compares to those of other interventions, including competing classroom-based reading programs, or an alternate policy such as one-on-one tutoring in literacy. If these alternate interventions produce roughly similar effects, the reading program in question could be interpreted as having rather typical, or medium, effects. Another consideration, particularly in the policy arena, is cost. If the reading program and the one-on-one tutoring program appear to produce similar outcomes, but the per-pupil cost of tutoring is substantially higher, then the reading program's effects may no longer appear small nor typical, but optimal. Thus, the judgment of magnitude can be enriched or even altered by comparing the meta-analytic results with a wide swath of prior research.
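The cost comparison in the reading-program example can be made concrete with a simple cost-effectiveness ratio. The figures below are invented for illustration only.

```python
# Hypothetical programs: (effect size d, cost per pupil in dollars).
programs = {
    "classroom reading program": (0.20, 150),
    "one-on-one tutoring":       (0.22, 1100),
}

for name, (d, cost) in programs.items():
    # Dollars required to "buy" 0.10 standard deviation of achievement.
    per_tenth_sd = cost / (d / 0.10)
    print(f"{name}: d = {d:.2f}, ${cost}/pupil, ${per_tenth_sd:,.0f} per 0.10 SD")
```

On this yardstick, the nominally "small" classroom effect looks far more attractive than its Cohen label alone would suggest.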
Indeed, Gene Glass, Barry McGaw, and Mary Lee Smith were particularly supportive of this approach, arguing that the effectiveness of a particular intervention can be interpreted only in relation to other interventions that seek to produce the same effect (1981). In addition, Cooper suggested that the researcher should use multiple yardsticks to help frame and interpret the practical importance of the noted effects (1981). More recently, Carolyn Hill and her colleagues argued that instead of universal standards, such as the Cohen benchmarks, one must interpret magnitudes of effects in the context of the interventions being studied, of the outcomes being measured, and of the samples or subsamples being examined (2007). They also suggested different frames of reference, including information about effect size distributions, other external performance criteria, expected or normative change on a given outcome, and existing performance gaps among subgroups (for example, the magnitude of the black-white achievement gap, or the female-male wage gap).

We concur that these types of comparisons to other benchmarks or yardsticks are very informative for helping readers understand the practical significance of reported effect sizes. Whenever possible, such information should be discussed and reported as part of the narrative description of the effect of a phenomenon to help contextualize and describe its impacts.

26.3.3 Integrating Approaches: Summary

Bodies of empirical research in the social sciences and education are often characterized by caveats and complexities that make solely synthetic approaches to research synthesis overly reductive and simplistic. In an effort to summarize key findings from the literature that are not amenable to meta-analytic integration, to further clarify and describe a variegated research literature, and to offer greater clarity and interpretation on the magnitude and practical significance of effect-size data, we believe that research syntheses can often benefit from the integration of narrative and quantitative forms of synthesis.

We offer several key recommendations for augmenting synthetic procedures with narrative methods of research synthesis:

Be Mindful When Considering Whether to Drop a Study from the Review Sample, Because a Study Can Be Useful Even if It Does Not Offer an Effect Size for Inclusion in a Meta-Analysis. Bad evidence is worse than no evidence, but even when a literature provides few results that are amenable to traditional forms of meta-analysis, the analyst can produce some gains in the accumulation of knowledge by carefully synthesizing the results that do exist and by describing their limitations so that they will be properly interpreted and so that future research in the area may be better designed.

Heed the 1981 Advice of Glass and His Colleagues to Analyze the Full Complement of Potential Studies, Consider How Their Differences May Inform the Results, and Then Describe to the Reader How Those Results Came to Be. Meta-analysis can help researchers understand variability in effects noted across a given substantive area, but this understanding is incomplete without careful blending of narrative description of the methodological and substantive factors that are associated with the differences. Although some analysts cull potential studies on the basis of substantive moderators and potential sources of bias, we generally favor including studies that might influence the results of a synthesis and then empirically examining their impact. Including them not only identifies how sensitive the results are to the composition of the analytical sample, but also can lead to a fruitful discussion of the influence of moderating variables (which can, of course, then be illustrated visually). However, this objective examination of all the available evidence in lieu of applying subjective criteria for eliminating studies necessitates narrative description of how and why such study differences contribute to variability in effect size data. When this is appropriate and when it is done well, analysts can contribute far more to the field than they might have by censoring the included research.

Help Readers Understand Effect Size Results by Placing Them in Context. A meta-analysis that uses narrative description to help interpret effect-size data in context offers far more to the field than simple summaries and comparisons to general standards of large, medium, and small effects. We suggest various yardsticks for understanding the magnitudes of effect sizes (Cooper 1981), including: Cohen's U3 (1981); information gleaned from compendia of effect size data from particular fields of study, such as those Lipsey and Wilson summarize (1993); effects of other interventions designed for groups or individuals who are similar to those included in the current meta-analysis; standardized mean differences that exist among pertinent subgroups in similar untreated samples or populations (for example, the black-white test score gap); existing standards for normative growth or change in the outcome that may be derived from data for similar untreated samples or populations (for example, the expected achievement gain associated with one grade level or school year); and the use of cost data in conjunction with effect size information.
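One of these yardsticks, Cohen's U3, translates a standardized mean difference into the percentage of the comparison population scoring below the average treated individual, assuming roughly normal outcomes. The sketch below computes it with the Python standard library; the example value is arbitrary.

```python
from statistics import NormalDist

def u3(d: float) -> float:
    """Cohen's U3: proportion of the comparison group scoring below the
    mean of the treated group, assuming normal distributions with equal
    variances and a standardized mean difference of d."""
    return NormalDist().cdf(d)

# Example: d = 0.45, the median of Lipsey's middle third.
print(f"U3 = {u3(0.45):.1%}")  # about 67% of the comparison group
```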
26.4 CONCLUSION

This chapter has outlined how recent meta-analytic work has used graphical representations and narrative descriptions to enhance the interpretation of effect size data. Although optimal use of figures and integration of narrative synthesis have not been commonplace in the meta-analytic literature, we have observed several recent developments that suggest that future work may more regularly feature these elements. First, with respect to the graphical representation of effect size data, we are encouraged to note that published meta-analyses in Psychological Bulletin have more regularly included figures from 2000 through 2005 than they did from 1985 through 1991, the period reported previously by Light, Singer, and Willett (1994).

Second, use of high-quality figures that effectively present the big picture is also on the rise in certain contexts. Most notably, the Cochrane Collaboration and its social sciences counterpart, the Campbell Collaboration, have helped advance the use of the forest plot in their systematic reviews.
We found no published examples of forest plots in Psychological Bulletin and Review of Educational Research but are hopeful that the work of the Campbell Collaboration and its increasing influence on the research communities from the social sciences and education will affect future published work. The forest plot, which is effective for presenting a comprehensive picture along with the finer details, at least for a manageable number of studies included in the analysis, is the current state-of-the-art graphic representation of effect size data.
We believe that meta-analysis has largely overcome early criticisms suggesting that the methodology failed to distinguish and account for important methodological and substantive differences that can affect the synthetic outcomes. Instead, high-quality syntheses, when appropriate, explicitly model the effects of these methodological and substantive moderators and interpret them. These types of analyses are important contributions of meta-analysis that extend beyond simple reports of mean effect sizes and contribute stronger understandings of how and why variegated research literatures yield differential effect estimates. Further, not all results are amenable to conventional meta-analytic procedures. In some cases, though, one may learn a great deal by summarizing these outcomes and by describing the contexts under which they were observed. If nothing else, these cases can be illustrative to the field in suggesting what is needed in the future to help produce better evidence that is cumulative.

Finally, along with graphical representation of effect size data, narrative interpretation of the magnitude and practical significance of effect size data is a key consideration to help make meta-analytic results more informative, intuitive, and policy-relevant. We urge those conducting research syntheses in the social sciences not only to develop the big picture and its finer details through the use of figures, but also to take great care to put the effects in context. Although Jacob Cohen's original benchmarks for judging the magnitudes of effect sizes have provided some useful guidance, researchers must go beyond these very general yardsticks and consider others, such as the effects of comparable phenomena on similar outcomes involving similar populations, along with other estimates of expected or normative change on the outcomes of interest (1988).

26.5 NOTES

1. Other graphs included scatter plots, line graphs, box or schematic plots, quantile plots, and path analysis diagrams.
2. The radial plot is sometimes referred to as a Galbraith plot, after the statistician who created it.
3. The L'Abbé plot (1987) is another method of representing dichotomous outcome data that has gained favor in the clinical and epidemiological literature (see, for example, Jiménez et al. 1997; Song 1999).
4. The word occasionally appears as "Forrest" plot, for reasons that are unknown and perhaps apocryphal (Lewis and Clarke 2001). Here we follow the widely used "forest plot" spelling.
5. The Cochrane Library is a collection of evidence-based medicine databases that can be accessed at http://www.thecochranelibrary.com or http://www.cochrane.org. As of May 2008, the Cochrane Library's database included 5,320 records, consisting of 3,464 complete reviews and protocols for a further 1,856 reviews (http://www3.interscience.wiley.com/cgi-bin/mrwhome/106568753/ProductDescriptions.html). The Cochrane Review Manager (RevMan), which performs the analysis and generates graphic representations, is available free of charge to those who prepare reviews for the Cochrane Library and to others who wish to use it for noncommercial purposes. Active editorial oversight, strict design parameters, and a standardized tool account for the uniformity of the reviews in the Cochrane Library.
6. The Campbell Collaboration can be accessed at http://www.campbellcollaboration.org.
7. Metagraph was written by Dr. Adrian Mander, senior biostatistician at MRC Human Nutrition Research, Cambridge, U.K. Stata includes a suite of commands for conducting meta-analysis and representing findings, including meta, metan, and metareg; see Sterne et al. (2007) for more information.
8. The Comprehensive Meta Analysis (CMA) software is produced by Biostat, Inc. A stand-alone plotting engine for graphics is anticipated in 2007 that will allow the user to modify rows, cells, and columns, create .pdf files, and work with values computed in other programs; an updated version of the CMA program is expected to be released in 2008. A free trial of the software is available for download at http://www.Meta-Analysis.com.

26.6 REFERENCES
Albarracín, Dolores, Jeffrey C. Gillette, Allison N. Earl, Laura R. Glasman, Marta R. Durantini, and Moon-Ho Ho. 2005. "A Test of Major Assumptions About Behavior Change: A Comprehensive Look at the Effects of Passive and Active HIV-Prevention Interventions Since the Beginning of the Epidemic." Psychological Bulletin 131(6): 856–97.
American Psychological Association. 2001. Publication Manual of the American Psychological Association, 5th ed. Washington, D.C.: American Psychological Association.
Begg, Colin B., and Madhuchhanda Mazumdar. 1994. "Operating Characteristics of a Rank Correlation Test for Publication Bias." Biometrics 50(4): 1088–101.
Bertin, Jacques. 1983. Semiology of Graphics, translated by William J. Berg. Madison: University of Wisconsin Press.
Borman, Geoffrey D., Gina M. Hewes, Laura T. Overman, and Shelly Brown. 2003. "Comprehensive School Reform and Achievement: A Meta-Analysis." Review of Educational Research 73(2): 125–230.
Cleveland, William S. 1994. The Elements of Graphing Data, rev. ed. Summit, N.J.: Hobart Press.
Cohen, Jacob. 1988. Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, N.J.: Lawrence Erlbaum.
Cohen, Peter A. 1981. "Student Ratings of Instruction and Student Achievement: A Meta-Analysis of Multisection Validity Studies." Review of Educational Research 51(3): 281–309.
Cohen, Peter A. 1983. "Comment on 'A Selective Review of the Validity of Student Ratings of Teaching.'" Journal of Higher Education 54(4): 449–58.
Cook, Thomas D., and Laura Leviton. 1980. "Reviewing the Literature: A Comparison of Traditional Methods with Meta-Analysis." Journal of Personality 48(4): 449–72.
Cooper, Harris M. 1981. "On the Significance of Effects and the Effects of Significance." Journal of Personality and Social Psychology 41(5): 1013–18.
Cooper, Harris M., Kelly Charlton, Jeff C. Valentine, and Laura Muhlenbruck. 2000. "Making the Most of Summer School: A Meta-Analytic and Narrative Review." Monographs of the Society for Research in Child Development 65(1): 1–118.
Dickerson, Sally S., and Margaret E. Kemeny. 2004. "Acute Stressors and Cortisol Responses: A Theoretical Integration and Synthesis of Laboratory Research." Psychological Bulletin 130(3): 355–91.
Egger, Matthias, George Davey Smith, Martin Schneider, and Christoph Minder. 1997. "Bias in Meta-Analysis Detected by a Simple, Graphical Test." British Medical Journal 315(7109): 629–34.
Ehri, Linnea C., Simone R. Nunes, Steven A. Stahl, and Dale M. Willows. 2001. "Systematic Phonics Instruction Helps Students Learn to Read: Evidence from the National Reading Panel's Meta-Analysis." Review of Educational Research 71(3): 393–447.
Eysenck, Hans J. 1978. "An Exercise in Mega-Silliness." American Psychologist 33(5): 517.
Freiman, J. A., T. C. Chalmers, H. Smith, and R. R. Kuebler. 1978. "The Importance of Beta, the Type II Error and Sample Size in the Design and Interpretation of the Randomized Control Trial: Survey of 71 'Negative' Trials." New England Journal of Medicine 299(13): 690–94.
Galbraith, Rex F. 1988a. "Graphical Display of Estimates Having Different Standard Errors." Technometrics 30(3): 271–81.
Galbraith, Rex F. 1988b. "A Note on Graphical Presentation of Estimated Odds Ratios from Several Clinical Trials." Statistics in Medicine 7(8): 889–94.
Galbraith, Rex F. 1994. "Some Applications of Radial Plots." Journal of the American Statistical Association 89(428): 1232–42.
Gershoff, Elizabeth Thompson. 2002. "Corporal Punishment by Parents and Associated Child Behaviors and Experiences: A Meta-Analytic and Theoretical Review." Psychological Bulletin 128(4): 539–79.
Glass, Gene V. 1976. "Primary, Secondary, and Meta-Analysis of Research." Educational Researcher 5(10): 3–8.
Glass, Gene V., Barry McGaw, and Mary Lee Smith. 1981. Meta-Analysis in Social Research. Beverly Hills, Calif.: Sage Publications.
Gray-Little, Bernadette, and Adam R. Hafdahl. 2000. "Factors Influencing Racial Comparisons of Self-Esteem: A Quantitative Review." Psychological Bulletin 126(1): 26–54.
Harris, Robert L. 1999. Information Graphics: A Comprehensive Illustrated Reference. New York: Oxford University Press.
Hedges, Larry V., and Ingram Olkin. 1985. Statistical Methods for Meta-Analysis. Orlando, Fla.: Academic Press.
Hill, Carolyn J., Howard S. Bloom, Alison Rebeck Black, and Mark W. Lipsey. 2007. "Empirical Benchmarks for Interpreting Effect Sizes in Research." MDRC Working Papers on Research Methodology. New York: MDRC. http://www.mdrc.org/publications/459/full.pdf
Jiménez, F. Javier, Eliseo Guallar, and José M. Martín-Moreno. 1997. "A Graphical Display Useful for Meta-Analysis." European Journal of Public Health 7(1): 101–05.
Kumkale, G. Tarcan, and Dolores Albarracín. 2004. "The Sleeper Effect in Persuasion: A Meta-Analytic Review." Psychological Bulletin 130(1): 143–72.
L'Abbé, Kristan A., Allan S. Detsky, and Keith O'Rourke. 1987. "Meta-Analysis in Clinical Research." Annals of Internal Medicine 107(2): 224–33.
Lewis, Steff, and Mike Clarke. 2001. "Forest Plots: Trying to See the Wood and the Trees." British Medical Journal 322(7300): 1479–80.
Lewis, J. A., and S. H. Ellis. 1982. "A Statistical Appraisal of Post-Infarction Beta-Blocker Trials." Primary Cardiology 1(Suppl.): 31–37.
Light, Richard J., and David B. Pillemer. 1984. Summing Up: The Science of Reviewing Research. Cambridge, Mass.: Harvard University Press.
Light, Richard J., Judith D. Singer, and John B. Willett. 1994. "The Visual Presentation and Interpretation of Meta-Analyses." In The Handbook of Research Synthesis, edited by Harris Cooper and Larry V. Hedges. New York: Russell Sage Foundation.
Lipsey, Mark W. 1990. Design Sensitivity: Statistical Power for Experimental Research. Newbury Park, Calif.: Sage Publications.
Moyer, Christopher A., James Rounds, and James W. Hannum. 2004. "A Meta-Analysis of Massage Therapy Research." Psychological Bulletin 130(1): 3–18.
Playfair, William. 1801/2005. The Commercial and Political Atlas and Statistical Breviary, edited by Howard Wainer and Ian Spence. New York: Cambridge University Press.
Raudenbush, Stephen W. 1984. "Magnitude of Teacher Expectancy Effects on Pupil IQ as a Function of the Credibility of Expectancy Induction: A Synthesis of Findings from 18 Experiments." Journal of Educational Psychology 76(1): 85–97.
Raudenbush, Stephen W., and Anthony S. Bryk. 1985. "Empirical Bayes Meta-Analysis." Journal of Educational Statistics 10(2): 75–98.
Schwarzer, Guido, Gerd Antes, and Martin Schumacher. 2002. "Inflation of Type I Error Rate in Two Statistical Tests for the Detection of Publication Bias in Meta-Analyses with Binary Outcomes." Statistics in Medicine 21: 2465–77.
Slavin, Robert E. 1984. "Meta-Analysis in Education: How Has It Been Used?" Educational Researcher 13(8): 6–27.
Slavin, Robert E. 1986. "Best-Evidence Synthesis: An Alternative to Meta-Analytical and Traditional Reviews." Educational Researcher 15(9): 5–11.
Slavin, Robert E., and Alan Cheung. 2005. "A Synthesis of Research on Language of Reading Instruction for English Language Learners." Review of Educational Research 75(2): 247–84.
Song, Fujian. 1999. "Exploring Heterogeneity in Meta-Analysis: Is the L'Abbé Plot Useful?" Journal of Clinical Epidemiology 52(8): 725–30.
Sterne, Jonathan A. C., and Matthias Egger. 2001. "Funnel Plots for Detecting Bias in Meta-Analysis: Guidelines on Choice of Axis." Journal of Clinical Epidemiology 54(10): 1046–55.
Tufte, Edward R. 1990. Envisioning Information. Cheshire, Conn.: Graphics Press.
Tufte, Edward R. 1997. Visual Explanations: Images and Quantities, Evidence and Narrative. Cheshire, Conn.: Graphics Press.
Tufte, Edward R. 2001. The Visual Display of Quantitative Information, 2nd ed. Cheshire, Conn.: Graphics Press.
Tufte, Edward R. 2006. Beautiful Evidence. Cheshire, Conn.: Graphics Press.
Tukey, John W. 1977. Exploratory Data Analysis. Reading, Mass.: Addison-Wesley.
Tukey, John W. 1988. The Collected Works of John W. Tukey, Volume V: Graphics 1965–1985, edited by William S. Cleveland. Pacific Grove, Calif.: Wadsworth and Brooks.
Turkheimer, Eric, and Mary Waldron. 2000. "Nonshared Environment: A Theoretical, Methodological, and Quantitative Review." Psychological Bulletin 126(1): 78–108.
Wainer, Howard. 1984. "How to Display Data Badly." The American Statistician 38(2): 137–47.
Wainer, Howard. 1996. "Depicting Error." The American Statistician 50(2): 101–11.
Wainer, Howard, and Paul F. Velleman. 2001. "Statistical Graphics: Mapping the Pathways of Science." Annual Review of Psychology 52: 305–35.
Wang, Morgan C., and Brad J. Bushman. 1998. "Using the Normal Quantile Plot to Explore Meta-Analytic Data Sets." Psychological Methods 3(1): 46–54.
Whitehead, Anne, and John Whitehead. 1991. "A General Parametric Approach to the Meta-Analysis of Randomized Clinical Trials." Statistics in Medicine 10(11): 1665–77.
Yusuf, Salim, Richard Peto, John Lewis, Rory Collins, and Peter Sleight. 1985. "Beta Blockade During and After Myocardial Infarction: An Overview of the Randomized Trials." Progress in Cardiovascular Diseases 27(5): 335–71.
27
REPORTING FORMAT

MIKE CLARKE

CONTENTS
27.1 Introduction 522
27.2 Title for Research Synthesis 525
27.3 Question for Research Synthesis 526
27.3.1 Background 526
27.3.2 Objectives 527
27.3.3 Eligibility Criteria 527
27.4 Methods 528
27.4.1 Search Strategy 528
27.4.2 Application of Eligibility Criteria 529
27.4.3 Data Extraction or Collection 529
27.4.4 Statistical Methods 529
27.4.5 Changes Between Design or Protocol and Synthesis 530
27.5 Results 530
27.6 Discussion 531
27.6.1 Conclusions and Implications 531
27.6.2 Updating the Synthesis 532
27.7 Author Notes 532
27.7.1 Author Contributions 532
27.7.2 Acknowledgments 532
27.7.3 Funding 533
27.7.4 Potential Conflict of Interest 533
27.8 Abstract and Other Summaries 533
27.9 Summary Checklist 533
27.10 References 533

27.1 INTRODUCTION

The other chapters in this handbook discuss the complexities and intricacies of conceptualizing and conducting research syntheses or systematic reviews. If these pieces of research are to be of most use to readers, they need to be reported in a clear way. The authors should provide a level of detail that allows readers to assess the methods used for the review, as well as the robustness and reliability of its findings, any statistical analyses, and the conclusions. If readers have come to the review for information to help make a decision about future action, they need to be presented with the necessary knowledge clearly and unambiguously and to be able to judge the quality of the conduct of the review.

This chapter summarizes and focuses on the need for the authors to set out what they did, what they found, and what it means. This is as true of a research synthesis as it is of any article reporting a piece of research. Much of what is written here seeks to reinforce the ideas of good writing practice in research, making it specific to reporting research synthesis. It builds on the chapter in the first edition by Katherine Halvorsen (1994). Readers of this chapter who want more detailed guidance might wish to consult additional sources such as the QUOROM statement for journal reports of systematic reviews of randomized trials (Moher et al. 1999) and the Cochrane Handbook for Systematic Reviews of Interventions (Higgins and Green 2008). Furthermore, the output of the Working Group on American Psychological Association Journal Article Reporting Standards (the JARS Group), outlined in table 27.1, provides guidance on reporting research synthesis in the social sciences (see section 27.9).

The structure adopted for Cochrane Collaboration systematic reviews (www.cochrane.org, www.thecochranelibrary.com) has also been influential in this chapter. These reviews, of which there were more than 3,500 by late 2008, all follow the same basic structure, using a standard set of headings and following strong guidance on the information that should be presented within each section (Allen, Clarke, and Tharyan 2007). All Cochrane reviews are preceded by a published protocol setting out the methods for the research synthesis in advance. In several parts of this chapter, I refer to the possibility of referring readers back to the protocol as a way to provide detail on the methods of the research synthesis without adding to the length of its report. Even if a formal protocol is not published for a research synthesis, the synthesists might wish to prepare one as part of their own planning of their project and then make this available to provide full details of their methods.

Cochrane reviews assess the effects of health-care interventions, but the basic concepts can be applied to any research synthesis, regardless of whether it is of the effects of interventions or centered on health care. In considering the content of the report of a research synthesis, one of the main constraints might be the limits placed on article length and content (such as tables and figures) by journals. However, with the availability of web repositories for longer versions of articles, these constraints should be less of a problem for research synthesists.

This chapter takes the reader through the various issues to consider when preparing a report of a research synthesis. It also serves as a guide to what a reader of a review might expect to find within such a structured report. When considering the perspective of the reader of their review, synthesists also need to consider that there may be multiple audiences for their findings. Findings may have relevance to people making decisions at a regional or national level, as well as to individual practitioners and members of the public deciding about actions of direct relevance to themselves or individuals they are caring for. The review might also be a source of knowledge for other researchers planning new studies or other reviews. And, by providing a synthesis of existing research in a topic area, these reports can be of interest to people wishing to find out more about how the area of research has evolved over time. Those who have done the research synthesis should therefore decide whether to try to speak to all of these audiences in a single report, target one audience only, or prepare a series of separate, adequately cross-referenced publications.

If you are reading this chapter as a guide to reporting your own research synthesis, you should consider who you are expecting to read and be influenced by your findings (using the cited examples as a starting point) before deciding on the key issues to describe in most detail in the report. For example, a report meant for academics and researchers should be very detailed about the methods of the review and of the relevant studies. A report meant for people making decisions about interventions investigated in the review will need to present the findings of the review clearly and in enough detail for these readers to identify the elements of most importance to their decisions. A report aimed at policy makers might be shorter, providing less detail about the methods of the review, summarizing the overall results of the review, giving more consideration to the implications of the findings of the review for practice, and placing these in the appropriate context for decision makers who might simply be seeking an answer to the question of what to do based on the research that has already been done. This wide variety of audiences and the range of ways in which syntheses can be published means that there is no single recipe for the structure of a report. This chapter is intended to cover all of the key areas and recognizes that not all reports of research syntheses will need to give the same emphasis to each section.
Table 27.1 Meta-Analysis Reporting Standards (MARS)

Title
- Make it clear that the report describes a research synthesis and include "meta-analysis," if applicable
- Footnote funding source(s)

Abstract
- The problem or relation(s) under investigation
- Study eligibility criteria
- Type(s) of participants included in primary studies
- Meta-analysis methods (indicating whether a fixed or random model was used)
- Main results (including the more important effect sizes and any important moderators of these effect sizes)
- Conclusions (including limitations)
- Implications for theory, policy, and/or practice

Introduction
- Clear statement of the question or relation(s) under investigation
- Historical background
- Theoretical, policy, and/or practical issues related to the question or relation(s) of interest
- Rationale for the selection and coding of potential moderators and mediators of results
- Types of study designs used in the primary research, their strengths and weaknesses
- Types of predictor and outcome measures used, their psychometric characteristics
- Populations to which the question or relation is relevant
- Hypotheses, if any

Method

Inclusion and exclusion criteria
- Operational characteristics of independent (predictor) and dependent (outcome) variable(s)
- Eligible participant populations
- Eligible research design features (random assignment only, minimal sample size)
- Time period in which studies needed to be conducted
- Geographical and/or cultural restrictions

Moderator and mediator analyses
- Definition of all coding categories used to test moderators or mediators of the relation(s) of interest

Search strategies
- Reference and citation databases searched
- Registries (including prospective registries) searched
- Keywords used to enter databases and registries
- Search software used and version (Ovid)
- Time period in which studies needed to be conducted, if applicable
- Other efforts to retrieve all available studies:
  - Listservs queried
  - Contacts made with authors (and how authors were chosen)
  - Reference lists of reports examined
- Method of addressing reports in languages other than English
- Process for determining study eligibility:
  - Aspects of reports that were examined (i.e., title, abstract, and/or full text)
  - Number and qualifications of relevance judges
  - Indication of agreement
  - How disagreements were resolved
- Treatment of unpublished studies

Coding procedures
- Number and qualifications of coders (e.g., level of expertise in the area, training)
- Intercoder reliability or agreement
- Whether each report was coded by more than one coder and, if so, how disagreements were resolved
- Assessment of study quality:
  - If a quality scale was employed, a description of criteria and the procedures for application
  - If study design features were coded, what these were
- How missing data were handled

Statistical methods
- Effect size metric(s)
- Effect-size calculating formulas (means and SDs, use of univariate F to r transform, and so on)
- Corrections made to effect sizes (small sample bias, correction for unequal ns, and so on)
- Effect size averaging or weighting method(s)
- How effect-size confidence intervals (or standard errors) were calculated
- How effect-size credibility intervals were calculated, if used
- How studies with more than one effect size were handled
- Whether fixed or random effects models were used and the model choice justification
- How heterogeneity in effect sizes was assessed or estimated
- Means and SDs for measurement artifacts, if construct-level relationships were the focus
- Tests and any adjustments for data censoring (publication bias, selective reporting)
- Tests for statistical outliers
- Statistical power of the meta-analysis
- Statistical programs or software packages used to conduct statistical analyses

Results
- Number of citations examined for relevance
- List of citations included in the synthesis
- Number of citations relevant on many but not all inclusion criteria, excluded from the meta-analysis
- Number of exclusions for each exclusion criterion (effect size could not be calculated), with examples
- Table giving descriptive information for each included study, including effect size and sample size
- Assessment of study quality, if any
- Tables and/or graphic summaries:
  - Overall characteristics of the database (number of studies with different research designs)
  - Overall effect size estimates, including measures of uncertainty (confidence and/or credibility intervals)
- Results of moderator and mediator analyses (analyses of subsets of studies):
  - Number of studies and total sample sizes for each moderator analysis
  - Assessment of interrelations among variables used for moderator and mediator analyses
- Assessment of bias, including possible data censoring

Discussion
- Statement of major findings
- Consideration of alternative explanations for observed results:
  - Impact of data censoring
- Generalizability of conclusions:
  - Relevant populations
  - Treatment variations
  - Dependent (outcome) variables
  - Research designs, and so on
- General limitations (including assessment of the quality of studies included)
- Implications and interpretation for theory, policy, or practice
- Guidelines for future research

Source: Reprinted with permission from JARSWG 2008.
27.2 TITLE FOR RESEARCH SYNTHESIS

The starting point for any research synthesis should be a clearly formulated question. This sets out the scope for the review. Although it may be similar to the question for a new study, the synthesist needs to recognize that she is not planning a new study, in which she would be responsible for the design, recruitment, and follow-up procedures. She is simply setting the boundaries for the studies that have already been done by others. The question provides a guide to what will, and will not, be eligible once these studies are identified. Having formulated a clear question and conducted the review, the title of its report needs to capture the essence of this, so that people who would benefit from reading the review are more likely to find it and, having found it, to read it.

As with any research article, a good title needs to contain the key pieces of information about, at least, the design of the research being described. As a starting point, it needs to identify the article as a research synthesis, systematic review, or meta-analysis. These words and phrases will have different meanings to different authors and readers. If the synthesis sets out to systematically search for, appraise, and present eligible studies (without necessarily doing any statistical analyses on these studies), then perhaps research synthesis or systematic review would be most appropriate. If it contains a meta-analysis, then this term might be used on its own. However, it is worth remembering that a meta-analysis can be done by combining the results of studies without the formal process of a systematic review. For example, a research group might present a meta-analysis combining the results of a series of trials they have done, without drawing in the results of other, similar trials by others. Therefore, combining the phrases to produce "A research synthesis, including a meta-analysis, of . . ." might better capture the content of a report of a research synthesis that includes the mathematical combination of the results of eligible studies.

Having set down the study design for the research article (that is, that it is a research synthesis), the remainder of the title can usually be constructed much as those for studies that are eligible for the review. It might show all or some of the population being studied, what happened to them, the outcome of most interest, and the types of study design eligible for the review. For example, "A research synthesis of [study design] of [intervention] for [population] to assess the effect on [outcome]." Adopting this structure, embedding key descriptors of the review in the title, should also help people find the review when they search bibliographic databases or scan contents lists of journals. This is increasingly important because the explosion in research articles, syntheses, and systematic reviews means that finding these in the literature, amidst the reports of the individual studies, discussion pieces, and reviews that are not systematic, is a challenge for users of reviews, which the author of the research synthesis can help overcome when deciding on a title for the synthesis (Moher et al. 2007).

The authors of the research synthesis need to minimize the possibility that someone who should read the report is misled and skips over it or, conversely, that a reader will be misled into going further into a report not relevant to their interests. For example, a title such as "A research synthesis of randomized trials of antibiotics for preventing postoperative infections" might, at first sight, seem fairly complete. However, such a title could cover a review of the role of antibiotics compared to placebo to prevent infection (to identify whether it is better to use antibiotics), direct comparisons of different antibiotics (to identify the best one), or comparisons of antibiotics against all other forms of infection control (to find which is the best intervention to adopt). The underlying question and scope for each of these three reviews are fundamentally different, but the general title fails to capture this aspect for the reader. In such circumstances, titles such as "A research synthesis of randomized placebo controlled trials of antibiotics versus placebo for preventing postoperative infections," "A research synthesis of randomized trials comparing different antibiotics for preventing postoperative infections," and "A research synthesis of randomized trials of antibiotics versus other strategies for preventing postoperative infections" might be more appropriate.

Some authors might wish to use a declarative title, summarizing the main finding of their review, such as "Tamoxifen improves the chances of surviving breast cancer: a research synthesis of randomized trials."
Other authors might choose a more neutral title, such as "A research synthesis and meta-analysis of tamoxifen for the treatment of women with breast cancer." To some extent, the type of title might be determined by the policy of the journal or other place of publication for the review. For example, the titles of all Cochrane reviews follow a particular structure, are nondeclarative, and, because they are Cochrane reviews, do not include words or phrases such as systematic review, research synthesis, or meta-analysis.

27.3 QUESTION FOR RESEARCH SYNTHESIS

The authors of a research synthesis need a well-formulated question before embarking on their work. This is one of the hallmarks of the systematic approach, compared to something more haphazard in which biases have not been minimized. This helps guide the various steps of the research synthesis, such as where and how to search, what studies are eligible, and what the main emphasis of the findings should be. Similarly, the reader of the eventual review will benefit from a clear summary of the question to be addressed in the review. Formulating and clarifying the question both at the outset of the work on the review and at the start of the report of its findings is important.

One of the key things to do for readers within the question is distinguish those systematic reviews that set out to obtain clear answers to guide future actions from those where the intention was to describe the existing research to help design a new, definitive study. An example of the former might be "What are the effects of one-to-one teaching on the ability of teenagers to learn math?" and of the latter, "What research has been done into ways of teaching math to teenagers?" The reader should be told which of these two approaches is to be taken. This will allow the reader to decide if the review is likely to provide an answer to the questions the reader has. Someone looking for a single, numeric estimate of the effect of one-to-one teaching based on existing research might find it in the former whereas, in the latter, they are more likely to find only descriptive accounts of the different pieces of research that have been done.

A research synthesis might contain more than one question and comparison. This should also be made clear early in the report. For example, the three reviews of antibiotics for the prevention of postoperative infection described earlier might be done within a single review. Making it clear to the readers that this is the case early on in the review will help them to decide whether to continue reading it and, if they are particularly interested in one of the questions, to identify that part.

It is also important to be careful that the question for the review is a fair reflection of its content. For example, posing the question for a review as "Does healthy eating advice reduce the risk of a heart attack?" might seem reasonable but is probably not a fair guide to the answers that will come from the review. At its simplest, the available answers would be yes, no, and don't know, where no covers both no effect on the risk of a heart attack and an increased risk. It is unlikely that the review would be this simple. It is more likely that the review will seek to quantify the risk and to provide this estimate regardless of its direction. Thus, a better question might be "What is the effect of healthy eating advice on the risk of a heart attack?"

However, even with that expansion, the question still does not make the intentions for the review clear. Remembering that research syntheses are retrospective and that readers of reviews might have their own preconceptions about what studies it will contain, additional information is needed. This additional information should clarify what comparisons will be examined in the review, because it will not necessarily be a comparison of interventions. Using the healthy eating advice example, the simple question could be expanded to "What is the effect of healthy eating advice compared to smoking cessation advice on the risk of a heart attack?" or "What is the effect of healthy eating advice compared to providing a healthy diet on the risk of a heart attack?" or "What is the effect of healthy eating advice on the risk of a heart attack compared to a stroke?" The eligible studies, review methods, and findings for these questions would be quite different, and the report should minimize ambiguity by being clear in its question.

27.3.1 Background

Using the background section to set the scene for the review helps orient the reader to the topic. It is also an opportunity to convey the importance and relevance of the review. However, in the face of limitations on the length of a report, it is usually one of the sections that can easily be cut back. It might provide details on how common the circumstances or condition under investigation are, how widely the intervention is used (if the review is about the effects of an intervention), and how this intervention might work. If some of the research to be included in the review is already likely to be well known, this should
REPORTING FORMAT FOR RESEARCH SYNTHESES 527

be mentioned in the background. Some information on in the results section. For example, the original intention
the consequences and severity of the outcomes being as- might have been to include studies done anywhere in the
sessed might also be helpful. world, but all the studies identified were conducted in the
As noted, reviews are by nature retrospective and the United States. The synthesists might have wished to study
reviewers should therefore use the background section to all antibiotics and searched for all drugs, but find that
describe any influences on their reasons for doing the re- only one or two had been studied. Or the eligibility crite-
view. If they chose a particular focus, perhaps concentrat- ria for the review might have encompassed children of all
ing on one population subgroup, one type of intervention ages but researchers might have restricted their studies to
from a class of interventions, or one possible relationship, those aged between eight and twelve years.
they should explain that they have done so. This is espe- For most syntheses, the eligibility criteria can be set
cially important if readers might be surprised by the focus out under four domains:
or concerned that it arose from bias in the formulation of
Types of participants: who will have been in the eligi-
the question, such that the synthesists had narrowed their
ble studies? This might include information about
project in a particular way to reach a preformed conclu-
sex and age, past experiences, diagnosis (and per-
sion.
haps the reliability required for this diagnosis) of an
If a protocol describing the methods for the review had
underlying condition, and location.
been prepared in advance, that it was should be men-
tioned in either the background or the methods section. Types of interventions: what will have been done to, or
Likewise, if the review had been registered prospectively, by, the participants? This should be described in op-
this also should be noted. Both steps help in planning and erational terms. If the synthesis is assessing the ef-
drawing attention to the review, and also serve to mini- fects of an intervention, the authors should ensure
mize bias and the perception of bias in its conduct. that they also provide adequate information on the
comparison condition (that is, what kinds of inter-
ventions are eligible for the synthesis). If, for ex-
27.3.2 Objectives ample, the comparison is with usual care, it should
be made clear whether this is what the original re-
By setting out the question for their review, the authors searchers regarded as usual care when they did their
will have summarized the scope of their review. In de- study or whether it is what the synthesists would re-
scribing their objectives, they can elaborate on what they gard as usual care at the present time.
are hoping to achieve. For example, if the question for the
Types of outcome measures: what happened to the par-
review is to estimate the relative effects on repeat offences
ticipants? The synthesists might wish to divide these
of two policies for dealing with car thieves, the objectives
into primary and secondary outcome measures, to
may be to identify which of the policies would be the
highlight those outcomes (that is, the primary out-
most appropriate in a particular city.
comes), which they judge to be especially important
in assessing whether or not an intervention is effec-
27.3.3 Eligibility Criteria tive. They should also distinguish between outcomes
or constructs (for example, anxiety) and measures of
As with any scientific research in which information is to these (for example, a particular scale to measure
be collected from past events, a research synthesis needs anxiety). It may also be important to state the timing
to describe how it did this and provide enough informa- of the measurement of an outcome. For example, if
tion to reassure readers that biases that might have arisen postoperative pain is to be an outcome, what time
from prior knowledge of the existing studies were mini- points are relevant to the synthesis? In a research
mized. In the context of a review, the past events or obser- synthesis, the outcomes that are available for the re-
vations are the previous studies and the eligibility criteria port will depend on whether they were measured in
for the review should describe those features that deter- the original studies. This means that the synthesists
mined whether studies would be included or excluded. might not be able to present data on their primary
In the report of the research synthesis, the authors outcomes but they should still make these clear in
should describe what they sought. This is not necessarily their reportthey should not redefine their primary
the same as what they found, which should be described outcomes on the basis of the availability of data.
528 REPORTING THE RESULTS

Types of studies: what research designs will be eligi- 27.4.1 Search Strategy
ble? This might also include a discussion of the
strategies the synthesists will use to assess study The studies in a research synthesis can be considered its
quality, including any characteristics of the studies, raw material or building blocks, and the first step in bring-
beyond their design, which might influence their de- ing these together is the search strategy used to find them.
cision on whether to include it. They should describe Including the search strategy in the report of the synthesis
how they will obtain the information needed to make is important for transparency but also to allow others to
these judgments, because it will not necessarily be reproduce the synthesis. It is unlikely that readers will be
available in the studys published report. For exam- able to replicate fully the search done by the synthesists
ple, if the eligibility criteria require that only ran- because of additions and other changes to any biblio-
domized trials in which allocation concealment was graphic databases in the intervening period. However, the
adequate are included, the synthesists should outline search strategy should be described in enough detail to
what they will do if this information is not readily allow it to be run again. This will make it easier for some-
available, bearing in mind that research has shown one who wishes to update the synthesis to do so by using
that the lack of this information in a report does not the same search strategy to find studies that will have be-
necessarily indicate its absence from the study come available since the original search. It will also be
(Soares et al. 2004). helpful to people who are planning a research synthesis
of a similar topic, because sections of the search strategy
Under these four domains, the authors might use a might be applicable to such a synthesis.
mixture of inclusion and exclusion criteria to describe In describing their search strategy, authors should list
what would definitely be eligible for their synthesis, or the terms used, how these were combined, whether any
what would definitely be ineligible. For example, they limitations were used, where they searched, when, and
might include only randomized trials and exclude studies who did the searching. If journals, conference proceed-
in which the outcomes were not measured for a minimum ings, or other sources apart from electronic databases
period after the intervention. Authors should avoid listing were searched, these should be listed, along with infor-
mirror-image inclusion and exclusion criteria. Little is mation on any searches for unpublished material. Details
gained, for example, from writing both that only prospec- should be given of how many people did the searches,
tive cohort studies are eligible and that retrospective case whether they worked independently, and the procedures
control studies are ineligible. for resolving any disagreements. This information will
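Although the chapter discusses eligibility criteria only in prose, nothing prevents synthesists from also recording the four domains in a structured, machine-readable form, so that the protocol, the report, and any appendices quote identical criteria. The following Python sketch is one hypothetical way of doing this; the class, field names, and example values are our illustration, not part of the reporting guidance itself.

```python
# A minimal, hypothetical record of eligibility criteria under the
# four domains discussed above, plus explicit exclusions.
from dataclasses import dataclass

@dataclass
class EligibilityCriteria:
    participants: list
    interventions: list
    outcomes: list
    study_designs: list
    exclusions: list

criteria = EligibilityCriteria(
    participants=["children aged eight to twelve years"],
    interventions=["antibiotic prophylaxis vs. placebo or no treatment"],
    outcomes=["postoperative infection within 30 days (primary)"],
    study_designs=["randomized trials with adequate allocation concealment"],
    exclusions=["outcomes measured less than 24 hours after surgery"],
)

print(criteria.study_designs[0])
```

Keeping the criteria in one place like this makes it harder for the protocol and the report to drift apart.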
27.4 METHODS

The methods section of the synthesis needs to describe what was done, but it should also provide information on what the synthesists had intended to do but were not able to do because of the studies they found. If a protocol for their synthesis is available, the authors might refer to that, rather than repeating material, especially if the protocol includes plans that could not be implemented in the synthesis (for example, because studies of a certain type were not identified or there were inadequate data to perform subgroup analyses). The methods section should cover four topics: search strategy, application of eligibility criteria, data extraction or collection, and statistical methods. These are discussed in the following sections. Furthermore, any changes to the methods between the original plans for the synthesis and the final synthesis should be noted and explained.

27.4.1 Search Strategy

The studies in a research synthesis can be considered its raw material or building blocks, and the first step in bringing these together is the search strategy used to find them. Including the search strategy in the report of the synthesis is important for transparency but also to allow others to reproduce the synthesis. It is unlikely that readers will be able to replicate fully the search done by the synthesists because of additions and other changes to any bibliographic databases in the intervening period. However, the search strategy should be described in enough detail to allow it to be run again. This will make it easier for someone who wishes to update the synthesis to do so by using the same search strategy to find studies that will have become available since the original search. It will also be helpful to people who are planning a research synthesis of a similar topic, because sections of the search strategy might be applicable to such a synthesis.

In describing their search strategy, authors should list the terms used, how these were combined, whether any limitations were used, where they searched, when, and who did the searching. If journals, conference proceedings, or other sources apart from electronic databases were searched, these should be listed, along with information on any searches for unpublished material. Details should be given of how many people did the searches, whether they worked independently, and the procedures for resolving any disagreements. This information will help the reader assess the thoroughness of the searching for the synthesis but, given its technical nature and the potential for the details to be long and unwieldy, authors might choose to include simply a short list of which databases were searched and when the searches were done.

Whether to include in full the actual search strategies used for bibliographic databases (which might run to hundreds of lines covering both free-text searches and the use of index terms) will depend in large part on space limitations and the target audience for the synthesis. A reader who is most interested in the implications for practice of a research synthesis will be much less interested in the details of the search strategy than someone who wishes to use parts of the search strategy for a synthesis of a related topic. Consideration should also be given to whether the typical reader of the synthesis will wish to be confronted by many pages of search terms. The authors might choose to provide the set of terms used for only one database within their report, and note how modifications were made for other databases to reflect different indexing strategies. The full search strategies might be given in appendices, on a website, or made available to readers on request. If the number of records retrieved with each term is included in these lists, future synthesists can consider the usefulness of the individual terms when designing a search for a synthesis covering a similar topic.
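To give a sense of the level of detail this implies, a search could be documented as a simple structured log recording terms, combinations, limits, dates, searchers, and the records retrieved per line. This sketch is purely illustrative; the database name, the Ovid-style terms, and the counts are all invented.

```python
# Hypothetical log of one database search. Keeping the per-line record
# counts lets future synthesists judge which terms earned their keep.
search_log = {
    "database": "MEDLINE (Ovid)",
    "searched_on": "2008-06-30",
    "searched_by": ["reviewer 1", "reviewer 2 (independently)"],
    "limits": ["humans", "publication years 1990-2008"],
    "lines": [
        (1, "exp Myocardial Infarction/", 98234),
        (2, "heart attack.tw.", 11342),
        (3, "healthy eating advice.tw.", 187),
        (4, "(1 or 2) and 3", 42),
    ],
}

# Print the strategy in the numbered-line style often used in appendices.
for number, term, records in search_log["lines"]:
    print(f"{number:>2}. {term:<30} {records:>7}")
```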
27.4.2 Application of Eligibility Criteria

Information should be given on whether one or more people applied the eligibility criteria, whether they worked independently to begin with, and the procedures for resolving any disagreements between them. Some indication of whether people working independently tended to agree or disagree on their decisions will help readers judge the ease or difficulty of applying the eligibility criteria, and the possibility that another team of authors might have reached different decisions on what belongs within the synthesis. If bias might have been introduced because of prior knowledge of potentially eligible studies, or because of knowledge gathered while searching or applying eligibility criteria, any procedures to minimize this bias should be described. For example, if the reports of the studies were modified to prevent the assessors from identifying the original researchers, their institution, or the source of the article, this should be described. Such masking might be achieved by selective printing of pages of the original reports, by blacking out key information, or by cutting text out and making photocopies. These efforts might be particularly time consuming, and their value in minimizing bias has not been demonstrated (Berlin 1997). If the quality of the studies was assessed as part of the application of the eligibility criteria, such procedures should also be described.

27.4.3 Data Extraction or Collection

As with the application of the eligibility criteria, the report of the research synthesis should include information on the number of people who did the data extraction, whether they worked independently, in duplicate, or as a team, and the procedures for resolving any disagreements. Including data on the initial agreement rates between the extractors may help readers assess the difficulty of the data extraction process. This process relates to extracting information from the reports of the potentially eligible studies for use in the research synthesis. These data are not simply the numeric results of the study but also include descriptive information about the design and interpretations of the information in the report (for example, to determine if the study is eligible or of adequate quality).
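Raw percent agreement can flatter the extractors when one decision dominates, so chance-corrected statistics such as Cohen's kappa are a common way to summarize initial agreement. A self-contained sketch follows; the formula is standard, but the two extractors and their decisions are invented for illustration.

```python
# Percent agreement and Cohen's kappa for two extractors making the
# same binary judgment (e.g., "was allocation concealment adequate?")
# across ten study reports. All decisions are invented.
from collections import Counter

extractor_1 = ["yes", "yes", "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]
extractor_2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes"]

n = len(extractor_1)
observed = sum(a == b for a, b in zip(extractor_1, extractor_2)) / n

# Chance agreement expected from each extractor's marginal proportions.
counts_1, counts_2 = Counter(extractor_1), Counter(extractor_2)
expected = sum(
    (counts_1[cat] / n) * (counts_2[cat] / n)
    for cat in counts_1.keys() | counts_2.keys()
)

kappa = (observed - expected) / (1 - expected)
print(f"observed agreement = {observed:.2f}, kappa = {kappa:.2f}")
```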
If some of the eligible studies were reported more than once, the decision rules used to determine whether a single report or a combination of information from more than one report was used should be stated. Furthermore, where a study had not reported enough of the information needed for the synthesis, this section should describe any steps taken to obtain unpublished data. This will be a challenge for most research syntheses, and not all will be able to obtain the missing data (Clarke and Stewart 1994). The unpublished data might be for a study as a whole or for parts of a study (such as unreported subgroups or outcome measures). If attempts to obtain unpublished information failed, the synthesists' opinions on why they failed might be helpful. For example, do they believe it was because of too few contact details, too little time, not enough necessary data, or unwillingness to provide the data?

27.4.4 Statistical Methods

The methods actually used should be described, along with any methods that were planned but could not be implemented because of limitations in the data available, unless the reader can be pointed to the description of these in an earlier protocol. Any nonstandard statistical methods should be described in detail, as should the synthesists' rationale for any choices made between alternative standard methods. Because of the risk of bias arising from the conduct of post hoc subgroup or sensitivity analyses, the authors should clearly indicate whether their analyses were predefined, perhaps by reference to the protocol for the synthesis if one is available. However, it is worth remembering that even subgroup analyses that were a priori for the research synthesis might have been influenced by existing knowledge of the findings of the included studies. This reinforces the importance of specifying the rationale for these analyses in the report and, ideally, showing how they are based on evidence that is independent of the studies being included.

Sometimes, synthesists will wish to do subgroup or sensitivity analyses that were not planned but arose from the data encountered during the synthesis. These analyses need to be interpreted with added caution, and the authors should clearly identify any such analyses to the reader.

If the synthesists decide that a quantitative summary of the results of the included studies is not warranted, they should still justify any statistical methods used to analyze the studies at the individual level. If their decision not to do a quantitative synthesis is based on too much heterogeneity among the studies, they should define clearly what they mean by too much and how this was determined: for example, whether it was done without any consideration of the results of the studies, on the basis of an inspection of the confidence intervals for each study, or with a statistical test such as I² (Higgins et al. 2003). This explanation should distinguish between design heterogeneity (for example, because the interventions tested in each study are very different, the populations vary too greatly, or the outcome measures are not combinable) and statistical heterogeneity (that is, variation in the results of the studies).
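The statistical side of that distinction is straightforward to compute. Below is a minimal sketch of Cochran's Q and the I² statistic of Julian Higgins and his colleagues (2003), using per-study effect estimates, their sampling variances, and inverse-variance weights; the five studies are invented.

```python
# Cochran's Q and I² from per-study effect sizes and sampling
# variances. I² estimates the percentage of variation across studies
# that is beyond what chance alone would produce. Values invented.
effects = [0.30, 0.10, 0.45, 0.25, 0.60]
variances = [0.02, 0.03, 0.05, 0.01, 0.04]

weights = [1.0 / v for v in variances]
pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)

q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
df = len(effects) - 1
i_squared = max(0.0, (q - df) / q) * 100.0

print(f"Q = {q:.2f} on {df} df, I-squared = {i_squared:.0f}%")
```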
27.4.5 Changes Between Design or Protocol and Synthesis

Whether or not the proposed methods for the synthesis were written in a formal protocol, the authors should report on any changes they made to their plans after they started work on the synthesis. These changes might relate to the eligibility criteria for the studies or to aspects of the methods used. The reasons for the changes should be given and, if necessary, discussed in the context of whether the changes might have introduced bias to the synthesis. If sensitivity analyses are possible to examine directly the impact of the changes, these analyses should be described either in this section or under statistical methods (see chapter 22, this volume).

27.5 RESULTS

The raw material for the synthesis is the studies that have been identified. The results section should describe what was found for the synthesis, separating studies judged eligible, ineligible, and eligible but ongoing (with results not available). Studies identified but not classified as definitely eligible or ineligible should also be described.

In describing the studies, the minimum information to provide in the report would, of course, be a source of additional information for each study. This might be a reference to a publication or contact details for the original researchers. However, this minimum will make it difficult for readers to understand the similarities and differences between studies. A description of each included study is preferable. This is especially true if the synthesis draws conclusions that depend on characteristics of the included studies, so that the reader can determine whether they agree or disagree with the decisions the synthesists made. If the results for each individual study are included in the report, readers will be able to perform quantitative syntheses for themselves, perhaps reconfiguring the groupings of studies the synthesists used. The description of each study could be provided in tabular form to make it easier for readers to compare and contrast designs. It should include details within the four domains noted for setting the eligibility criteria for the review, as well as details of the number of people in each study. It is not usually necessary to include the conclusions of each study.

Any studies judged eligible but excluded, or studies readers might have expected to have been included that were excluded because the synthesists found them ineligible, should be reported and the reason for the exclusion given. If well-known studies excluded from the synthesis are not mentioned, some readers might assume that the apparent failure of the synthesists to find them casts doubt on the credibility of the synthesis. In some reports of research syntheses, the excluded studies might be mentioned at the same time as the eligibility criteria for the synthesis, particularly if there are specific reasons relating to these criteria that led to the exclusion. Otherwise, they should be discussed within the results section.

It is most important that the authors devote adequate space to reporting the primary and secondary analyses they set out in their methods section, including those with nonsignificant results. This also relates to any a priori subgroup or sensitivity analyses. One challenge synthesists face is overcoming the publication biases of the original researchers. Synthesists should not perpetuate such bias by selectively reporting outcome measures on the basis of the results they obtain during the synthesis. Synthesists should report all prespecified outcome measures or make this information available in a report supplement. If doing so is not possible, the synthesists should state explicitly that it is not possible.
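A simple way to honor that commitment is to check the report against the protocol mechanically, flagging any prespecified outcome that the synthesis does not report. A trivial sketch, with invented outcome names:

```python
# Flag prespecified outcomes that the synthesis does not report,
# so they can be reported as missing rather than silently dropped.
prespecified = {"infection within 30 days", "pain at 24 hours", "length of stay"}
reported = {"infection within 30 days", "length of stay"}

for outcome in sorted(prespecified - reported):
    print(f"prespecified but not reported: {outcome}")
```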
Where quantitative syntheses are reported or shown, the meta-analysts should be explicit about the amount of data in each of these combined analyses. This relates to the number of studies, the number of participants, and the numbers of events in each analysis, and the relative contribution of each study to the combined analysis. This might be done by presenting the separate results of each included study, listing the studies and noting how many participants and events each study contributes to each analysis, and giving the confidence intervals for the effect estimates. This is particularly important for meta-analyses in which it is not possible to include all eligible studies in all analyses. For example, if the research synthesis includes ten studies with 2,000 participants, the reader should not be misled into believing that all of these contribute to every analysis if they do not, when some analyses might only be able to draw on a fraction of the 2,000. Being specific about which studies are in each meta-analysis presented in the report also helps readers make their own judgments about the applicability of the results by consulting the details provided for these studies.
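One hypothetical way to make each study's contribution explicit is to print, for every combined analysis, the per-study sample sizes and relative weights alongside the pooled estimate and its confidence interval. The sketch below uses a standard fixed-effect, inverse-variance computation; the three studies and all numbers are invented.

```python
# Per-study contributions to one combined analysis, plus the pooled
# fixed-effect estimate and its 95% confidence interval. Data invented.
import math

studies = [  # (label, effect size, sampling variance, participants)
    ("Study A, 1998", 0.42, 0.040, 120),
    ("Study B, 2001", 0.11, 0.010, 450),
    ("Study C, 2004", 0.30, 0.025, 200),
]

weights = [1.0 / variance for _, _, variance, _ in studies]
total = sum(weights)
pooled = sum(w * effect for w, (_, effect, _, _) in zip(weights, studies)) / total
se = math.sqrt(1.0 / total)  # standard error of the pooled estimate

for w, (label, effect, _, n) in zip(weights, studies):
    print(f"{label}: effect = {effect:+.2f}, n = {n}, "
          f"weight = {100.0 * w / total:.0f}%")
print(f"pooled effect = {pooled:.2f}, "
      f"95% CI ({pooled - 1.96 * se:.2f} to {pooled + 1.96 * se:.2f})")
```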
Synthesists should give careful consideration to the order in which they present the included studies within their analyses (see chapter 26, this volume). There is no single correct approach, but some of the options are presenting studies in chronological, alphabetic, or size order. Similarly, the synthesists should give careful consideration to the order in which they present their analyses. Options here include the order of the importance of the outcomes, statistical power, size of the effect estimate, or likely impact of the result on practice.

27.6 DISCUSSION

As with any research article, synthesists should be cautious about introducing evidence into the discussion section that was not presented earlier in the report. They should avoid using the discussion to provide answers to the question that they cannot support with evidence from the studies they found, but they may need to contextualize the findings of the included studies with additional information from elsewhere. Two examples are discussing the difficulties of implementing an intervention outside a research setting, and contrasting the strength of a relationship identified within the populations in the included studies with its likely strength in other populations that were eligible for the research synthesis but did not feature in any of the included studies.

Any shortcomings of the synthesis, such as the impact of limitations within the search strategy, should be mentioned. These might include restrictions to specific languages or a failure to search the grey literature, which might make the synthesis particularly susceptible to publication bias (Hopewell et al. 2007). Other limitations that should be discussed include difficulties in obtaining all of the relevant data from the eligible studies and the potential impact of not obtaining unpublished data. The findings of sensitivity analyses should be discussed, in particular if these indicate that the results of the synthesis would have been different had different methods been used.

If eligible studies were identified but not included, for example, because they were ongoing, the potential for these to change the results in any important ways should be discussed.

27.6.1 Conclusions and Implications

As in the discussion, synthesists should try to avoid introducing new evidence in the conclusion to answer the question for their synthesis. In some circumstances, it will be necessary to draw on evidence that does not come from the studies and analyses presented in the results section. In these circumstances, such evidence should ideally be included in the background or discussion first. Any such evidence should be widely accepted and robust. For example, if a synthesis finds that a particular analgesic is beneficial for treating a mild headache but no evidence was found in the included studies that directly compared different doses of the drug, it would be legitimate to give information on the standard doses available. Similarly, if a drug is known to be hazardous for people with kidney problems, the synthesists might include the caution in their conclusions.

The authors should be explicit about any values or other judgments they use in making their recommendations. If they conclude that one intervention is better than another, they should be explicit about whether this arises because of their personal balancing of opposing effects on particular outcomes. They might decide, for example, that a slight increase in an adverse outcome is more than offset by the benefits on the primary outcome. Readers might not share this judgment, however. In the context of a synthesis in which different studies might contribute to its different outcomes, the authors should be clear about how a lack of data for some of the outcomes might have influenced their judgments. For example, the review might contain reliable and extensive data in favor of an intervention, but the adverse outcomes might not have been investigated or reported in the same depth in the included studies.

The conclusions for the synthesis can be organized as two separate sections, setting out the authors' opinions on the implications for practice and the implications for research. The former should include any limitations to the conclusions if, for example, the general implications would not apply for some people. The implications for research should include some comment on whether, in the presence of the called-for research, the research synthesis will be updated. The implications should be presented in a structured way, to help the reader understand the recommendations and to minimize ambiguity. This is particularly true of the implications for research. Synthesists should avoid simplistic statements such as "better research is needed." They need to specify what research is needed, perhaps by following the same structure of interventions, participants, outcome measures, and study designs used in the eligibility criteria (Brown et al. 2006; Clarke, Clarke, and Clarke 2007).

Simple statements such as "We are unable to recommend" should also be avoided. They are unlikely to be helpful, because of their inherent ambiguity, and might arise for a variety of reasons that are less likely in reports of individual studies. For example, the synthesis might reveal a lack of adequate good-quality evidence to make any recommendation. A statement such as "There is not enough evidence to make any recommendation about the use of . . ." would be preferable. Another possibility is that the authors are not allowed to make any recommendations, regardless of the results, perhaps because of rules at the organization they work for. In such circumstances, the statement might best be phrased, "We are not permitted to make recommendations." Alternatively, and akin to what might happen in an individual study, the authors might feel unable to recommend a particular course of action because their synthesis has shown it to be ineffective or harmful. If this is the case, it would be preferable for the authors to be explicit and write something such as "We recommend that the intervention not be used."

In a complex or large synthesis of the effects of interventions, one way to summarize the conclusions is to group the implications on the basis of which interventions have been shown to be beneficial, harmful, lacking in evidence, needing more research, and probably not warranting more research. If the synthesis is not about interventions but instead covers, for example, the relationship between characteristics of people and their behavior, a similar categorization of the conclusions could be used, but on a different basis: confidence about a positive or negative relationship, a relationship that is unproven and worthy of more research, and one that is unproven but does not justify more research.

Synthesists should avoid broad statements that there is no evidence for an effect or a relationship if they actually mean that there is no evidence from the types of study they have synthesized. This is important if other types of evidence would have shown an effect or relationship. As an example, there might be no randomized trials for a particular treatment despite its effectiveness having been shown by other types of study. Similarly, synthesists should avoid confusing no evidence of an effect or relationship with evidence of no effect or relationship. The former can arise from a paucity of data in the included studies, and the latter usually requires significant data to rule out a small effect.

27.6.2 Updating the Research Synthesis

The first time that the synthesis is done, this section might include details about the synthesists' plans to update it (if an update is being considered). This could include any decision rules to initiate the update, such as the passage of a certain amount of time, the availability of a new study, or the availability of enough new evidence to possibly change the synthesis findings. After a synthesis has been updated, this section could also be used to highlight how the update had changed from earlier versions, and any plans for future updates. This section is also an opportunity for the synthesists to state whether they wish to be informed of eligible studies they have not identified.

27.7 AUTHOR NOTES

27.7.1 Author Contributions

In the interests of transparency, the work done by each of the authors should be described. This should cover both the preparation of any preceding protocol and the report of the synthesis. It also needs to cover the various tasks necessary to undertake the synthesis. These include issues relating to the design of the research synthesis, such as the initial formulation of the question and the selection of the eligibility criteria, as well as the conduct of the synthesis. Within the conduct of the synthesis, tasks such as the design of the search strategy, the identification of studies and the assessment of their eligibility, the extraction of data, its analysis, and the drawing of conclusions might be undertaken by different people within the author team. All authors should agree on the content of the final report, but one or more should be identified as its guarantors, because some authors might not feel able to take full responsibility for technical aspects that are outside of their skill base. As an example, an information specialist might feel competent to take responsibility for the searching and a statistician for the meta-analyses, but not the reverse.

27.7.2 Acknowledgments

Some people may have worked on the synthesis, or may have provided information for it, but have not made a substantial enough contribution to be listed as an author. Such individuals should be acknowledged and, if named, should give their permission to be named. In a research synthesis, this might include the original researchers who responded to questions about their studies or provided additional data, or people within the research synthesis team who worked on one specific aspect of the project, such as the design and conduct of electronic searches, but did not otherwise contribute.

27.7.3 Funding

Sources of funding for the synthesis should be listed. If any conditions were associated with this funding that might have influenced the conduct of the synthesis, its findings, or its conclusions, these should be stated so as to inform readers of the potential impact of the funding. The sources of funding might be separated into those provided explicitly for the preparation of the synthesis; those supporting a synthesist's work, which includes the time needed to do research; and indirect funding that supported editorial or other input.

27.7.4 Potential Conflict of Interest

The authors should describe any conflicts of interest that might have influenced the conduct, findings, conclusions, or publication of the synthesis. It is important to discuss any potential conflicts of interest that readers might perceive to have influenced the synthesis, to provide reassurance that this is not the case. In a research synthesis, this might include the involvement of the authors in any of the studies considered for the synthesis, because their special knowledge of these studies and access to the data is likely to be different to what was available for the other studies.

27.8 ABSTRACT AND OTHER SUMMARIES

When a synthesis is published, its abstract is likely to appear at the beginning. But, as with most articles, the abstract is likely to be one of the last things to be written. Hence the position of this guidance near the end of this chapter. In preparing any summary of a piece of research, the authors need to give careful consideration to the audience and whether they are likely to read (or even have access to) the full synthesis. As noted earlier, many different audiences for a research synthesis are possible: people making decisions relating directly to themselves or other individuals, people making policy decisions at a regional or national level, and people planning new research. The abstract can be used to highlight whether the different potential audiences will find things of relevance within the full report. Alternatively, the authors might need to prepare more than one abstract or summary to cater to the different audiences. Ideally, the abstract should be structured to draw attention to the inclusion criteria for the review, the scope of the searching, the quantity of research evidence included, the results of any meta-analyses of the primary outcomes, and the main implications.

27.9 SUMMARY CHECKLIST

The Working Group on American Psychological Association Journal Article Reporting Standards (the JARS Group) arose out of a request from the APA Publications and Communications Board to examine reporting standards for both individual research studies and meta-analyses published in APA journals. For meta-analysis, the JARS Group began by considering several efforts to establish reporting standards for meta-analysis in health care. These included QUOROM, or Quality of Reporting of Meta-analysis (Moher et al. 1999), and the Potsdam consultation on meta-analysis (Cook, Sackett, and Spitzer 1995). The JARS Group developed a combined list of elements from these, rewrote some items with social scientists in mind, and added a few suggestions of its own. After vetting their effort, the JARS Group arrived at the list of recommendations contained in table 1 (Journal Article Reporting Standards Working Group 2008).

27.10 REFERENCES

Allen, Claire, Mike Clarke, and Prathap Tharyan. 2007. "International Activity in the Cochrane Collaboration with Particular Reference to India." National Medical Journal of India 20(5): 250–55.

Berlin, Jesse A. 1997. "Does Blinding of Readers Affect the Results of Meta-Analyses? University of Pennsylvania Meta-Analysis Blinding Study Group." Lancet 350(9072): 185–86.

Brown, Paula, Karla Brunnhuber, Kalipso Chalkidou, Iain Chalmers, Mike Clarke, Mark Fenton, Carol Forbes, Julie Glanville, Nicholas J. Hicks, Janet Moody, Sarah Twaddle, Hazim Timimi, and Paula Young. 2006. "How to Formulate Research Recommendations." British Medical Journal 333(7572): 804–6.

Clarke, Lorcan, Mike Clarke, and Thomas Clarke. 2007. "How Useful Are Cochrane Reviews in Identifying Research Needs?" Journal of Health Services Research and Policy 12(2): 101–3.

Clarke, Mike J., and Lesley A. Stewart. 1994. "Obtaining Data from Randomised Controlled Trials: How Much Do We Need to Perform Reliable and Informative Meta-Analyses?" British Medical Journal 309(6960): 1007–10.

Cook, Deborah J., David L. Sackett, and Walter O. Spitzer. 1995. "Methodologic Guidelines for Systematic Reviews of Randomized Control Trials in Health Care from the Potsdam Consultation on Meta-Analysis." Journal of Clinical Epidemiology 48(1): 167–71.

Halvorsen, Katherine T. 1994. "The Reporting Format." In The Handbook of Research Synthesis, edited by Harris Cooper and Larry V. Hedges. New York: Russell Sage Foundation.

Higgins, Julian P.T., and Sally Green, eds. 2008. Cochrane Handbook for Systematic Reviews of Interventions, vers. 5. Chichester: John Wiley & Sons. http://www.cochrane.org/resources/handbook/hbook.htm

Higgins, Julian P.T., Simon G. Thompson, Jonathan J. Deeks, and Douglas G. Altman. 2003. "Measuring Inconsistency in Meta-Analyses." British Medical Journal 327(7414): 557–60.

Hopewell, Sally, Steve McDonald, Mike Clarke, and Matthias Egger. 2007. "Grey Literature in Meta-Analyses of Randomized Trials of Health Care Interventions." Cochrane Database of Systematic Reviews 2: MR000010.

Journal Article Reporting Standards Working Group [JARSWG]. 2008. Reporting Standards for Research in Psychology: Why Do We Need Them? What Might They Be? Washington, D.C.: American Psychological Association.

Moher, David, Deborah J. Cook, Simon Eastwood, Ingram Olkin, Drummond Rennie, and David F. Stroup. 1999. "Improving the Quality of Reports of Meta-Analyses of Randomised Controlled Trials: The QUOROM Statement." Lancet 354(9193): 1896–1900.

Moher, David, Jennifer Tetzlaff, Andrea C. Tricco, Margaret Sampson, and Douglas G. Altman. 2007. "Epidemiology and Reporting Characteristics of Systematic Reviews." PLoS Medicine 4(3): e78.

Soares, Helios P., Stephen Daniels, Ambuj Kumar, Mike Clarke, Charles Scott, Sandra Swann, and Benjamin Djulbegovic. 2004. "Bad Reporting Does Not Mean Bad Methods for Randomized Trials: Observational Study of Randomised Controlled Trials Performed by the Radiation Therapy Oncology Group." British Medical Journal 328(7430): 22–25.
28
THREATS TO THE VALIDITY OF GENERALIZED INFERENCES

GEORG E. MATT
San Diego State University

THOMAS D. COOK
Northwestern University

CONTENTS

28.1 Introduction 538
28.1.1 Why We Conduct Research Syntheses 539
28.1.2 Validity Threats 540
28.1.3 Generalized Inferences 540
28.1.3.1 Demonstrating Proximal Similarity 541
28.1.3.2 Exploring Heterogeneous and Substantively Irrelevant Third Variables 541
28.1.3.3 Probing Discriminant Validity 541
28.1.3.4 Studying Empirical Interpolation and Extrapolation 541
28.1.3.5 Building on Causal Explanations 542
28.2 Threats to Inferences About the Existence of an Association Between Treatment and Outcome Classes 543
28.2.1 Unreliability in Primary Studies 543
28.2.2 Restriction of Range in Primary Studies 543
28.2.3 Missing Effect Sizes in Primary Studies 544
28.2.4 Unreliability of Codings in Meta-Analyses 544
28.2.5 Capitalizing on Chance in Meta-Analyses 545
28.2.6 Biased Effect-Size Sampling 545
28.2.7 Publication Bias 545
28.2.8 Bias in Computing Effect Sizes 546
28.2.9 Lack of Statistical Independence 546
28.2.10 Failure to Weight Effect Sizes Proportionally to Their Precision 547
28.2.11 Underjustified Use of Fixed or Random Effects Models 547
28.2.12 Lack of Statistical Power for Detecting an Association 548
28.3 Threats to Inferences About the Causal Nature of an Association Between Treatment and Outcome Classes 548
28.3.1 Absence of Studies with Successful Random Assignment 549
28.3.2 Primary Study Attrition 549
28.4 Threats to Generalized Inferences 549
28.4.1 Sampling Biases Associated with Persons, Treatments, Outcomes, Settings, and Times 549
28.4.2 Underrepresentation of Prototypical Attributes 550
28.4.3 Restricted Heterogeneity of Substantively Irrelevant Third Variables 550
28.4.4 Mono-Operation Bias 551
28.4.5 Mono-Method Bias 551
28.4.6 Rater Drift 551
28.4.7 Reactivity Effects 551
28.4.8 Restricted Heterogeneity in Universes of Persons, Treatments, Outcomes, Settings, and Times 552
28.4.9 Moderator Variable Confounding 552
28.4.10 Failure to Test for Homogeneity of Effect Sizes 553
28.4.11 Lack of Statistical Power for Homogeneity Tests 554
28.4.12 Lack of Statistical Power for Studying Disaggregated Groups 554
28.4.13 Misspecification of Causal Mediating Relationships 554
28.4.14 Misspecification of Models for Extrapolation 555
28.5 Conclusions 555
28.6 References 557

28.1 INTRODUCTION

This chapter provides a nonstatistical way of summarizing many of the main points in the preceding chapters. In particular, it takes the major assumptions outlined and translates them from formal statistical notation into ordinary English. The emphasis is on expressing specific violations of formal meta-analytic assumptions as concretely labeled threats to valid inference. This explicitly integrates statistical approaches to meta-analysis with a falsificationist framework that stresses how secure knowledge depends on ruling out alternative interpretations. Thus, we aim to refocus readers' attention on the major rationales for research synthesis and the kinds of knowledge meta-analysts seek to achieve.

The special promise of meta-analysis is to foster empirical knowledge about general associations, especially causal ones, that is more secure than what other methods typically warrant. In our view, no rationale for meta-analysis is more important than its ability to identify the realm of application of a knowledge claim, that is, identifying whether the association holds with specific populations of persons, settings, times, and ways of varying the cause or measuring the effect; holds across different populations of people, settings, times, and ways of operationalizing a cause and effect; and can even be extrapolated to other populations of people, settings, times, causes, and effects than those studied to date. These are all generalization tasks that researchers face, perhaps no one more explicitly than the meta-analyst.

It is easy to justify why we translate violated statistical assumptions into threats to validity, particularly threats to the validity of conclusions regarding the generality of an association. The past twenty-five years of meta-analytic practice have amply demonstrated that primary studies rarely present a census or even a random sample of the populations, universes, categories, classes, or entities (terms we use interchangeably) about which generalizations are sought. The salient exception is when random sampling occurs from some clearly designated universe, a procedure that does warrant valid generalization to the population from which the sample was drawn, usually a human population in the social sciences. But most surveys take place in decidedly restricted settings (a living room, for instance) and at a single time, and the relevant cause and effect constructs are measured without randomly selecting items. Moreover, many people are not interested in the population a particular random sample represents, but ask instead whether that same association holds with a different kind of person, in a different setting, at a different time, or with a different cause or effect. These questions concern generalization as extrapolation rather than representation (Cook 1990). How can we extrapolate from studied populations to populations with many, few, or even no overlapping attributes?

The sad reality is that the general inferences meta-analysis seeks to provide cannot depend on formal sampling theory alone. Other warrants are also needed. This chapter assumes that ruling out threats to validity can serve as one such warrant. Doing so is not as simple or as elegant as sampling with known probability from a well-designated universe, but it is more flexible and has been used with success to justify how manipulations or measures are chosen to represent cause and effect constructs (that is, construct validity). If meta-analysis is to deal with generalization understood as both representation and extrapolation, we need ways of using a particular database to justify reasonable conclusions about what the available samples represent and how they can be used to extrapolate to other kinds of persons, settings, times, causes, and effects.

This chapter is not the first to propose that a framework of validity threats allows us to probe the validity of research inferences when a fundamental statistical assumption has been violated. Donald Campbell introduced his internal validity threats for instances when primary studies lack random assignment, creating quasi-experimental design as a legitimate extension of the thinking R. A. Fisher had begun (Campbell 1957; Campbell and Stanley 1963). Similarly, this chapter seeks to identify threats to valid inferences about generalization that arise in meta-analyses, particularly those that follow from infrequent random sampling. Of course, Donald Campbell and Julian Stanley also had a list of threats to external validity, and these also have to do with generalization (1963). But their list was far from complete and was developed more with primary studies in mind than with research syntheses. The question this chapter asks is how one can proceed to justify claims about the generality of an association when the within-study selection of persons, settings, times, and measures is almost never random and when it is also not even reasonable to assume that the available sample of studies is itself unbiased. This chapter proposes a threats-to-validity approach rooted in a theory of construct validity as one way to throw provisional light on how to justify general inferences.

28.1.1 Why We Conduct Research Syntheses

At the core of every research synthesis is an association about which we want to learn something that cannot be gleaned from a single existing study. These associations can be of many kinds, such as the relationship between a risk factor (for example, secondhand smoke exposure) and a disease outcome (for example, lung cancer), between a treatment (for example, antidepressant medication) and an effect (for example, suicide), between a predictor (for example, SAT) and criterion (for example, college GPA), or between a dose (for example, hours of psychotherapy) and either a single response (for example, psychological distress) or a response change (for example, obesity levels).

Many primary studies have relatively small sample sizes and thus low statistical power to detect a meaningful population effect. It is thus commonly hoped that meta-analysis will increase the precision with which an association is estimated. The relevant intuition here is that combining estimates from multiple parallel studies (or exact replications) will increase the total number of primary units for analysis (for example, persons), thus reducing the sampling error of an association estimate. However, parallel studies are rare. More typical are studies of an association that differ in many ways, including how samples of persons were selected, such that it is unreasonable to assume they come from the same population. So we need to investigate how inferences about associations are threatened when a meta-analysis is conducted. This is especially important when the studies are few in number and heterogeneous in the samples of persons, settings, interventions, outcomes, and times. In our experience, most associations with which meta-analysts deal are causal, of the form that manipulating one entity (that is, treatment) will induce variability in the other (that is, outcome). So a second reason for conducting research syntheses is to examine how causal inference is threatened or enhanced in meta-analyses. A third rationale is to strengthen the generalization of an association (causal or otherwise) that has been examined across the available primary studies, each with its own time frame, means of selecting persons and settings, and of implementing the treatment and outcome measures.
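The precision intuition is easy to make concrete. Under the simplest possible assumptions (parallel studies of equal size estimating the same effect with equal variance), pooling k studies shrinks the standard error of the combined estimate by a factor of the square root of k. A toy Python sketch, with an invented single-study standard error:

```python
# Toy illustration of the precision rationale: pooling k parallel
# studies with equal variance divides the standard error by sqrt(k).
# The single-study standard error below is invented.
import math

se_single = 0.20
for k in (1, 4, 16):
    print(f"k = {k:>2} studies -> pooled SE = {se_single / math.sqrt(k):.3f}")
```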
540 SUMMARY

In meta-analytic practice, conclusions are rarely war- encounter sampling bias. So when two labels are possible
ranted by statistical sampling theory or even strong ex- we prefer to formulate a threat in the form closest to ac-
planatory theories. Instead, heavy reliance is placed on tual meta-analytic practice.
less familiar principles for exploring generalizations A list of validity threats can be used in two major ways.
about specific target universes, for identifying factors that Retrospectively, it can help discipline claims emerging
might moderate and mediate an association, and for ex- from a research synthesis by identifying which threats are
trapolating beyond whatever universes are in the avail- plausible in light of current knowledge and by indicating
able database (Cook 1990, 1993). We need to examine the additional evidence needed to assess the plausibility
these other, less well-known principles and the threats to of a threat and thereby evaluate how well the causal claim
validity that can be extracted from them. is justified. Second, validity threats can also be used pro-
spectively for planning a research synthesis. The purpose
28.1.2 Validity Threats then is to anticipate as many potential threats as possible
and to design the research so as to rule them out. Identify-
In the meta-analytic context, validity threats describe fac- ing threats prospectively is better than retrospectively,
tors that can induce spurious inferences about the exis- though each is valuable.
tence or size of an association, about the likelihood it is
causal, or about its generalizability. Meta-analytic infer-
ences are more compelling, the better the identified threats 28.1.3 Generalized Inferences
have been ruled out, either because they are empirically
implausible or because other evidence indicates they did The generalized inferences possible in a research synthe-
not operate in the research under consideration. Ruling sis are different from those in a primary study because, in
out alternatives requires a framework of validity threats, the latter, inferences are inextricably limited to the rela-
some taken from formal statistical models, others from tively narrow ways in which times, research participants,
theories about generalization, and others emerging as em- settings, and cause and effects are usually sampled or
pirical products born of critical reflection on the conduct measured. In contrast, research syntheses involve multi-
of research syntheses. Any list of validity threats so con- ple samples of diverse persons and settings as well as nu-
structed is bound to be subject to change. Indeed, such merous treatment implementations and outcome mea-
change is desirable, reflecting improvements in theories sures, collected at unique times and over varying periods.
of method and in critical discourse about the practice of This potential for heterogeneous sampling provides the
research synthesis. New threats should emerge as others framework for generalizing to broader classes or uni-
die out, because they are rare in actual research practice. verses than a single study makes possible. But why is
Most of the threats listed here are in the previous lists of this, given that the most secure path to generalization
Georg Matt and Thomas Cook (1994) and William Shad- formal sampling theorydoes not generate the hetero-
ish, Thomas Cook, and Donald Campbell (2002), but geneity that meta-analysts typically find and use for gen-
some are new and a few have dropped out. eralization purposes? Interestingly, even advocates of
Some threats relevant to research syntheses also apply random selection use purposive selection for choosing
to individual primary studies, such as unreliability in the items they put into their surveys. In so doing they rely
measuring outcomes. Others are extensions to threats that on theories of construct validity to justify generalizing
apply to primary studies, but are now reconfigured to re- from the items selected to the more general constructs
flect how a threat is distributed across studies, such as the these measures are intended to represent. Our theory of
failure to find any studies that assign treatment at random. generalization in the meta-analytic context depends on
Other threats are unique to meta-analyses because they factors ultimately derived from theories of construct va-
depend on how estimates of association are synthesized lidity rather than formal sampling. Each of these theories
across single studies, such as publication bias. These last uses sampling particulars to justify inferences about ab-
have higher-order analogs in general principles of re- stract entities. Construct validity uses systematic sam-
search design so that publication bias can be considered a pling to warrant inferences about the constructs a mea-
particular instance of the more general threat of sampling sure or manipulation represents. Sampling theory uses
bias. However, publication bias is application focused, random sampling to warrant generalizing from a sample
one of the highly specific ways in which meta-analysts to the populations it represents. How then does construct
How then does construct validity justify general inferences? Thomas Cook proposed five useful principles (1990, 1993).

28.1.3.1 Demonstrating Proximal Similarity. Donald Campbell first introduced the principle of proximal similarity in the context of construct validity (1986). For research syntheses, proximal similarity refers to the correspondence in attributes between an abstract entity, such as an outcome construct or a class of settings, about which inferences are sought and the particular instances of the entity captured by the studies included in the meta-analysis. To make inferences about target entities, meta-analysts first seek to identify which of the potentially relevant studies are indeed members of the target class. That is, their attributes include those that theoretical analysis indicates are prototypical of the construct being targeted (Rosch 1973, 1978; Smith and Medin 1981). For example, if the target population is of persons (let us say, Asian Americans), then we need to know whether the participants have ancestors who come from certain nations. If we want to generalize to factories, we want to know whether the research actually took place in factory settings. Do the studies all occur in the 1980s, the historical period to which generalization is sought? Are the interventions instances of the category labeled decentralized decision making, in that all affected parties are represented in deliberations about policy? Are the outcome operations plausible measures of the higher-order construct productivity, in that a physical output is measured and related to standards about quantity or quality? These kinds of questions about category membership are obvious and the sine qua non of any theory of representation. Alas, though, category membership justified by proximal similarity is not sufficient for inferring that the sampled instances represent the target class. More is needed.

28.1.3.2 Exploring Heterogeneous and Substantively Irrelevant Third Variables. In his theory of construct validity, Donald Campbell also required ruling out theoretical and methodological irrelevancies (1986). These are features of the sampled particulars that substantive experts deem irrelevant to a category but that can be part of the sampling particulars used to represent that category. Analysts must therefore have a theory of which aspects of a target entity are prototypical, or central to its understanding, as well as which aspects are peripheral and do not make much of a conceptual difference. If it can be shown that these substantive irrelevancies make no difference, then an association is deemed robust with respect to these features. As a result, the generalization is strengthened because the association being tested does not depend on largely or totally irrelevant features that happen to be correlated with the prototypical features. Research syntheses benefit from heterogeneous implementations across primary studies because this makes the irrelevancies associated with each construct heterogeneous. In the relatively rare case when a synthesis depends on primary studies that share the same substantive irrelevancy, then that attribute does not vary and is confounded with the target class. For instance, if all the Asian Americans in a particular set of studies were Vietnamese or male, then the general Asian American category would be confounded with a single nation of origin or with a single gender. Similarly, if an association is limited to studies of public schools that are all in affluent suburbs with volunteer participants, then public schools are confounded with both the suburban location and the volunteer respondents. If all the measures of productivity tap into the quantity but not the quality of what has been produced, then the productivity concept would be confounded with its quantity component. The obvious requirement for unconfounding the targets of interest and the way they are assessed is ensuring that the studies sampled for meta-analysis vary in the country of origin among Asian Americans, in location among public schools, and in how productivity is measured.

28.1.3.3 Probing Discriminant Validity. Even ruling out the role of theoretical irrelevancies is not enough. Any relationship that is not robust across substantively relevant differences invites the analyst to identify the conditions that moderate or mediate the association. Research syntheses include primary studies with their own populations of persons, their own categories of setting, their own period, and their own theoretical descriptions of what a treatment or outcome class is and how it should be manipulated or measured. The issue is not then to generalize to a single common population; it is to probe how general the relationship is across different populations and to determine the specific conditions under which it varies. Does the effect of antidepressant medication hold in children, adolescents, adults, and the elderly? Is the effect of secondhand smoke different in pre- and postmenopausal women? Is the effect of psychotherapy conducted in clinical practice settings different from that in university research settings? Questions such as these address the generalizability of meta-analytic findings by probing their discriminant validity.

28.1.3.4 Studying Empirical Interpolation and Extrapolation. The preceding discussion deals only with situations where the synthesized studies contain instances of the classes about which we want to generalize.
However, in some cases, we seek to extrapolate beyond the available data and draw conclusions about domains not yet studied (Cronbach 1982). This is an intrinsically difficult enterprise, but even here it is plausible to assert that research syntheses have advantages over primary studies. Consider the case of a relationship that is demonstrably robust across a wide range of conditions. The analyst may then want to induce that the same relationship is likely to hold in other, as-yet unstudied, circumstances, given that it has held so robustly over the wide variety of circumstances already examined. Consider, next, the case in which some specific causal contingencies have been identified for which the association does and does not hold. This knowledge can then be used to help identify specific situations in which extrapolation is not warranted, the more so if a general theory can be adduced to explain the variation in effectiveness. Now, imagine a sample of studies that is relatively homogeneous on some or all attributes. There is then no evidence of the association's empirical robustness or of the specific circumstances that moderate when it is observed, making inductive extrapolation to other universes insecure. Of course, conceptual examination is always required to detect moderating variables, even with a heterogeneous collection of studies. For instance, many pharmaceutical drugs are submitted to regulatory agencies for approval with adults and may not have been tested with adolescents, young children, or the elderly. Yet individual physicians make off-label prescriptions where they extrapolate from approved groups, such as adults, to those that have not been studied, such as adolescents, children, and the elderly. Indeed, some experts estimate that 80 to 90 percent of pediatric patients are prescribed drugs off-label, requiring thoughtful physicians to assume that the relevant physiological processes do not vary by age, an assumption that may or may not be true for a given pharmaceutical agent (Tabarrok 2000). Extrapolation depends here, not just on the robustness of sampling plans and results, but also on substantive theory about age differences in physiological processes.

28.1.3.5 Building on Causal Explanations. In the experimental sciences, generalization is justified, not from sampling theory, but from attempts to achieve complete understanding of the processes causally mediating an association. The crucial assumption is that, once these processes have been identified, they can be set in motion by multiple causal agents, including some not yet studied but plausible on theoretical or pragmatic grounds. Indeed, the public is often asked to pay for basic science on the grounds that ameliorative processes will be discovered that can be instantiated in many ways, making them viable across a wide range of persons and settings today and tomorrow. Research syntheses can enhance our understanding of conditions mediating the magnitude and direction of an association because the variety of persons, settings, times, and measures available in the database will be greater than any single study can capture. The crucial issue then becomes what the obstacles (that is, threats) are that arise when using meta-analysis to extrapolate findings by identifying the processes by which a class of interventions produces an effect. This is a difficult issue to pursue because explicit examples of causal explanation remain rare in meta-analysis, the concern having been more with identifying causal relationships and some of their moderating rather than mediating factors (for example, Becker 2001; Cook et al. 1992; Harris and Rosenthal 1985; Premack and Hunter 1988; Shadish and Sweeney 1991). But even so, moderation provides clues to mediation through the way results are patterned, and many meta-analyses do include individual studies making claims about causal mediation. It would be stretching the point to claim that detecting causal mediation is a strength of meta-analysis today. It is not overly ambitious, however, to contend that the potential for explanation is greater in meta-analyses than in individual studies, even explicitly explanatory ones.

The five principles we have just enumerated for strengthening generalized inference in meta-analyses are not independent. None is sufficient for generalized causal inference as understood here, and probably only proximal similarity is necessary. However, taken together, they describe current efforts at generalization within a synthesis framework, and they are warranted through their link to the generalization afforded by theories of construct validity rather than formal sampling or well-corroborated and well-demarcated substantive theory. Taken together, the principles suggest feasible strategies for exploring and justifying conclusions about the generalization of an association. They are more relevant for conclusions about what the sampling particulars represent and what the limits of generalization might be, however, than they are for the intrinsically more problematic task of extrapolating to unstudied entities.

Our discussion follows from the benefits research syntheses promise. We first discuss threats relevant to meta-analyses that primarily seek to describe the degree of association between two variables, and those relevant to the inference that a relationship is causal in the manipulability or activity theory sense (Collingwood 1940; Gasking 1955; Whitbeck 1977).
We then turn our attention to generalization, beginning with those validity threats that apply when generalizing to particular target populations, constructs, categories, and the like. We examine generalization threats pertinent to moderator variable analyses that concern the empirical robustness of an association and thus affect its strength and even direction (Mackie 1974). Finally, we examine those generalization threats that apply when meta-analysis is used to extrapolate to novel, as-yet unstudied universes. Threats discussed in the earlier sections almost always apply to later sections that are devoted to the increasingly more complex inferential tasks of causation and generalization. To avoid repetitiveness, we discuss only the unique threats relevant to the later sections.

28.2 THREATS TO INFERENCES ABOUT THE EXISTENCE OF AN ASSOCIATION BETWEEN TREATMENT AND OUTCOME CLASSES

The following threats deal with issues that may lead to erroneous conclusions about the existence of a relationship between two classes of variables, including an independent (that is, treatment) and dependent (that is, outcome) variable. See table 28.1 for a list of these threats. In hypothesis-testing language, these threats lead to type I or type II errors emanating from deficiencies in either the primary studies or the research synthesis process. A solitary deficiency in a single study is unlikely to jeopardize meta-analytic conclusions in any meaningful way. More problematic is whether a specific deficiency operates across all or most of the studies being reviewed, or whether different deficiencies fail to cancel each other out across studies, such that one direction of bias predominates, either to over- or underestimate an association (see chapter 8, this volume). The need to rule out these possibilities leads to the list of validity threats in table 28.1.

Table 28.1  Threats to Inferences About Existence of Association Between Treatment and Outcome Classes

(1) Unreliability in primary studies
(2) Restriction of range in primary studies
(3) Missing effect sizes in primary studies
(4) Unreliability of codings in meta-analyses
(5) Capitalizing on chance in meta-analyses
(6) Biased effect size sampling
(7) Publication bias
(8) Bias in computing effect sizes
(9) Lack of statistical independence among effect sizes
(10) Failure to weight effect sizes proportional to their precision
(11) Underjustified use of fixed or random effects models
(12) Lack of statistical power for detecting an association

source: Authors' compilation.

28.2.1 Unreliability in Primary Studies

Unreliability in implementing or measuring treatments and in measuring outcomes attenuates the effect-size estimates from primary studies and the average effect-size estimates computed across such studies. Attenuation corrections have been proposed and, if their assumptions are accepted, these corrections may be used to provide estimates that approximate what would have happened had treatments and outcomes been observed without error, or had the error been constant across studies (Hunter and Schmidt 1990; Rosenthal 1991; Wolf 1990; see also chapter 17, this volume). There are limits, however, to what can be achieved because the implementation of interventions is less understood than the measurement of outcomes, and is rarely documented well in primary studies. Some primary researchers do not even report the reliability of the scores on the outcomes they measured. These practical limitations impede comprehensive attempts to correct individual effect size estimates, and force some meta-analysts either to ignore the impact of unreliability or to generate (untested) assumptions about what the missing reliability might have been in a given study.
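The corrections cited above (Hunter and Schmidt 1990; see also chapter 17, this volume) are fully developed elsewhere; as a concrete illustration only, the following Python sketch applies the classical attenuation formulas to made-up numbers. A correlation is disattenuated by the square root of the product of the two reliabilities, and a standardized mean difference by the square root of the outcome reliability; none of the values below come from any actual study.

    import math

    def correct_r(r_obs, rel_x, rel_y):
        """Disattenuate a correlation for unreliability in both variables:
        r_true = r_obs / sqrt(rel_x * rel_y)."""
        return r_obs / math.sqrt(rel_x * rel_y)

    def correct_d(d_obs, rel_y):
        """Disattenuate a standardized mean difference for outcome
        unreliability: d_true = d_obs / sqrt(rel_y)."""
        return d_obs / math.sqrt(rel_y)

    # Hypothetical values: an observed r of .30 with reliabilities .80 and .70,
    # and an observed d of .45 with outcome reliability .81.
    print(round(correct_r(0.30, 0.80, 0.70), 3))   # 0.401
    print(round(correct_d(0.45, 0.81), 3))         # 0.5

The sketch also makes the chapter's caveat visible: when a study does not report reliability, the analyst must invent rel_y, and the "corrected" estimate is only as good as that guess.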
28.2.2 Restriction of Range in Primary Studies

When the range of a treatment or predictor variable is restricted, effect-size estimates are reduced. In contrast, if the outcome variable range is restricted, the within-group variability is reduced. Restricting range on the outcome thereby decreases the denominator of the effect size estimate, d, and thus increases effect-size estimates. For example, if a weight-loss researcher limits participation to subjects who are at least 25 percent but not more than 50 percent overweight, within-group variability in weight loss will likely drop, increasing the effect size estimate by decreasing its denominator. Thus, range restrictions in primary studies can attenuate or inflate effect-size estimates, depending on which variable is restricted. As Wolfgang Viechtbauer pointed out, two standardized effect sizes based on the same outcome measures from two studies could be incommensurable if the samples were drawn from populations with different variances (2007). This problem can be avoided if the raw units are consistent across the studies of a meta-analysis, making it unnecessary to standardize the effect size, as was the case when Jean Twenge and Keith Campbell used the Rosenberg Self-Esteem Scale and the Coopersmith Self-Esteem Inventory to conduct a meta-analysis of birth cohort differences in adults and children, respectively (2001).

Given the possibility of counterbalancing range restriction effects within individual studies and across all studies in a research synthesis, it is not easy to estimate the overall impact of range restriction. Nonetheless, meta-analysts have suggested adjustments to mean estimates of correlations (and their standard error) to control for the attenuation due to range restriction (see chapter 17, this volume). When valid estimates of the population range or variance are available, these adjustments provide reasonable estimates of the correlation that would be observed in a population with such variance. Adjusting for range restriction in a manipulated variable (that is, a carefully planned intervention) for which population values may not be known, however, is more problematic.
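The inflation half of this threat is easy to demonstrate by simulation. The sketch below uses hypothetical numbers loosely echoing the weight-loss example: screening out extreme outcome values leaves the mean difference roughly intact while shrinking the pooled standard deviation, so d rises.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical weight-loss outcome (kg lost) in two arms.
    treat = rng.normal(6.0, 8.0, 100_000)
    ctrl = rng.normal(2.0, 8.0, 100_000)

    def cohens_d(a, b):
        # Standardized mean difference with a pooled SD denominator.
        sp = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        return (a.mean() - b.mean()) / sp

    print(round(cohens_d(treat, ctrl), 2))   # about 0.50 with the full range

    # Restrict the outcome range within each arm, as when eligibility criteria
    # screen out extreme cases: keep values within one SD of each arm's mean.
    t_r = treat[np.abs(treat - treat.mean()) < 8.0]
    c_r = ctrl[np.abs(ctrl - ctrl.mean()) < 8.0]
    print(round(cohens_d(t_r, c_r), 2))      # about 0.93: same gap, smaller SD

The same arithmetic run in reverse, restricting the treatment or predictor variable instead, shrinks the numerator and deflates d, which is why the direction of the bias depends on which variable was restricted.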
28.2.3 Missing Effect Sizes in Primary Studies

Missing effect sizes occur when study reports fail to include findings for all the sampled groups, outcome measures, or times. This sometimes happens because of space limitations in a journal or because a research report is restricted to only part of the overall study. Some effect sizes also go unreported because the findings were not statistically significant, thus inflating the average effect size from a meta-analysis. Otherwise, the impact of unreported effect sizes will vary, depending on why an author decided to include some findings and exclude others.

To prevent this kind of bias, it is important to code the most complete version of a report (such as dissertations and technical reports rather than published articles) and to contact study authors to obtain additional information, as Steven Premack and John Hunter (1988) and George Kelley, Kristi Kelley, and Zung Vu Tran (2004) did with some success. However, this last strategy is not always feasible if authors cannot be found or no longer have the needed information. Meta-analysts must then code research reports for evidence of missing effect sizes and use sensitivity analyses to explore how the data known to be missing might have affected overall findings. This can involve examining whether studies with little or no missing data yield findings comparable to studies with more missing data, or with missing data patterned in a particular way.

Another approach is to pursue the imputation strategies discussed in chapter 21 of this handbook, though it is clear that they hold only under certain model assumptions. When these assumptions cannot be convincingly justified, the impact of these procedures must be questioned. An example comes from early meta-analyses of the effects of psychotherapy, where it was assumed that the most likely estimate for a missing effect size was zero, based on the assertion that researchers failed to report only those outcomes that did not differ from zero (that is, were not statistically significant). Though this assumption has never been comprehensively tested, David Shapiro and Diana Shapiro coded 540 unreported effect sizes as zero and then added them to 1,828 reported effects, reducing the average effect from .93 to .72 (1982). But effect sizes may be unreported for many other reasons (for example, when they are negative, or positive but associated with unreliable measures), raising concern about the assumption that they average zero. For a general introduction to multiple imputation methods, see Roderick Little and Donald Rubin (2002), Donald Rubin (1987), and Sandip Sinharay et al. (2001). For applications in meta-analysis, see Alexander Sutton (2000), George Kelley (2004), William Shadish et al. (1998), and Carolyn Furlow and Natasha Beretvas (2005).
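The Shapiro and Shapiro figure is simply a weighted average, which is worth verifying: treating the unreported effects as zeros pulls the mean down mechanically.

    # Reported figures from Shapiro and Shapiro (1982), as cited in the text.
    n_reported, mean_reported = 1828, 0.93
    n_missing, assumed_missing = 540, 0.0   # the contested zero assumption

    combined = (n_reported * mean_reported + n_missing * assumed_missing) \
               / (n_reported + n_missing)
    print(round(combined, 2))   # 0.72, matching the reduction from .93

Substituting any other assumed value for the 540 missing effects into the same formula is the quickest form of the sensitivity analysis the text recommends.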
28.2.4 Unreliability of Codings in Meta-Analyses

Meta-analytic data are the product of a coding process susceptible to human error. Unreliability at the level of the research synthesis (unreliable determination of means, standard deviations, and sample sizes, for example) is not expected to bias average effect size estimates. This is because, in classical measurement theory, measurement error is independently distributed and uncorrelated with true scores. It will, however, inflate the variance of the observed effect sizes, increasing estimates of standard error and reducing statistical power for hypothesis tests. David Wilson discusses several strategies for controlling and reducing error in coding (chapter 9, this volume). In our experience, pilot testing the coding protocol, comprehensive coder training, engaging coders with expertise in the substantive area being reviewed, consulting external literature, contacting primary authors, using reliability estimates as controls, generating confidence ratings for individual codings, and conducting sensitivity analyses are all helpful strategies for reducing and controlling for error in data coding.
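Using reliability estimates as controls presupposes that coder agreement has actually been quantified. One common chance-corrected index for categorical codings is Cohen's kappa; the sketch below computes it from scratch for two hypothetical coders rating the design of ten invented studies.

    from collections import Counter

    def cohens_kappa(codes_a, codes_b):
        """Chance-corrected agreement between two coders on the same items."""
        n = len(codes_a)
        observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
        pa, pb = Counter(codes_a), Counter(codes_b)
        # Agreement expected by chance, from each coder's marginal rates.
        expected = sum((pa[c] / n) * (pb[c] / n) for c in pa)
        return (observed - expected) / (1 - expected)

    a = ["RCT", "RCT", "quasi", "RCT", "quasi",
         "RCT", "quasi", "RCT", "RCT", "quasi"]
    b = ["RCT", "quasi", "quasi", "RCT", "quasi",
         "RCT", "quasi", "RCT", "RCT", "RCT"]
    print(round(cohens_kappa(a, b), 2))   # 0.58 for these made-up codings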
28.2.5 Capitalizing on Chance in Meta-Analyses

Although research syntheses may combine findings from hundreds of studies and thousands of respondents, they are not immune to inflated type I error when many statistical tests are conducted without adequate control for error rate, a problem exacerbated in research syntheses with few studies. Typically, meta-analysts conduct many analyses as they probe the robustness of an effect size across various methodological and substantive characteristics that might moderate effect sizes, and as they otherwise explore the data. To reduce capitalizing on chance, researchers must adjust error rates, examine families of hypotheses in multivariate analyses, or stick to a small number of a priori hypotheses.
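The bluntest of these remedies, adjusting error rates, can be as simple as a Bonferroni correction: divide the nominal alpha by the number of tests in the family. A sketch with illustrative p-values from a hypothetical battery of moderator tests:

    # Six moderator tests treated as one family (p-values are invented).
    alpha = 0.05
    p_values = [0.003, 0.021, 0.048, 0.004, 0.150, 0.012]

    threshold = alpha / len(p_values)            # about 0.0083
    flagged = [p for p in p_values if p < threshold]
    print(round(threshold, 4), flagged)          # only .003 and .004 survive

Note that two results that look significant at the unadjusted .05 level (.021 and .048) no longer do once the family-wise error rate is controlled, which is precisely the inflation this threat describes.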
28.2.6 Biased Effect-Size Sampling

Research reports frequently present more than one estimate, especially when there are multiple outcome measures, multiple treatment and control groups, and multiple delayed assessment time points. Some of these effect estimates may be irrelevant for a particular topic, and some of the relevant ones will be substantively more important than others. Meta-analysts must then decide which estimates will enter the meta-analysis. Bias occurs when estimates are selected that are as substantively relevant as those not selected but that have different average effect sizes. Georg Matt discovered this when three independent coders recoded a subsample of the studies in Mary Lee Smith, Gene Glass, and Thomas Miller's meta-analysis of the benefits of psychotherapy (Matt 1989; Smith, Glass, and Miller 1980). Following what seemed to be the same rules as Smith and her colleagues for selecting and calculating effect estimates, the recoders extracted almost three times as many effect estimates, whose mean effect size was approximately .50, compared to the .90 from the original Smith, Glass, and Miller codings. Rules are required in each meta-analysis that clearly indicate which effect estimates to include, and best practice is to specify these rules before data collection begins. If several plausible rules are identified, it is then important to examine whether they lead to the same results. Although specifying such rules is relatively easy, implementing them validly (and reliably) depends on training coders to identify relevant effect sizes within studies. Data analyses should also explore for possible coder differences in results.

28.2.7 Publication Bias

Some studies are conducted but never written up; of those written up, some are not submitted for publication; and of those submitted, some are never published. Publication bias exists when the average effect estimate from published studies differs from that of the population of studies ever conducted on the topic. Anthony Greenwald and Robert Rosenthal have both argued that published studies in the behavioral and social sciences are likely to be a biased sample of all the studies actually carried out, because studies with statistically significant findings supporting a study's hypotheses are more likely to be submitted for publication and ultimately published, given reviewer and editor biases against the null hypothesis (Greenwald 1975; Rosenthal 1979). An extension of this bias may even operate among published studies if those that are easier to identify and retrieve have different effect sizes than other studies. This could arise because of the greater visibility of studies in major publications that are well abstracted by the major referencing services, because authors in major outlets are better connected in professional networks, or because the paper's title or abstract is more likely to contain keywords relevant for a particular meta-analysis. Unsuccessful replications by lesser-known researchers are not likely to appear in major journals, however appropriate they might be for a conference or a more obscure publication.

Strenuous attempts should be made to find unpublished or difficult-to-retrieve studies, and separate effect-size estimates should be calculated for published and unpublished studies as well as for studies that varied in how difficult they were to locate. The chapters in this volume on scientific communication (4), reference databases (5), the fugitive literature (6), and publication bias (23) all provide additional suggestions for dealing with publication bias (see also Rothstein, Sutton, and Borenstein 2005).

Several important developments should be stressed. First, research registries are particularly promising for avoiding publication biases, especially in research domains under the influence of regulatory agencies, such as the Food and Drug Administration. Second, the Cochrane Collaboration has taken the idea of research registries one step further, developing a registry for prospective research syntheses, in which eligible studies for a meta-analysis are identified and evaluated before their findings are even known. Third, meta-analysts can now rely on a set of powerful exploratory data analysis methods to detect biases in published studies (Egger et al. 1997; Sterne and Egger 2001; Sutton, Duval et al. 2000; Sutton, Song et al. 2000). And, finally, if the evidence indicates that effect estimates depend on publication history, the likely consequences of this bias should be assessed with a variety of methods, including Larry Hedges and Jack Vevea's selection method approach (Vevea and Hedges 1995; Hedges and Vevea 1996; Vevea, Clements, and Hedges 1993; Vevea and Woods 2005) and Susan Duval's trim and fill method (Duval and Tweedie 2000; Sutton, Duval et al. 2000).
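Among the exploratory methods cited, funnel-plot asymmetry tests are the most widely used. The sketch below runs an Egger-style regression of standardized effects on precision for ten invented studies; a clearly nonzero intercept is the usual warning sign of small-study effects. The formal test with its standard errors is left to dedicated meta-analysis software.

    import numpy as np

    # Invented effect sizes and standard errors; the smaller (high-se)
    # studies were deliberately given larger effects to mimic bias.
    d = np.array([0.61, 0.55, 0.48, 0.42, 0.50, 0.38, 0.35, 0.44, 0.30, 0.28])
    se = np.array([0.30, 0.28, 0.25, 0.22, 0.20, 0.17, 0.15, 0.12, 0.10, 0.08])

    # Egger-style regression: standardized effect against precision.
    z, precision = d / se, 1.0 / se
    slope, intercept = np.polyfit(precision, z, 1)
    print(round(intercept, 2), round(slope, 2))
    # A positive intercept here flags funnel asymmetry in these made-up data.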
28.2.8 Bias in Computing Effect Sizes

Meta-analyses often require transforming findings from primary studies into a common metric such as the correlation coefficient, a standardized mean difference, or an odds ratio (see chapter 12, this volume). This is required because studies differ in the type of quantitative information they originally provide, and bias results if some types of transformation lead to systematically different estimates of average effect size or standard error when compared to others.

The best-understood case of transformation bias concerns the situation in which probability levels are aggregated across studies and some are truncated (Rosenthal 1990; Wolf 1990); for instance, when a study reports a group difference statistically significant at p < .05 without specifying the exact probability level. A conservative estimate of the observed effect size can be obtained by assuming that p = .05 and then finding the relevant critical value of the test statistic that would have been observed (given the appropriate degrees of freedom) at p = .05. Because it is known that the actual probability value was smaller than the one assumed, the transformed effect size is known to be conservatively biased.

Another case of potential bias involves studies that used dichotomous outcome measures or continuous variables that have been dichotomized (for example, improved versus not improved, convicted versus not convicted). Julio Sánchez-Meca, Fulgencio Marín-Martínez, and Salvador Chacón-Moscoso compared seven approaches to converting effect sizes derived from continuous and dichotomous outcome variables into a common metric. They concluded that the Cox- and probit-based effect-size indices showed the least bias across simulated population effect sizes, δ, ranging from 0.2 to 0.8 (2003). Whenever possible, meta-analysts are advised to calculate effect sizes directly from means, standard deviations, or the like rather than from approximations such as truncated p-levels or proportions improved. This will be possible for many studies, though not all. Simulation studies of the performance of different indicators, like that of Sánchez-Meca and his colleagues, can help make an informed choice, but when they are not available, meta-analysts should empirically examine whether estimates differ by the type of effect size transformation used.
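The truncated-p procedure is mechanical enough to show directly. Assuming a two-tailed report of p < .05 from a two-group comparison with 30 participants per group (all values illustrative), the conservative effect size is the critical t converted to d:

    from math import sqrt
    from scipy.stats import t

    n1, n2 = 30, 30
    df = n1 + n2 - 2

    t_crit = t.ppf(1 - 0.025, df)           # critical t at p = .05, two-tailed
    d_cons = t_crit * sqrt(1 / n1 + 1 / n2)  # d implied by that critical t
    print(round(d_cons, 2))                  # about 0.52 for these sample sizes

Because the true p was smaller than .05, the true t (and hence d) was larger than this reconstruction, which is what makes the estimate conservative rather than unbiased.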
28.2.9 Lack of Statistical Independence

Stochastic dependencies among effect sizes may influence average effect estimates and their precision (see chapter 19, this volume). The effect size estimates in a meta-analysis may lack statistical independence for at least four reasons: collecting data on multiple outcomes for the same respondents; comparing different interventions to a single control group, or different controls to a single intervention; calculating an effect estimate for each of several subsamples of persons within the same study (for example, women and men); and the same research team conducting multiple studies on the same topic (Hedges 1990). These situations can be conceptualized hierarchically, for instance, as multiple outcomes or repeated assessments of the same measure nested within each study. Ignoring or misspecifying the resulting covariance structure of effect sizes can lead to invalid estimates of mean effect sizes and their standard errors (see also chapter 18, this volume).

Leon Gleser and Ingram Olkin discuss several approaches to dealing with such dependencies (chapter 19, this volume). The simplest involves analyzing for each study only one of the set of possible correlated effects, for example, the mean or median of all effects, a randomly selected effect estimate, or the most theoretically relevant estimate (Lipsey and Wilson 2001). Another involves Bonferroni and ensemble-adjusted p-values. Although these strategies are relatively simple to apply, each is conservative and fails to take into account all of the available data. Larry Hedges and Ingram Olkin therefore developed a multivariate statistical framework in which dependencies can be directly modeled (Hedges and Olkin 1985; see also Raudenbush, Becker, and Kalaian 1988; Rosenthal and Rubin 1986). Bayesian and hierarchical linear modeling approaches have also been successfully applied to deal with multiple effect sizes within studies (Raudenbush and Bryk 1985; van Houwelingen, Arends, and Stijnen 2002; Saleh et al. 2006; Eddy, Hasselblad, and Shachter 1990; Sutton and Abrams 2001; Scott et al. 2007; Nam, Mengersen, and Garthwaite 2003; Hox and de Leeuw 2003; Prevost, Abrams, and Jones 2000; see also chapter 26, this volume). These multivariate models are more complex than the previously discussed univariate techniques. Some require estimates of the covariance structure among the correlated effect sizes that may be difficult to obtain, in part because of missing information in primary studies. Moreover, the gains in estimation due to using these multivariate techniques will often be small.
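The simplest of these options, one estimate per study, takes only a few lines; here the within-study mean stands in for a set of correlated effects (all values invented).

    import numpy as np

    # Effect sizes keyed by study; two studies contribute multiple
    # correlated estimates, e.g., from several outcome measures.
    effects = {
        "study_1": [0.40, 0.55, 0.48],
        "study_2": [0.10],
        "study_3": [0.62, 0.30],
    }

    # Collapse to one independent estimate per study before combining.
    per_study = {s: float(np.mean(es)) for s, es in effects.items()}
    print(per_study)   # {'study_1': 0.4767, 'study_2': 0.1, 'study_3': 0.46}

As the text notes, this remedy discards information; it buys independence at the price of precision, which is why the multivariate alternatives exist.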
28.2.10 Failure to Weight Effect Sizes Proportionally to Their Precision

Everything else being equal, studies with larger sample sizes yield effect-size estimates that are more precise (that is, have smaller standard errors) than studies with smaller ones. Simply averaging effect sizes of different precision may yield biased average effects and sampling errors even if each study's estimates are themselves unbiased (see chapter 14, this volume). Therefore, when effect sizes are combined, it has become common practice to use weighted averages, allowing more precise estimates to have a stronger influence on the overall findings. Larry Hedges and Ingram Olkin have shown that the optimal weight is the inverse of the sampling variance of an effect size (1985). This validity threat was common in early meta-analyses of psychotherapy outcomes but has become increasingly rare in recent years in meta-analyses using common effect-size metrics, for example, d, r, and odds ratios (Shapiro and Shapiro 1982; Smith, Glass, and Miller 1980).
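The Hedges and Olkin result translates directly into code: weight each effect by the inverse of its sampling variance, and the standard error of the weighted mean is the square root of the reciprocal of the summed weights. The effect sizes and variances below are illustrative only.

    import numpy as np

    d = np.array([0.30, 0.10, 0.45, 0.25])   # invented effect sizes
    v = np.array([0.02, 0.05, 0.10, 0.04])   # their sampling variances

    w = 1.0 / v                              # optimal fixed-effect weights
    d_bar = np.sum(w * d) / np.sum(w)        # precision-weighted mean effect
    se = np.sqrt(1.0 / np.sum(w))            # its standard error
    print(round(d_bar, 3), round(se, 3))     # 0.264 and 0.098

Compare the weighted mean of 0.264 to the unweighted mean of 0.275: here the gap is small, but it widens whenever the imprecise studies happen to share a direction of error.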
28.2.11 Underjustified Use of Fixed or Random Effects Models

When analyzing effect sizes, Larry Hedges and Ingram Olkin stressed the importance of deciding on a model with fixed or random effects (1985). Perhaps the most important difference between the two models concerns the inferences they allow. Fixed-effect models justify inferences about treatment effects only for the studies at hand. In their simplest form, they assume that all the studies in a meta-analysis involve independent samples of participants from the same population, and so it is legitimate to postulate a single but unknown effect size, estimates of which differ only as a result of sampling variability. In contrast, random-effects models justify inferences about a universe of studies from which the studies at hand have been sampled. Each study is assumed to have its own unique true effect and to be a sample realization from a universe of related, yet distinct studies. Thus, observed differences in treatment effects have two sources: different samples of participants (as in the fixed-effect model) and true differences in treatment effects between studies (compared to a single true effect size in the fixed model). This means that in the random-effects model, treatment effects are best represented as a distribution of true effects characterized by their expected value and variance.

These two models have important implications for the analysis and interpretation of effect size estimates (see chapters 15 and 16, this volume). The fixed-effect model is analytically simpler, requiring estimates of the specific fixed effects of interest (for example, mean effect sizes for treatments A, B, and C) and their precision. The random-effects model requires estimating the population variance associated with the universe of treatment effects, given the estimates observed in a sample of treatments and their precision. As a consequence of the different underlying statistical models, fixed-effects analyses limit inferences to the specific fixed levels of a factor that were included in a meta-analysis (for example, treatments A, B, and C). In contrast, random-effects models strive for inferences about the population of levels of a factor, given the samples of levels that were included in a meta-analysis (for example, a population of treatments consisting of A, B, C, F, . . . , Z).

The decision to assume a fixed- or random-effects model is primarily influenced by the substantive assumptions meta-analysts make about the processes generating an effect and about the desired inferences. That is, are the treatments, settings, outcomes, and participants examined in different studies sufficiently standardized and similar that they should be considered equivalent (that is, fixed) for purposes of interpreting treatment effects? Are we interested only in drawing inferences about the specific instances of treatments, settings, outcomes, and participants that were studied (that is, the fixed-effect model)? Or are we interested in drawing more general inferences about the classes of treatments, settings, and outcomes to which the specific instances that were studied belong (that is, the random-effects model)? Only secondarily is the decision influenced by the heterogeneity observed in the data and the results of a homogeneity test. This is particularly true because tests of homogeneity tend to have low statistical power in situations commonly found in research syntheses, for example, modest heterogeneity, unequal sample sizes, and small within-study sample sizes (Hardy and Thompson 1998; Harwell 1997; Hedges and Pigott 2001; Jackson 2006).

Assuming excellent statistical power, a fixed-effects model seems more reasonable if the hypothesis of homogeneity is not rejected and if no third variables can be adduced from theory or experience that may have caused (unreliable) variation in effect size estimates. However, if theory suggests a fixed population effect but a homogeneity test indicates otherwise, then two possible interpretations follow. One is that the assumption of a fixed population is wrong and that a random-effects model is more appropriate; the second is that the fixed-effects model is true, but the variables responsible for the increased variability are not known or adequately modeled and, as such, cannot account for the observed heterogeneity. Thus, there is no simple indicator of which model is correct. However, it is useful to examine as critically as possible all the substantive theoretical reasons for inferring a fixed- over a random-effects model, whether the homogeneity assumption is rejected or not.
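The homogeneity test in question is the Q statistic, and the DerSimonian-Laird estimator is one common way to quantify the between-studies variance that a random-effects model requires. A sketch with invented data:

    import numpy as np
    from scipy.stats import chi2

    d = np.array([0.80, 0.10, 0.45, 0.25, 0.60])   # invented effect sizes
    v = np.array([0.02, 0.05, 0.10, 0.04, 0.08])   # their sampling variances
    w = 1.0 / v

    d_fixed = np.sum(w * d) / np.sum(w)
    Q = np.sum(w * (d - d_fixed) ** 2)      # homogeneity statistic
    k = len(d)
    p = chi2.sf(Q, k - 1)                   # small p suggests heterogeneity

    # DerSimonian-Laird between-studies variance (method-of-moments).
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - (k - 1)) / c)
    print(round(Q, 2), round(p, 3), round(tau2, 4))
    # For these values, Q is about 9.4 on 4 df (p near .05), tau2 about 0.06.

The low power the text warns about is visible even here: five studies with visibly scattered effects produce a p-value hovering at the conventional threshold, so a nonsignificant Q alone is weak grounds for assuming a fixed effect.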
28.2.12 Lack of Statistical Power for Detecting an Association

Although the focus of meta-analyses is on estimating the magnitude of effects and their precision rather than null hypothesis testing, meta-analysts often report findings from hypothesis tests about the existence of an association. Under most circumstances, the statistical power for detecting an association in a meta-analysis is influenced by the number of studies, the sample sizes within studies, the type of assignment of units to experimental conditions in the primary studies (for example, cluster randomization), and (for analyses employing random-effects assumptions) the between-studies variance component (Hedges and Pigott 2001; Donner, Piaggio, and Villar 2003; Cohn and Becker 2003). When compared to statistical analyses in primary studies, tests of mean effect sizes will typically be more powerful in meta-analyses, particularly in fixed-effect models estimating the average effect of a class of interventions based on many similar studies. There are some paradoxical exceptions to this rule in random-effects models, for instance, when small studies add more between-studies variance than they compensate for by adding information about the mean effect (Hedges and Pigott 2001). Similarly, research syntheses of cluster randomization trials must take design effects into account to obtain correct estimates of statistical power (Donner, Piaggio, and Villar 2001, 2003).

Some meta-analysts are limited to a few studies, each having small sample sizes. Others are interested in examining effect sizes for subclasses of treatments and outcomes, different types of settings, and different subpopulations. Careful power analyses can help clarify the type II error rate of a test failing to reject a null hypothesis. The meta-analyst then has to decide which trade-off to make between the number and type of statistical tests and the statistical power of these tests.
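Following the logic of the power analyses cited above (Hedges and Pigott 2001), the power of the fixed-effect test of a mean standardized mean difference can be approximated in a few lines. The study count, per-arm sample size, and assumed true effect below are illustrative choices, not recommendations.

    import numpy as np
    from scipy.stats import norm

    k, n_per_arm = 8, 25    # hypothetical number of studies and arm size
    delta = 0.20            # assumed true standardized mean difference

    # Large-sample variance of d for equal-n two-group studies.
    v_i = 2 / n_per_arm + delta ** 2 / (4 * n_per_arm)
    se_mean = np.sqrt(v_i / k)          # SE of the fixed-effect pooled mean
    lam = delta / se_mean               # noncentrality of the z-test
    power = 1 - norm.cdf(1.96 - lam) + norm.cdf(-1.96 - lam)
    print(round(power, 2))              # about 0.51 for these inputs

Even eight small studies of a modest effect leave power near a coin flip, which illustrates why subgroup tests on slices of such a database can be badly underpowered.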
28.3 THREATS TO INFERENCES ABOUT THE CAUSAL NATURE OF AN ASSOCIATION BETWEEN TREATMENT AND OUTCOME CLASSES

Threats to inferences about whether an association between treatment and outcome classes is causal or spurious arise mostly out of the designs of primary studies. It is at the level of the individual study that the temporal sequence of cause and effect, randomization of units to conditions, and other design features aimed at strengthening causal inferences are implemented.

The following threats are in addition to those already presented that refer to the likelihood of an association between treatment and outcome classes. See table 28.2 for a list of these threats. The logic of causal inference is fairly straightforward at the primary study level, but is complicated at the research synthesis level because findings from partially flawed primary studies often have to be combined. Inferences about causation are not necessarily jeopardized by deficiencies in primary studies because, at least in theory, individual sources of bias in the primary studies may cancel each other out exactly when aggregated in a research synthesis. So at the research synthesis level, a threat arises only if the deficiencies within each primary study combine across studies to create a predominant direction of bias.

Table 28.2  Threats to Inferences About Causal Nature of Relationship Between Treatment and Outcome Classes

(1) Absence of studies with successful random assignment
(2) Primary study attrition

source: Authors' compilation.

28.3.1 Absence of Studies with Successful Random Assignment

In primary studies, unbiased causal inference requires establishing the direction of causality and ruling out third-variable alternative explanations. Inferring the direction of causality is easy in experimental and quasi-experimental studies where knowledge is usually available about the temporal sequence from manipulating the treatment to measuring its effects. In theory, third-variable alternative explanations are ruled out when participants in primary studies are randomly assigned to treatment conditions or if a regression-discontinuity study is done (Shadish, Cook, and Campbell 2002; West, Biesanz, and Pitts 2000).

In research syntheses of well-implemented randomized trials, the strong causal inferences at the primary study level are transferred to the research synthesis level. This is why, for example, the Cochrane Collaboration generally restricts its syntheses to randomized experiments. If randomization was poorly implemented or not done at all, then causal inferences are ambiguous. Additionally, the pervasive possibility of third-variable explanations arises, as in quasi-experimental studies of early childhood interventions and juvenile delinquency, where those participating in programs are often more disadvantaged and likely to mature at different rates than those in control groups (Campbell and Boruch 1975). If lack of random assignment in primary studies yields a predominant bias across these studies, causal inference at the level of the research synthesis is jeopardized.

One approach to this threat compares the average effect estimates from studies with random assignment to those from studies on the same question with more systematic assignment. If effect estimates differ, causal primacy must then be given to findings from randomized designs, assuming that the randomized and nonrandomized studies are equivalent on other characteristics. Although randomized experiments and quasi-experiments sometimes result in similar estimates, those that do not are notable (Shadish, Luellen, and Clark 2006; Jacob and Ludwig 2005; Boruch 2005; Lipsey 1992; Smith, Glass, and Miller 1980; Wittmann and Matt 1986; Chalmers et al. 1983). A second strategy involves the careful explication of possible biases caused by flawed randomization or associated with a particular quasi-experimental design in each primary study. Based on an adequate understanding of possible biases, adjustments for pretreatment differences between groups can sometimes be made to project causal effects, controlling for the identified biases. The problem here, of course, is justifying that all important biases have been identified and their operation has been correctly modeled. This latter approach is particularly useful when the first approach fails because too few primary studies with randomized designs exist.
28.3.2 Primary Study Attrition

Attrition of participants from treatment or measurement is common in even the most carefully designed randomized experiments and quasi-experiments. If attrition in these primary studies is differential across treatment groups, effect estimates may then be biased, inflating or deflating causal estimates. If the biases operating in each direction cancel each other out, then causal inference at the research synthesis level is not affected. However, if predominant bias persists across the primary studies, then causal inferences from the synthesis are called into question. For instance, Mark Lipsey's meta-analysis of juvenile delinquency interventions found that the more amenable juveniles might have dropped out of treatment groups and the more delinquent juveniles out of control groups (1992). The net effect at the level of the research synthesis is a potential bias.

To address this threat, it is critical to code, for each primary study, information about the level and differential nature of attrition from the treatment groups. The latter information cannot be inferred from the level of attrition, but requires the primary study to have reported the possible differential attrition of participants. If this information is available, the analyst can then examine the effect of attrition on effect estimates by disaggregating studies according to their level of total and differential attrition. In addition, sensitivity analyses can be conducted to explore whether the observed levels of attrition present plausible threats.
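The disaggregation strategy described here amounts to a subgroup comparison. A minimal sketch, with invented effect sizes, invented differential attrition figures (treatment minus control dropout, in percentage points), and an arbitrary 5-point cutoff:

    import numpy as np

    studies = [
        (0.55, 12), (0.48, 9), (0.52, 15),   # higher differential attrition
        (0.30, 2),  (0.28, 1), (0.33, 3),    # lower differential attrition
    ]

    high = [d for d, att in studies if att >= 5]
    low = [d for d, att in studies if att < 5]
    print(round(float(np.mean(high)), 2),    # 0.52
          round(float(np.mean(low)), 2))     # 0.30
    # A gap between the two means flags attrition as a plausible source of bias.

In practice the grouped means would be precision-weighted and the cutoff varied as part of the sensitivity analysis, but the logic is no more than this.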
28.4 THREATS TO GENERALIZED INFERENCES

Although some of the threats discussed in the previous sections are germane to generalization (such as underjustified use of fixed and random effects models), we now turn explicitly to generalization issues. Table 28.3 lists the threats discussed in this section.

Table 28.3  Threats to Generalized Inferences in Research Syntheses

Inferences to Target Constructs and Universes
(1) Sampling biases associated with the persons, treatments, outcomes, settings, and times
(2) Underrepresentation of prototypical attributes
(3) Restricted heterogeneity of substantively irrelevant third variables
(4) Mono-operation bias
(5) Mono-method bias
(6) Rater drift
(7) Reactivity effects

Inferences About Robustness and Moderating Conditions
(8) Restricted heterogeneity in universes of persons, treatments, outcomes, settings, and times
(9) Moderator variable confounding
(10) Failure to test for homogeneity of effect sizes
(11) Lack of statistical power for homogeneity tests
(12) Lack of statistical power for studying disaggregated groups
(13) Misspecification of causal mediating relationships

Extrapolations to Novel Constructs and Universes
(14) Misspecification of models for extrapolation

source: Authors' compilation.

28.4.1 Sampling Biases Associated with Persons, Treatments, Outcomes, Settings, and Times

Statistical sampling theory is sometimes invoked as the justification for generalizing from the obtained instances (or samples) of persons, treatments, outcomes, and settings to the universes or domains they are thought to represent, and about which researchers want to draw inferences. However, it is rare in individual primary studies for instances to be selected at random from their target universes, just as it is rare in research syntheses to randomly sample primary studies, treatments, outcomes, or settings from the universe of ever-conducted research on a given topic. Some exceptions are noteworthy, as when Robert Orwin and David Cordray (1985) and Georg Matt (1989) randomly sampled studies from those included in the Smith, Glass, and Miller (1980) meta-analysis of psychotherapy outcomes, assuming that Smith and her colleagues had a census of all studies or a random sample from the census. Perhaps the most feasible random selection in meta-analysis is when, as some methodologists have suggested (Lipsey and Wilson 2001), only one effect size is sampled per study so as to provide an unbiased estimate of the mean effect size per study.

More common study selection strategies are for meta-analysts to seek to collect the population of published and unpublished studies on a topic, or to restrict the selection of studies to all those with specific person, treatment, outcome, setting, or time characteristics of substantive importance (Mick et al. 2003; Moyer et al. 2002). In either of these circumstances, inferences from the samples to their target universes can be biased because the studies that were excluded may have different average effect sizes than those that were included. As a result, generalizations can easily be overstated, even if they are supported by data from hundreds of studies and thousands of research participants.
28.4.2 Underrepresentation of Prototypical Attributes

Research syntheses should start with the careful explication of the target constructs about which inferences are to be drawn, at a minimum identifying their prototypical attributes and any less central features at the boundaries with other constructs. Thus, it can be that the collection of primary studies in the research synthesis does not contain representations of all the prototypical elements. This was the case when William Shadish and his colleagues attempted to investigate the effects of psychotherapy in clinical practice (2000). They observed that many studies included a subset of prototypical features of clinical practice, but no single study included all of its features as they defined them. As a result, generalization to real-world psychotherapy practice is problematic, even though there are literally thousands of studies of the effectiveness of psychotherapy practice. In such a situation, the operations implemented in the primary studies force us to revise the construct about which inferences are possible, reminding us that we cannot generalize to the practice of psychotherapy as it is commonly conducted in the United States. An important task of meta-analysis is to inform the research community about the underrepresentation of prototypical elements of core constructs in the literature on hand, turning attention to the need to incorporate them into the sampling designs of future studies.

28.4.3 Restricted Heterogeneity of Substantively Irrelevant Third Variables

Even if sampling from the universes about which generalized inferences are sought were random, and if important prototypical elements were represented in the reviewed studies, a threat arises if a research synthesis cannot demonstrate that the causal association is robust and holds across substantively irrelevant characteristics. For instance, if the reviewed studies on the effectiveness of homework in middle school were conducted by just one research team, relied on voluntary participation by students, or depended on teachers being highly motivated, the threat would then arise that all conclusions about the general effectiveness of homework are confounded with substantively irrelevant aspects of the research context. To give an even more concrete example, if private schools are defined as those whose revenues come from student fees, donations, and the proceeds of endowments (rather than from taxes), it is irrelevant whether the schools are parochial or nonparochial; Buddhist, Catholic, Jewish, Muslim, or Protestant; military; or elite academic. To generalize to private schools in the abstract requires being able to show that relationships are not limited to one or a few of these contexts, say, parochial or military schools.

Limited heterogeneity of universes will also impede the transfer of findings to new universes (that is, extrapolation), because it hinders the ability to demonstrate the robustness of a causal relationship across substantive irrelevancies of design, implementation, or measurement method. The wider the range and the larger the number of substantively irrelevant aspects across which a finding is robust, and the better moderating influences are understood, the stronger the belief that the finding will also hold under the influence of not-yet-examined contextual irrelevancies.

28.4.4 Mono-Operation Bias

The coding and rating systems of research syntheses often rely exclusively on single items to measure such complex target constructs as setting, diagnosis, or treatment type. It is well known from the psychometric literature that single-item measures have poor measurement properties. They are notoriously unreliable, tend to underrepresent a construct, and are often confounded with irrelevant constructs. To address and improve on common measurement limitations of meta-analytic coding manuals, rigorous interrater reliability studies and standard scale development procedures have to become standard practice to allow valid inferences about the target constructs of a meta-analysis.

28.4.5 Mono-Method Bias

If the measurement method in a meta-analysis relies on a single coder who reads and codes a research publication following the operations delineated in a coding manual, bias is possible. To avoid it, coding procedures for meta-analyses have to incorporate multimethod coding approaches. This could include having multiple coders (perhaps with different substantive backgrounds related to the research topic) code all items of a coding manual, contacting the original authors to provide clarification and additional information, obtaining additional write-ups of a study, and relying on external, supplementary sources, such as descriptions of the psychometric properties of an outcome measure or of the allegiance and experience of a researcher (Robinson, Berman, and Neimeyer 1990). In the absence of such improvements, conclusions about important target constructs that rely on single-coder ratings remain suspect.
28.4.6 Rater Drift

Reading, understanding, and coding publications of multifaceted primary studies involve many cognitively challenging tasks. Over time, coders learn through practice, develop heuristics to simplify complex tasks, fatigue and become distracted, and may change their cognitive schemas as a result of exposure to study reports. As a consequence, the same coder may unknowingly change over time, such that earlier codings of the same evidence differ from later ones. To address this validity threat, rater drift needs to be monitored as part of a continuing coder training program. Moreover, changes in coding manuals should be made public to reflect changes in the understanding of a code, which may necessitate recoding studies that were examined under the earlier coding rules.

28.4.7 Reactivity Effects

A measure is said to be reactive if the measurement process itself influences the outcome (Webb et al. 1981). In the case of a research synthesis, the coding process itself may inadvertently influence the coding outcome. For instance, knowing that the author of a study is a well-known expert in the field, rather than an unknown novice, may predispose a coder to rate research design characteristics more favorably for the expert than for the novice. Similarly, knowing that the treatment yielded no benefits over the control condition may bias a coder to rate the implementation of the treatment condition more critically. To minimize reactivity biases, it is desirable to mask raters to any influence that could bias their codings, including authorship and study results. Further, as much as possible, attempts should be made to avoid reactivity biases by preferring codings that require as little rater inference as possible. For example, rather than asking raters to arrive at an overall code describing study quality, it would be better to ask them to code specific aspects of studies that pertain to quality (such as the method of allocating participants to groups and attrition), which are less likely to be reactive (see chapters 7 and 9, this volume).

28.4.8 Restricted Heterogeneity in Universes of Persons, Treatments, Outcomes, Settings, and Times

Inferences about the conditions moderating the magnitude and direction of an association are facilitated if the relationship can be studied for a large number of diverse persons, treatments, outcomes, settings, and times. This is the single most important potential strength of research syntheses over individual studies. Although meta-analyses of standardized treatments and specifically designated outcomes, populations, and settings may increase the precision of some effect size estimates, such restrictions hamper our ability to better understand the conditions under which such relationships can and cannot be observed. Rather than limiting a meta-analysis to a review of a single facet of the universe of interest or lumping together a heterogeneous set of facets, we encourage meta-analysts, sample sizes permitting, to explicitly represent and take advantage of such heterogeneities.

Such advice has implications for those observers who have suggested that causal inferences in general, and causal moderator inferences in particular, could be enhanced if meta-analysts relied only on studies with superior methodology, particularly randomized experiments with standardized treatments, manifestly valid outcomes, and clearly designated settings and populations (Chalmers et al. 1989; Chalmers and Lau 1992; Sacks et al. 1987; Slavin 1986). Such a strategy appears to be useful in areas where research is fairly standardized. However, other limitations arise because this standardization limits the heterogeneity in research designs, treatment implementations, outcome measures, recruitment strategies, subject characteristics, and the like. Hence it is not possible to examine empirically how robust a particular effect is when it has been obtained only in a restricted, standardized context.

Meta-analysis has the potential to increase one's confidence in generalizations to new universes (that is, extrapolation) if findings are robust across a wide range and large number of different universes. The more robust the findings and the more heterogeneous the populations, settings, treatments, outcomes, and times in which they were observed, the greater the belief that similar findings will be observed beyond the populations studied. If the evidence for stubborn empirical robustness can be augmented by evidence for causal moderating conditions, the novel universes in which a causal relationship is expected to hold can be even better identified.

The logical weakness of this argument lies in its inductive basis. That psycho-educational interventions with adult surgical patients consistently showed earlier release from the hospital across a broad range of major and minor surgeries, diverse respondents, and treatment providers throughout the 1960s, 1970s, and 1980s (Devine 1992) cannot logically guarantee that the same effect will hold for as-yet unstudied surgeries, or in the future. However, the robustness of the relationship does strengthen the belief that psycho-educational interventions will have beneficial effects with new groups of patients and novel surgeries in the near future (that is, extrapolation). Homogeneous populations, treatments, outcomes, settings, and times limit the examination of causal contingencies and robustness and, consequently, impede inferences about the transfer of findings to novel contexts.

28.4.9 Moderator Variable Confounding

At the synthesis level, moderator variables describe characteristics of the classes of treatments, outcomes, settings, populations, or times across which the magnitude or direction of a causal effect differs: a generalizability question. Claims about moderator variables are involved when a research synthesis concludes that treatment class A is less effective in population C than D, stronger with outcomes of type E than F, weaker in setting G than H, or positive in past times but not recently. Moderator variables are even involved when the claim is made in a synthesis that treatment type A is superior to treatment type B, because this assumes that the studies of A are equivalent to those of B on everything correlated with the outcome other than A versus B. Any uncertainty about this pertains to the role that average study differences might have played in achieving the obtained difference between studies of A and B; thus it is an issue of moderator variables.

Threats to valid inference about the causal moderating role in research syntheses are pervasive (Lipsey 2003; Shadish and Sweeney 1991). This is because participants in primary studies are rarely assigned to moderator conditions at random (for example, to outcome type E versus F or to settings G versus H).
28.4.9 Moderator Variable Confounding

At the synthesis level, moderator variables describe characteristics of the classes of treatment, outcomes, settings, populations, or times across which the magnitude or direction of a causal effect differs, a generalizability question. Claims about moderator variables are involved when a research synthesis concludes that treatment class A is less effective in population C than D, stronger with outcome of type E than F, weaker in setting G than H, or positive in past times but not recently. Moderator variables are even involved when the claim is made in a synthesis that treatment type A is superior to treatment type B, because this assumes that the studies of A are equivalent to those of B on everything correlated with the outcome other than A versus B. Any uncertainty about this pertains to the role that average study differences might have played in achieving the obtained difference between studies of A and B; thus it is an issue of moderator variables.

Threats to valid inference about the causal moderating role in research syntheses are pervasive (Lipsey 2003; Shadish and Sweeney 1991). This is because participants in primary studies are rarely assigned to moderator conditions at random (for example, to outcome type E versus F or to settings G versus H). Claims about the causal role of moderators are often questionable in syntheses, however, because any one moderator is often confounded with other study characteristics. For instance, studies of smoking cessation in hospital primary care facilities are likely to attract older participants than similar studies on college campuses. Likewise, behavioral treatments tend to involve behavioral outcome measures of a narrow target behavior, whereas client-centered interventions are likely to rely on broader measures of well-being and personal growth. If the moderator variable (for example, primary care setting versus college campus) is confounded with characteristics of the design, setting, or population (for example, age), differences in the size or direction of a treatment effect that might be attributable to the moderator are instead confounded with other attributes.

To deal with this possibility, meta-analysts should examine within-study comparisons of the moderator effect because these are obviously not prone to between-study confounds. For instance, if the moderating role of treatment types A, B, and C is at stake, a meta-analysis can be conducted of all the studies with internal comparisons of the treatment types (see Shapiro and Shapiro 1982). If the moderating role of subject gender is of interest, inferences about gender differences might depend on within-study contrasts of males and females rather than on comparisons of studies with only males or females or that assess the percentage of males. Statistical modeling provides another popular approach (Shadish and Sweeney 1991). The validity of the causal inferences based on such models depends on the ability of the meta-analyst to identify and reliably measure all confounding variables. Multivariate statistical adjustments can be informative here, but not definitive because of the difficult task of conceptualizing all such confounds and measuring them well across the studies under review.
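To make the confounding problem concrete, the following minimal sketch, written in Python with invented effect sizes, variances, and moderator codes, fits an inverse-variance weighted meta-regression of study effect sizes on a setting moderator, first alone and then adjusting for mean participant age. All data and variable names are hypothetical; the point is only to show how an adjusted moderator coefficient comes to rest on shaky ground when the moderator and a confounder are nearly collinear across studies.

    import numpy as np

    # Hypothetical per-study data: effect size d, sampling variance v,
    # a setting moderator (1 = primary care, 0 = college campus), and
    # mean participant age, which is confounded with setting.
    d = np.array([0.45, 0.52, 0.38, 0.41, 0.15, 0.20, 0.10, 0.18])
    v = np.array([0.02, 0.03, 0.02, 0.04, 0.02, 0.03, 0.02, 0.03])
    setting = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
    age = np.array([52.0, 58.0, 49.0, 55.0, 21.0, 23.0, 20.0, 22.0])

    def wls(X, y, w):
        """Weighted least squares: solve (X'WX) b = X'Wy."""
        XtW = X.T * w
        return np.linalg.solve(XtW @ X, XtW @ y)

    w = 1.0 / v  # fixed-effect inverse-variance weights

    # Model 1: setting as the sole moderator.
    X1 = np.column_stack([np.ones_like(d), setting])
    print("unadjusted setting effect:", round(wls(X1, d, w)[1], 3))

    # Model 2: setting plus the centered confounder. Because setting
    # and age are nearly collinear across these studies, the adjusted
    # coefficient depends almost entirely on model extrapolation.
    X2 = np.column_stack([np.ones_like(d), setting, age - age.mean()])
    print("age-adjusted setting effect:", round(wls(X2, d, w)[1], 3))

In a literature like the hypothetical one above, no amount of statistical adjustment can fully separate the setting explanation from the age explanation; only within-study contrasts can do that.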
28.4.10 Failure to Test for Homogeneity of Effect Sizes

Under a fixed-effect statistical model, a statistical test for homogeneity assesses whether the variability in effect estimates exceeds that expected from sampling error alone (Hedges 1982; Rosenthal and Rubin 1982). If the null hypothesis of homogeneous effect sizes is retained, a single population effect size provides a parsimonious model of the data, and the weighted mean effect size provides an adequate estimate. If the null hypothesis is rejected, the implication is that subclasses of studies may exist that differ in population effect size, triggering the search to identify the nature of such subclasses. Hence, heterogeneity tests play an important role in examining the robustness of a relationship and in initiating the search for potential moderating influences.
The failure to test for heterogeneity may result in lumping manifestly different subclasses of persons, treatments, outcomes, settings, or times into one class. This problem is frequently referred to as the "apples and oranges" problem of meta-analysis. However, Gene Glass, Robert Rosenthal, and Georg Matt have all argued that apples and oranges should indeed be mixed if the interest is in generalizing to such higher-order characteristics as fruit, or whatever else inheres in an array of different treatments, outcomes, settings, persons, and times (Glass 1977; Smith, Glass, and Miller 1980; Rosenthal 1990; Matt 2003, 2005). We should indeed be willing to combine studies of manifestly different subclasses of persons, treatments, outcomes, settings, or times if they yield equivalent results in reviews. In this context, the homogeneity test indicates when studies yield such different results that a single, common average effect size needs to be disaggregated through blocking on study characteristics that might explain the observed variance in effect sizes.

Coping with this threat is relatively simple, and homogeneity tests have become standard both as a way of identifying the most parsimonious statistical model to describe the data and as a way of testing model specification (see chapter 15, this volume). However, the likelihood of design flaws in primary studies, of publication biases, and the like makes the interpretation of homogeneity tests more complex (Hedges and Becker 1986; Schulze 2004). If all the studies being meta-analyzed share the same flaw, or if studies with zero and negative effects are less likely to be published, then a consistent bias results across studies and can make the effect sizes appear more homogeneous than they really are. Further, even if the collection of studies is not biased, a failure to reject the null hypothesis that observed effect sizes are a sample realization of the same population effect does not prove it. Conversely, if all the studies have different design flaws, effect sizes could be heterogeneous even though they share the same population effect. Obviously, the causes of heterogeneity that are of greatest interest are substantive rather than methodological. Consequently, it is useful to differentiate between homogeneity tests conducted before and after the assumption that all study-level differences in methodological irrelevancies have been accounted for has been defended.
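A minimal sketch of these computations may help fix ideas. The following Python fragment, with invented effect sizes and an invented blocking variable, computes the fixed-effect weighted mean, the homogeneity statistic Q, and the between-block portion of Q obtained when the studies are disaggregated. It illustrates the standard formulas only; it is not a full analysis.

    import numpy as np
    from scipy import stats

    # Hypothetical effect sizes d_i with sampling variances v_i,
    # blocked on a study characteristic (e.g., treatment subclass).
    d = np.array([0.60, 0.55, 0.48, 0.12, 0.08, 0.15])
    v = np.array([0.03, 0.02, 0.04, 0.03, 0.02, 0.04])
    block = np.array(["A", "A", "A", "B", "B", "B"])

    w = 1.0 / v
    d_bar = np.sum(w * d) / np.sum(w)     # weighted mean effect size
    Q = np.sum(w * (d - d_bar) ** 2)      # homogeneity statistic
    p = stats.chi2.sf(Q, df=len(d) - 1)   # Q ~ chi-square(k - 1) under H0
    print(f"weighted mean = {d_bar:.3f}, Q = {Q:.2f}, p = {p:.4f}")

    # Blocking on the moderator splits Q into within- and between-block
    # parts; a large between-block component signals a moderator.
    Q_within = sum(
        np.sum(w[m] * (d[m] - np.sum(w[m] * d[m]) / np.sum(w[m])) ** 2)
        for m in (block == g for g in np.unique(block))
    )
    Q_between = Q - Q_within
    p_b = stats.chi2.sf(Q_between, df=len(np.unique(block)) - 1)
    print(f"Q_between = {Q_between:.2f}, p = {p_b:.4f}")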

28.4.11 Lack of Statistical Power for Homogeneity Tests

A homogeneity test examines the null hypothesis that two or more population effect sizes are equal, thus probing whether the observed variability in effect sizes is more than would be expected from sampling error alone. It is thus the gatekeeper test for deciding to continue the search for variables that moderate the average effect size obtained in a review, a generalization task. But when these homogeneity tests have little statistical power, as is usually the case, their type II error rate will be large and lead to the erroneous conclusion that the search for moderator variables should be abandoned (Jackson 2006; Hedges and Pigott 2001; Harwell 1997). In sum, statistically nonsignificant homogeneity tests provide inconclusive evidence and should not be relied on exclusively to justify conclusions about the robustness of an association.
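The severity of the problem can be shown with a short calculation. Under the simplifying assumption that all k studies share the same sampling variance v, the Q statistic is (v + tau2)/v times a chi-square variable with k - 1 degrees of freedom when the true between-study variance is tau2, so the power of the test has a closed form. The numbers below are purely illustrative and are not drawn from any synthesis discussed in this chapter.

    from scipy import stats

    def power_of_q(k, v, tau2, alpha=0.05):
        """Power of the homogeneity test with k studies of equal
        sampling variance v when the between-study variance is tau2."""
        crit = stats.chi2.isf(alpha, df=k - 1)
        return stats.chi2.sf(crit * v / (v + tau2), df=k - 1)

    # Heterogeneity equal to half the sampling variance goes
    # undetected most of the time unless the synthesis is large.
    for k in (5, 10, 20, 40, 80):
        print(k, round(power_of_q(k, v=0.04, tau2=0.02), 2))

With these inputs, power stays well below conventional standards until the synthesis contains many dozens of studies, which is exactly why a nonsignificant Q should not end the search for moderators.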
28.4.12 Lack of Statistical Power for Studying Disaggregated Groups

If there is reason to believe that treatment effects may differ across types of treatments, outcomes, persons, settings, or over time (that is, moderator effects), highly aggregated classes (for example, therapy or well-being) have to be disaggregated to examine the conditions under which an effect changes direction or magnitude. Such subgroup analyses rely on a smaller number of studies than main effect analyses and involve additional statistical tests that may necessitate procedures for type I error control, lowering the statistical power for the subanalyses in question. Consequently, the likelihood of finding statistically significant differences is reduced, even if such differences exist in the population. The flaws of this erroneous inference are compounded if a meta-analyst then concludes that an effect generalizes across subgroups of the universe because no statistically significant differences were found. Large samples of studies mitigate this problem to some extent, though statistical power in research syntheses is more complex than in primary studies. This is because power depends not only on the effect size, type I error rate, and sample size of primary study participants, but also on the number of studies and the underlying statistical model (Cohn and Becker 2003; Hedges and Pigott 2004).
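A companion calculation shows how disaggregation erodes power. The sketch below uses a normal approximation for the contrast between two subgroup mean effects under fixed-effect weighting, with an optional Bonferroni adjustment when several subgroup contrasts are tested; all inputs are hypothetical.

    import numpy as np
    from scipy import stats

    def subgroup_power(delta, v, k_per_group, n_tests=1, alpha=0.05):
        """Approximate power to detect a true difference delta between
        two subgroup mean effects, each pooling k_per_group studies of
        sampling variance v, with a Bonferroni correction for n_tests
        contrasts."""
        se = np.sqrt(2 * v / k_per_group)
        z_crit = stats.norm.isf(alpha / (2 * n_tests))
        return stats.norm.sf(z_crit - abs(delta) / se)

    # Forty studies analyzed as two subgroups of twenty, versus the
    # same literature split into smaller subgroups with three contrasts.
    print(round(subgroup_power(0.2, v=0.04, k_per_group=20), 2))
    print(round(subgroup_power(0.2, v=0.04, k_per_group=10, n_tests=3), 2))

In this illustration, halving the subgroup sizes and correcting for three contrasts cuts the power roughly in half, even though the true subgroup difference is unchanged.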
28.4.13 Misspecification of Causal Mediating Relationships

Mediating relationships are examined to shed light on the processes by which a causal agent, such as secondhand smoke, transmits its effects on an outcome, such as asthma exacerbation. They provide highly treasured explanations of otherwise merely descriptive causal associations. Instead of simply noting that a treatment affects an outcome, mediating relationships inform us why and how a treatment affects the outcome. Explanatory models rely on causal mediating relationships to justify extrapolations and to specify causal contingencies. Both are important generalization tasks.

Few meta-analyses of mediating processes exist, and those that do use quite different strategies. Monica Harris and Robert Rosenthal's meta-analysis of the mediation of interpersonal expectancy effects used the available experimental studies that had information on at least one of the four mediational links abstracted from relevant substantive theory, which were never tested together in any individual study (1985). Steven Premack and John Hunter's meta-analysis of mediation processes underlying individual decisions to form a trade union relied on combining correlation matrices from studies on subsets of variables believed to mediate a cause-effect relationship (1988). The entire mediation model had not been probed in any individual study. Betsy Becker's meta-analysis relied on the collection of individual correlations or correlation matrices to generate a combined correlation matrix (1992). That is, individual studies may contribute evidence for one or more of the different causal links postulated in the mediational model. All these approaches examine mediational processes where the data about links between variables come from within-study comparisons. However, a fourth mediational approach infers causal links from between-study comparisons. William Shadish's research on the mediation of the effects of therapeutic orientation on treatment effectiveness is a salient example (1992; Shadish and Sweeney 1991).

Although the four approaches differ considerably, they all struggle with the same practical problem: how to test mediational models involving multiple causal connections when only a few (or no) studies are available in which all the connections have been examined within the same study. A major source of threats to the meta-analytic study of mediation processes arises when the different mediational links are supported by different quantities and qualities of evidence, perhaps because of excessive missing data on some links, the operation of a predominant direction of bias for other links, or publication biases involving yet other parts of the mediational model.

On the basis of the few available meta-analyses of causal mediational processes, it seems reasonable to hypothesize that missing data are likely to be the most pervasive reason for misspecifying causal models (Cook et al. 1992). Missing data may lead the meta-analyst to omit important mediational links from the model that is actually tested and may also prevent consideration of alternative plausible models. Although there are many substantive areas with numerous studies probing a particular descriptive causal relationship, there are far fewer with as many that provide a broad range of information on mediational processes, and even fewer where two or more causal mediating models are directly compared. There are many reasons for these deficiencies. Process models are derived from substantive theory, and these theories change over time as they are improved or as new theories emerge. Obviously, the novel constructs in today's theories are not likely to have been measured well in past work. Moreover, today's researchers are reluctant to measure constructs from past theories that are now out of fashion or obsolete. The dynamism of theory development thus does not fit well the meta-analytic requirement for a stable set of mediating constructs or the simultaneous comparison of several competing substantive theories. Another relevant factor is that, even if a large number of mediational variables have been measured, they are often measured less well and documented less thoroughly than measures of cause and effect constructs. Measures of mediating variables are given less attention and are sometimes not analyzed unless a molar causal relationship has been established. Given the pressure to publish findings in a timely fashion and to write succinct rather than comprehensive reports, the examination of mediating processes is often postponed, limited to selected findings, or based on inferior measures.
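One way to see the data demands of such models is to sketch the correlation-pooling strategy in code. The fragment below, with invented correlations, n-weights each available correlation among treatment (T), mediator (M), and outcome (Y) across the studies that reported it, then derives the indirect path from the pooled matrix. It is a simplified stand-in for the generalized least squares machinery Becker describes, intended only to show how different links can rest on different subsets of studies.

    import numpy as np

    # Hypothetical correlations among treatment (T), mediator (M), and
    # outcome (Y); np.nan marks a link a study did not measure.
    # Columns: r_TM, r_TY, r_MY.
    r = np.array([
        [0.40,   0.30,   np.nan],  # study 1, n = 80
        [0.35,   np.nan, 0.45],    # study 2, n = 120
        [np.nan, 0.25,   0.50],    # study 3, n = 60
        [0.38,   0.28,   0.42],    # study 4, n = 100
    ])
    n = np.array([80.0, 120.0, 60.0, 100.0])

    # Pool each correlation over the studies reporting it, weighting by n.
    pooled = np.array([
        np.average(r[~np.isnan(r[:, j]), j], weights=n[~np.isnan(r[:, j])])
        for j in range(3)
    ])
    r_tm, r_ty, r_my = pooled

    # Standardized path coefficients for T -> M -> Y from the pooled matrix.
    a = r_tm                                   # T -> M path
    b = (r_my - r_tm * r_ty) / (1 - r_tm**2)   # M -> Y path, controlling T
    print("pooled r:", np.round(pooled, 3), "indirect effect a*b:", round(a * b, 3))

Note that each pooled correlation here rests on a different trio of studies, which is precisely the unevenness of evidence across links that the text identifies as a threat.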

28.4.14 Misspecification of Models for Extrapolation

The most challenging generalization involves inferences to new populations, treatments, outcomes, and settings. Such extrapolations to novel conditions have to rely on the available empirical evidence and the strength of the generalized inferences this evidence allows to the target universes represented in the studies. Therefore, all threats discussed up to this point for generalized inferences to target universes (28.4.1 to 28.4.7) and for robustness and moderators (28.4.8 to 28.4.13) necessarily apply to and affect extrapolations to novel constructs and universes.

Based on the available evidence, extrapolations rely on explicit and implicit models or informal heuristics to project treatment effects under yet-unstudied conditions, such as the effects of a new cancer drug in pediatric patients based on evidence from adult populations, or the effects of psychotherapy in clinical practice based on studies conducted in research settings. At issue, then, is the validity of the model used to project the effects of an off-label use of a drug (for example, based on weight, metabolism, or the developmental stage of organs) or the effects of psychotherapy in practice (for example, based on caseload of therapists, patient mix, or manualized versus eclectic therapy). Under the best conditions, these models are informed by a comprehensive understanding of the causal mediating processes and empirical evidence about the robustness of effects and their moderating conditions across a broad range of substantively relevant and irrelevant variables. To the extent that this is not the case, the threat arises that the explicit or implicit extrapolation models may be misspecified, creating the risk of incorrect projections of effects under novel conditions.

Our discussion has made it clear that extrapolations to novel conditions carry a larger degree of uncertainty than generalized inferences to universes from which particulars have been studied. In this situation, the delicate goal of the meta-analyst is to communicate the existing evidence and the models from which an extrapolation is made, the assumptions built into such models, the uncertainty of the inference, and the conditions necessary to reduce the uncertainty. For the user of a meta-analysis, it may then be possible to weigh the risks and harm of an incorrect extrapolation against the likelihood and benefits of a correct extrapolation.

28.5 CONCLUSIONS
We have argued that the unique and major promise of research syntheses lies in strengthening empirical generalizations of associations between classes of treatments and outcomes. The history of meta-analytic practice, however, has demonstrated that this promise is threatened, because meta-analysts cannot rely on the two major scientific warrants for generalizability claims, random sampling from designated populations and strong causal explanations. This chapter offers an alternative approach, a threats-to-validity framework, to explore and make a case for generalized inferences in meta-analysis when the established models cannot be applied.

Following Donald Campbell and Lee Cronbach, we distinguish between three types of generalized inferences central to meta-analyses (Campbell and Stanley 1963; Cook and Campbell 1979; Cronbach 1980). The first deals with inferences about an association in target universes of persons, treatments, outcomes, and settings based on particulars sampled from these universes. The second concerns the robustness and conditional nature of an association across heterogeneous, substantively relevant and irrelevant conditions. The third concerns extrapolating or projecting an association to novel, as-yet unstudied universes. The proposed threats-to-validity framework seeks to assist meta-analysts in exploring alternative explanations to the generalized inferences of interest. Because meta-analysts cannot realistically rely on the elegance and strength of sampling theory to warrant generalizability claims, the proposed framework offers a falsificationist approach to identify and rule out plausible alternative explanations to justify generalized inferences in meta-analyses.

Although the threats-to-validity framework makes no assumptions about random sampling or comprehensive causal explanations, it does require that we critically investigate all plausible concerns that can lead to spurious inferences and rule out each concern on grounds that it is implausible or based on evidence that it did not operate in a specific situation. Similar to the causal inferences based on Campbell's threats to validity for quasi-experimental designs, the generalized inferences based on the proposed threats to validity for meta-analyses are tentative and only as good as the critical evaluation of the plausible threats. The threats we have presented are, in large measure, a summary of what other scholars have identified since Gene Glass's pioneering work (Glass 1976; Smith, Glass, and Miller 1980). We expect the list to continue to change at the margin as new threats are identified, as current ones are seen as less relevant than we now think, and as meta-analysis is put to new purposes.

To the fledgling meta-analyst, our list of validity threats may appear daunting and the proposed remedies overwhelming. More experienced practitioners will be less intimidated by the numerous threats because they operate from an implicit theory about the differential seriousness and prevalence of these threats and can recognize when the needed substantive and technical expertise and resources are on hand. Even so, all research syntheses have to make important trade-offs between partially conflicting goals. Thus, should methodologically less rigorous studies be excluded to strengthen causal inferences if this also limits the ability to explore potential moderating conditions and the empirical robustness of effects? Should resources be allocated to code fewer study features with greater validity or to code more study features that might capture even more potentially important reasons why effect sizes differ? When should one stop searching for more relevant studies if they are increasingly fugitive, or stop trying to obtain missing data from primary study authors, or stop checking coder reliability and drift? To the experienced practitioner, the definitive research synthesis is perhaps even more of a fiction than the definitive primary study. The goal is research syntheses that answer important questions about the generalizability of an association while making explicit the limitations of the findings claimed, raising new and important questions about the boundaries of our understanding, while also setting the agenda for the next generation of research.

In the first edition of this handbook, we called for the development of a viable new theory of generalization that could guide the design of research syntheses. We believe that the principles proposed by Thomas Cook and further elaborated by William Shadish and his colleagues can serve as a starting point for such a theory (Cook 1990, 1993; Shadish et al. 2002). These principles provide practical guidelines for exploring and building arguments around a generalizability claim, for which meta-analyses can then provide empirical warrants. They are not, though, principles firmly ensconced in statistical theory, as is the case with probability sampling. But probability sampling has a limited reach across the persons, settings, times, and instances of the cause and effect that are necessarily involved in generalizing from sampling details to causal claims. For all its undisputed theoretical elegance, probability sampling is most relevant to generalizing to populations of units and, sometimes, to settings. Its practical relevance to the other entities involved in causal generalization is much less clear. Although significant progress has been made since modern research synthesis methods were introduced more than thirty years ago, significant additional progress is needed to achieve a more complete understanding of how research syntheses can achieve even better causal generalization. The necessary condition for this is the existence of one or more comprehensive and internally cogent theories of causal generalization. But this we do not yet have in any form that leads to novel practical actions when conducting meta-analyses.
28.6 REFERENCES

Becker, Betsy J. 1992. "Models of Science Achievement: Forces Affecting Male and Female Performance in School Science." In Meta-Analysis for Explanation: A Casebook, edited by Thomas D. Cook, Harris Cooper, David S. Cordray, Heidi Hartmann, Larry V. Hedges, Richard J. Light, Thomas A. Louis, and Frederick Mosteller. New York: Russell Sage Foundation.
Becker, Betsy J. 2001. "Examining Theoretical Models Through Research Synthesis: The Benefits of Model-Driven Meta-Analysis." Evaluation and the Health Professions 24(2): 190–217.
Boruch, Robert F. 2005. "Comments on 'Can the Federal Government Improve Education Research?' by Brian Jacob and Jens Ludwig." Brookings Papers on Education Policy 2005(1): 672–80.
Campbell, Donald T. 1957. "Factors Relevant to the Validity of Experiments in Social Settings." Psychological Bulletin 54(4): 297–312.
Campbell, Donald T. 1986. "Relabeling Internal and External Validity for Applied Social Scientists." In Advances in Quasi-Experimental Design and Analysis, edited by William M. K. Trochim. San Francisco: Jossey-Bass.
Campbell, Donald T., and Robert F. Boruch. 1975. "Making the Case for Randomized Assignment to Treatments by Considering the Alternatives: Six Ways in Which Quasi-Experimental Evaluations Tend to Underestimate Effects." In Evaluation and Experiment: Some Critical Issues in Assessing Social Programs, edited by C. A. Bennett and A. A. Lumsdaine. New York: Academic Press.
Campbell, Donald T., and Julian C. Stanley. 1963. Experimental and Quasi-Experimental Designs for Research. Chicago: Rand McNally.
Chalmers, Thomas C., P. Celano, Henry S. Sacks, and H. Smith Jr. 1983. "Bias in Treatment Assignment in Controlled Clinical Trials." New England Journal of Medicine 309(22): 1358–61.
Chalmers, Thomas C., Peg Hewett, Dinah Reitman, and Henry S. Sacks. 1989. "Selection and Evaluation of Empirical Research in Technology Assessment." International Journal of Technology Assessment in Health Care 5(4): 521–36.
Chalmers, Thomas C., and Joseph Lau. 1992. "Meta-Analysis of Randomized Control Trials Applied to Cancer Therapy." In Important Advances in Oncology, 1992, edited by Vincent T. DeVita, Samuel Hellman, and Steven A. Rosenberg. Philadelphia, Pa.: Lippincott.
Cohn, Lawrence D., and Betsy J. Becker. 2003. "How Meta-Analysis Increases Statistical Power." Psychological Methods 8(3): 243–53.
Collingwood, Robin G. 1940. An Essay on Metaphysics. Oxford: Clarendon Press.
Cook, Thomas D. 1990. "The Generalization of Causal Connections." In Research Methodology: Strengthening Causal Interpretations of Nonexperimental Data, edited by Lee Sechrest, E. Perrin, and J. Bunker. Washington, D.C.: U.S. Department of Health and Human Services.
Cook, Thomas D. 1993. "Understanding Causes and Generalizing About Them." In New Directions in Program Evaluation, edited by Lee B. Sechrest and A. G. Scott. San Francisco: Jossey-Bass.
Cook, Thomas D., and Donald T. Campbell. 1979. Quasi-Experimentation: Design and Analysis Issues for Field Settings. Chicago: Rand McNally.
Cook, Thomas D., Harris Cooper, David S. Cordray, Heidi Hartmann, Larry V. Hedges, Richard J. Light, Thomas A. Louis, and Frederick Mosteller, eds. 1992. Meta-Analysis for Explanation: A Casebook. New York: Russell Sage Foundation.
Cronbach, Lee J. 1980. Toward Reform of Program Evaluation. San Francisco: Jossey-Bass.
Cronbach, Lee J. 1982. Designing Evaluations of Educational and Social Programs. San Francisco: Jossey-Bass.
Devine, Elizabeth C. 1992. "Effects of Psychoeducational Care with Adult Surgical Patients: A Theory-Probing Meta-Analysis of Intervention Studies." In Meta-Analysis for Explanation: A Casebook, edited by Thomas D. Cook, Harris Cooper, David S. Cordray, Heidi Hartmann, Larry V. Hedges, Richard J. Light, Thomas A. Louis, and Frederick Mosteller. New York: Russell Sage Foundation.
Donner, Allan, Gilda Piaggio, and José Villar. 2001. "Statistical Methods for the Meta-Analysis of Cluster Randomization Trials." Statistical Methods in Medical Research 10(5): 325–38.
Donner, Allan, Gilda Piaggio, and José Villar. 2003. "Meta-Analyses of Cluster Randomization Trials: Power Considerations." Evaluation and the Health Professions 26(3): 340–51.
Duval, Sue, and Richard Tweedie. 2000. "Trim and Fill: A Simple Funnel-Plot-Based Method of Testing and Adjusting for Publication Bias in Meta-Analysis." Biometrics 56(2): 455–63.
Eddy, David M., Vic Hasselblad, and Ross Shachter. 1990. "An Introduction to a Bayesian Method for Meta-Analysis: The Confidence Profile Method." Medical Decision Making 10(1): 15–23.
Egger, Matthias, George Davey Smith, Martin Schneider, and Christoph Minder. 1997. "Bias in Meta-Analysis Detected by a Simple, Graphical Test." British Medical Journal 315(7109): 629–34.
Furlow, Carolyn F., and S. Natasha Beretvas. 2005. "Meta-Analytic Methods of Pooling Correlation Matrices for Structural Equation Modeling Under Different Patterns of Missing Data." Psychological Methods 10(2): 227–54.
Gasking, Douglas. 1955. "Causation and Recipes." Mind 64: 479–87.
Glass, Gene V. 1976. "Primary, Secondary, and Meta-Analysis of Research." Educational Researcher 5(10): 3–8.
Glass, Gene V. 1977. "Integrating Findings: The Meta-Analysis of Research." Review of Research in Education 5(1): 351–79.
Greenwald, Anthony G. 1975. "Consequences of Prejudice Against the Null Hypothesis." Psychological Bulletin 82(1): 1–20.
Hardy, Rebecca J., and Simon G. Thompson. 1998. "Detecting and Describing Heterogeneity in Meta-Analysis." Statistics in Medicine 17(8): 841–56.
Harris, Monica J., and Robert Rosenthal. 1985. "Mediation of Interpersonal Expectancy Effects: 31 Meta-Analyses." Psychological Bulletin 97(3): 363–86.
Harwell, Michael. 1997. "An Empirical Study of Hedges' Homogeneity Test." Psychological Methods 2(2): 219–31.
Hedges, Larry V. 1982. "Estimation of Effect Sizes from a Series of Independent Experiments." Psychological Bulletin 92(2): 490–99.
Hedges, Larry V. 1990. "Directions for Future Methodology." In The Future of Meta-Analysis, edited by Kenneth W. Wachter and Miron L. Straf. New York: Russell Sage Foundation.
Hedges, Larry V., and Betsy Jane Becker. 1986. "Statistical Methods in the Meta-Analysis of Research on Gender Differences." In The Psychology of Gender: Advances Through Meta-Analysis, edited by Janet Shibley Hyde and Marcia C. Linn. Baltimore, Md.: Johns Hopkins University Press.
Hedges, Larry V., and Ingram Olkin. 1985. Statistical Methods for Meta-Analysis. Orlando, Fla.: Academic Press.
Hedges, Larry V., and Therese D. Pigott. 2001. "The Power of Statistical Tests in Meta-Analysis." Psychological Methods 6(3): 203–17.
Hedges, Larry V., and Therese D. Pigott. 2004. "The Power of Statistical Tests for Moderators in Meta-Analysis." Psychological Methods 9(4): 426–45.
Hedges, Larry V., and Jack L. Vevea. 1996. "Estimating Effect Size Under Publication Bias: Small Sample Properties and Robustness of a Random Effects Selection Model." Journal of Educational and Behavioral Statistics 21(4): 299–332.
Hox, Joop J., and Edith D. de Leeuw. 2003. "Multilevel Models for Meta-Analysis." In Multilevel Modeling: Methodological Advances, Issues, and Applications, edited by Steven P. Reise and Naihua Duan. Mahwah, N.J.: Lawrence Erlbaum.
Hunter, John E., and Frank L. Schmidt. 1990. Methods of Meta-Analysis: Correcting Error and Bias in Research Findings. Newbury Park, Calif.: Sage Publications.
Jackson, Dan. 2006. "The Power of the Standard Test for the Presence of Heterogeneity in Meta-Analysis." Statistics in Medicine 25(15): 2688–99.
Jacob, Brian, and Jens Ludwig. 2005. "Can the Federal Government Improve Education Research?" Brookings Papers on Education Policy 2005(1): 47–87.
Kelley, George A., Kristi S. Kelley, and Zung Vu Tran. 2004. "Retrieval of Missing Data for Meta-Analysis: A Practical Example." International Journal of Technology Assessment in Health Care 20(3): 296–99.
Lipsey, Mark W. 1992. "Juvenile Delinquency Treatment: A Meta-Analytic Inquiry into the Variability of Effects." In Meta-Analysis for Explanation: A Casebook, edited by Thomas D. Cook, Harris Cooper, David S. Cordray, Heidi Hartmann, Larry V. Hedges, Richard J. Light, Thomas A. Louis, and Frederick Mosteller. New York: Russell Sage Foundation.
Lipsey, Mark W. 2003. "Those Confounded Moderators in Meta-Analysis: Good, Bad, and Ugly." Annals of the American Academy of Political and Social Science 587(May, Special Issue): 69–81.
Lipsey, Mark W., and David B. Wilson. 2001. Practical Meta-Analysis. Applied Social Research Methods Series, vol. 49. Thousand Oaks, Calif.: Sage Publications.
Little, Roderick J. A., and Donald B. Rubin. 2002. Statistical Analysis with Missing Data, 2nd ed. New York: John Wiley & Sons.
Mackie, John L. 1974. The Cement of the Universe: A Study of Causation. Oxford: Oxford University Press.
Matt, Georg E. 1989. "Decision Rules for Selecting Effect Sizes in Meta-Analysis: A Review and Reanalysis of Psychotherapy Outcome Studies." Psychological Bulletin 105(1): 106–15.
Matt, Georg E. 2003. "Will It Work in Münster? Meta-Analysis and the Empirical Generalization of Causal Relationships." In Meta-Analysis, edited by H. Holling, D. Böhning, and R. Schulze. Göttingen: Hogrefe and Huber.
Matt, Georg E. 2005. "Uncertainty in the Multivariate World: A Fuzzy Look Through Brunswik's Lens." In Multivariate Research Strategies: A Festschrift in Honor of Werner W. Wittmann, edited by André Beauducel, Bernhard Biehl, Michael Bosnjak, Wolfgang Conrad, Gisela Schönberger, and Dietrich Wagener. Maastricht: Shaker Publishing.
Matt, Georg E., and Thomas D. Cook. 1994. "Threats to the Validity of Research Syntheses." In The Handbook of Research Synthesis, edited by Harris Cooper and Larry V. Hedges. New York: Russell Sage Foundation.
Mick, Eric, Joseph Biederman, Gahan Pandina, and Stephen V. Faraone. 2003. "A Preliminary Meta-Analysis of the Child Behavior Checklist in Pediatric Bipolar Disorder." Biological Psychiatry 53(11): 1021–27.
Moyer, Anne, John W. Finney, Carolyn E. Swearingen, and Pamela Vergun. 2002. "Brief Interventions for Alcohol Problems: A Meta-Analytic Review of Controlled Investigations in Treatment-Seeking and Non-Treatment-Seeking Populations." Addiction 97(3): 279–92.
Nam, In-Sun, Kerrie Mengersen, and Paul Garthwaite. 2003. "Multivariate Meta-Analysis." Statistics in Medicine 22(14): 2309–33.
Orwin, Robert G., and David S. Cordray. 1985. "The Effects of Deficient Reporting on Meta-Analysis: A Conceptual Framework and Reanalysis." Psychological Bulletin 97(1): 134–47.
Premack, Steven L., and John E. Hunter. 1988. "Individual Unionization Decisions." Psychological Bulletin 103(2): 223–34.
Prevost, Teresa C., Keith R. Abrams, and David R. Jones. 2000. "Hierarchical Models in Generalized Synthesis of Evidence: An Example Based on Studies of Breast Cancer Screening." Statistics in Medicine 19(24): 3359–76.
Raudenbush, Stephen W., Betsy J. Becker, and Hripsime Kalaian. 1988. "Modeling Multivariate Effect Sizes." Psychological Bulletin 103(1): 111–20.
Raudenbush, Stephen W., and Anthony S. Bryk. 1985. "Empirical Bayes Meta-Analysis." Journal of Educational Statistics 10(2): 75–98.
Robinson, Leslie A., Jeffrey S. Berman, and Robert A. Neimeyer. 1990. "Psychotherapy for the Treatment of Depression: A Comprehensive Review of Controlled Outcome Research." Psychological Bulletin 108(1): 30–49.
Rosch, Eleanor. 1973. "Natural Categories." Cognitive Psychology 4: 328–50.
Rosch, Eleanor. 1978. "Principles in Categorization." In Cognition and Categorization, edited by Eleanor Rosch and Barbara B. Lloyd. Hillsdale, N.J.: Lawrence Erlbaum.
Rosenthal, Robert. 1979. "The File Drawer Problem and Tolerance for Null Results." Psychological Bulletin 86(3): 638–41.
Rosenthal, Robert. 1990. "An Evaluation of Procedure and Results." In The Future of Meta-Analysis, edited by Kenneth W. Wachter and Miron L. Straf. New York: Russell Sage Foundation.
Rosenthal, Robert. 1991. Meta-Analytic Procedures for Social Research, 2nd ed. Newbury Park, Calif.: Sage Publications.
Rosenthal, Robert, and Donald B. Rubin. 1982. "Comparing Effect Sizes of Independent Studies." Psychological Bulletin 92(2): 500–4.
Rosenthal, Robert, and Donald B. Rubin. 1986. "Meta-Analytic Procedures for Combining Studies with Multiple Effect Sizes." Psychological Bulletin 99(3): 400–6.
Rothstein, Hannah R., Alexander J. Sutton, and Michael Borenstein, eds. 2005. Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments. New York: John Wiley & Sons.
Rubin, Donald B. 1987. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.
Sacks, H. S., J. Berrier, D. Reitman, V. A. Ancona-Berk, and T. C. Chalmers. 1987. "Meta-Analyses of Randomized Controlled Trials." The New England Journal of Medicine 316(8): 450–55.
Saleh, A. K. Ehsanes, Khatab M. Hassanein, Ruth S. Hassanein, and H. M. Kim. 2006. "Quasi-Empirical Bayes Methodology for Improving Meta-Analysis." Journal of Biopharmaceutical Statistics 16(1): 77–90.
Sánchez-Meca, Julio, Fulgencio Marín-Martínez, and Salvador Chacón-Moscoso. 2003. "Effect-Size Indices for Dichotomized Outcomes in Meta-Analysis." Psychological Methods 8(4): 448–67.
Schulze, Ralf. 2004. Meta-Analysis: A Comparison of Approaches. New York: Hogrefe and Huber.
Scott, J. Cobb, Steven Paul Woods, Georg E. Matt, Rachel A. Meyer, Robert K. Heaton, J. Hampton Atkinson, and Igor Grant. 2007. "Neurocognitive Effects of Methamphetamine: A Critical Review and Meta-Analysis." Neuropsychology Review 17(3): 275–97.
Shadish, William R. 1992. "Do Family and Marital Therapies Change What People Do? A Meta-Analysis of Behavioral Outcomes." In Meta-Analysis for Explanation: A Casebook, edited by Thomas D. Cook, Harris Cooper, David S. Cordray, Heidi Hartmann, Larry V. Hedges, Richard J. Light, Thomas A. Louis, and Frederick Mosteller. New York: Russell Sage Foundation.
Shadish, William R., Thomas D. Cook, and Donald T. Campbell. 2002. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston, Mass.: Houghton Mifflin.
Shadish, William R., Xiangen Hu, Renita R. Glaser, Richard Kownacki, and Seok Wong. 1998. "A Method for Exploring the Effects of Attrition in Randomized Experiments with Dichotomous Outcomes." Psychological Methods 3(1): 3–22.
Shadish, William R., Jason K. Luellen, and M. H. Clark. 2006. "Propensity Scores and Quasi-Experiments: A Testimony to the Practical Side of Lee Sechrest." In Strengthening Research Methodology: Psychological Measurement and Evaluation, edited by Richard Bootzin and Patrick E. McKnight. Washington, D.C.: American Psychological Association.
Shadish, William R., Georg E. Matt, Ana M. Navarro, and Glenn Phillips. 2000. "The Effects of Psychological Therapies Under Clinically Representative Conditions: A Meta-Analysis." Psychological Bulletin 126(4): 512–29.
Shadish, William R., and Rebecca B. Sweeney. 1991. "Mediators and Moderators in Meta-Analysis: There's a Reason We Don't Let Dodo Birds Tell Us Which Psychotherapies Should Have Prizes." Journal of Consulting and Clinical Psychology 59(6): 883–93.
Shapiro, David A., and Diana Shapiro. 1982. "Meta-Analysis of Comparative Therapy Outcome Studies: A Replication and Refinement." Psychological Bulletin 92(3): 581–604.
Sinharay, Sandip, Hal S. Stern, and Daniel Russell. 2001. "The Use of Multiple Imputation for the Analysis of Missing Data." Psychological Methods 6(4): 317–29.
Slavin, Robert E. 1986. "Best-Evidence Synthesis: An Alternative to Meta-Analytic and Traditional Reviews." Educational Researcher 15(9): 5–11.
Smith, Edward E., and Douglas L. Medin. 1981. Categories and Concepts. Cambridge, Mass.: Harvard University Press.
Smith, Mary Lee, Gene V. Glass, and Thomas I. Miller. 1980. The Benefits of Psychotherapy. Baltimore, Md.: Johns Hopkins University Press.
Sterne, Jonathan A. C., and Matthias Egger. 2001. "Funnel Plots for Detecting Bias in Meta-Analysis: Guidelines on Choice of Axis." Journal of Clinical Epidemiology 54(10): 1046–55.
Sutton, Alexander J. 2000. Methods for Meta-Analysis in Medical Research. Wiley Series in Probability and Mathematical Statistics. New York: John Wiley & Sons.
Sutton, Alexander J., and Keith R. Abrams. 2001. "Bayesian Methods in Meta-Analysis and Evidence Synthesis." Statistical Methods in Medical Research 10(4): 277–303.
Sutton, Alexander J., Susan J. Duval, Richard L. Tweedie, Keith R. Abrams, and David R. Jones. 2000. "Empirical Assessment of Effect of Publication Bias on Meta-Analyses." British Medical Journal 320(7249): 1574–77.
Sutton, Alexander J., Fujian Song, Simon M. Gilbody, and Keith R. Abrams. 2000. "Modelling Publication Bias in Meta-Analysis: A Review." Statistical Methods in Medical Research 9(5): 421–45.
Tabarrok, Alexander T. 2000. "Assessing the FDA via the Anomaly of Off-Label Drug Prescribing." The Independent Review V(1): 25–53.
Twenge, Jean M., and W. Keith Campbell. 2001. "Age and Birth Cohort Differences in Self-Esteem: A Cross-Temporal Meta-Analysis." Personality and Social Psychology Review 5(4): 321–44.
van Houwelingen, Hans C., Lidia R. Arends, and Theo Stijnen. 2002. "Advanced Methods in Meta-Analysis: Multivariate Approach and Meta-Regression." Statistics in Medicine 21(4): 589–624.
Vevea, Jack L., and Larry V. Hedges. 1995. "A General Linear Model for Estimating Effect Size in the Presence of Publication Bias." Psychometrika 60(3): 419–35.
Vevea, Jack L., and Carol M. Woods. 2005. "Publication Bias in Research Synthesis: Sensitivity Analysis Using A Priori Weight Functions." Psychological Methods 10(4): 428–43.
Vevea, Jack L., Nancy C. Clements, and Larry V. Hedges. 1993. "Assessing the Effects of Selection Bias on Validity Data for the General Aptitude Test Battery." Journal of Applied Psychology 78(6): 981–87.
Viechtbauer, Wolfgang. 2007. "Confidence Intervals for the Amount of Heterogeneity in Meta-Analysis." Statistics in Medicine 26(1): 37–52.
Webb, Eugene J., Donald T. Campbell, Richard D. Schwartz, Lee Sechrest, and Janet B. Grove. 1981. Nonreactive Measures in the Social Sciences, 2nd ed. Chicago: Rand McNally.
West, Stephen G., Jeremy C. Biesanz, and Steven C. Pitts. 2000. "Causal Inference and Generalization in Field Settings: Experimental and Quasi-Experimental Designs." In Handbook of Research Methods in Social and Personality Psychology, edited by H. T. Reis and C. M. Judd. New York: Cambridge University Press.
Whitbeck, Caroline. 1977. "Causation in Medicine: The Disease Entity Model." Philosophy of Science 44(4): 619–37.
Wittmann, Werner W., and Georg E. Matt. 1986. "Meta-Analyse als Integration von Forschungsergebnissen am Beispiel deutschsprachiger Arbeiten zur Effektivität von Psychotherapie." Psychologische Rundschau 37(1): 20–40.
Wolf, Frederic M. 1990. "Methodological Observations on Bias." In The Future of Meta-Analysis, edited by Kenneth W. Wachter and Miron L. Straf. New York: Russell Sage Foundation.
29
POTENTIALS AND LIMITATIONS

HARRIS COOPER
Duke University

LARRY V. HEDGES
Northwestern University

CONTENTS

29.1 Introduction 562
29.2 Unique Contributions of Research Synthesis 562
29.2.1 Increased Precision and Reliability 562
29.2.2 Testing Generalization and Moderating Influences 563
29.2.3 Asking Questions about the Scientific Enterprise 563
29.3 Limitations of Research Synthesis 563
29.3.1 Correlational Nature of Synthesis Evidence 563
29.3.2 Post Hoc Nature of Synthesis Tests 563
29.3.3 Need for New Primary Research 564
29.4 Developments in Research Synthesis Methodology 564
29.4.1 Improving Research Synthesis Databases 564
29.4.2 Improving Analytic Procedures 565
29.4.2.1 Missing Data 565
29.4.2.2 Publication Bias 565
29.4.2.3 Fixed- Versus Random-Effects Models 566
29.4.2.4 Bayesian Methods 567
29.4.2.5 Causal Modeling in Research Synthesis 567
29.5 Criteria for Judging the Quality of Research Syntheses 568
29.6 References 571
29.1 INTRODUCTION

We suspect that readers of this volume have reacted to it in one of two ways: they have been overwhelmed by the number and complexity of the issues that face a research synthesist or they have been delighted to have a manual to help them through the synthesis process. We further suspect that which reaction it was depended on whether the book was read while thinking about or while actually performing a synthesis. In the abstract, the concerns raised in this handbook must seem daunting. Concretely, however, research syntheses have been carried out for decades and will be for decades to come, and the problems encountered in their conduct do not really go away when they are ignored. Knowledge builders need a construction manual to accompany their blueprints and bricks.

Further, there is no reason to believe that carrying out a sound research synthesis is any more complex than carrying out sound primary research. The analogies between synthesis and survey research, content analysis, or even quasi-experimentation fit too well for this not to be the case. Certainly, each type of research has its unique characteristics. Still, when research synthesis procedures have become as familiar as primary research procedures, much of what now seems overwhelming will come to be viewed as difficult, but manageable.

Likewise, expectations will change. As we mentioned in chapter 1, we do not expect a primary study to provide unequivocal answers to research questions, nor do we expect it to be performed without flaw. Research syntheses are the same. The perfect synthesis does not and never will exist. Synthesists who set standards that require them to implement all of the most rigorous techniques described in this book are bound to be disappointed. Time and other resources (not to mention the logic of discovery) will prevent such an accomplishment. The synthesist, like the primary researcher, must balance methodological rigor against practical feasibility.

29.2 UNIQUE CONTRIBUTIONS OF RESEARCH SYNTHESIS

29.2.1 Increased Precision and Reliability

Some comparisons were made between primary research and research synthesis. It was suggested that the two forms of study have much in common. In contrast, when we compare research synthesis as it was practiced before meta-analytic techniques were popularized, the dissimilarities are what capture attention. Perhaps most striking are the improvements in precision and reliability that new techniques have made possible.

Precision in literature searches has been improved dramatically by the procedures described here. Yesterday's synthesist attempted to uncover a literature by examining a small journal network and making a few telephone calls. This strategy is no longer viable. The proliferation of journals and e-journals and the increase in the sheer quantity of research renders the old strategy inadequate. Today, the computer revolution, the development and maintenance of comprehensive research registries, and the emergence of the Internet mean that a synthesist can reach into thousands of journals and individual researchers' data files. The present process is far from perfect, of course, but it is an enormous improvement over past practice.

The coding of studies and their evaluation for quality have also come a long way. The old process was rife with potential bias. In 1976, Gene Glass wrote this:

    A common method for integrating several studies with inconsistent findings is to carp on the design or analysis deficiencies of all but a few studies, those remaining frequently being one's own work or that of one's students and friends, and then advancing the one or two acceptable studies as the truth of the matter. (4)

Today, research synthesists code their studies with considerably more care to avoid bias. Explicit, well-defined coding frames are used by multiple coders. Judgments about quality of research are sometimes eschewed entirely, allowing the data to determine if research design influences study outcomes.

Finally, the process of integrating and contrasting primary study results has taken a quantum leap forward. Gone are the days when a vote count of significant results decided an issue. It is no longer acceptable to string together paragraph descriptions of studies, with details of significance tests, and then conclude that the data are inconsistent, but seem to indicate. Today's synthesist can provide confidence intervals around effect size estimates for separate parts of a literature distinguished by both methodological and theoretical criteria, calculated several different ways using different assumptions about the adequacy of the literature search and coding scheme, and alternative statistical models.
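For readers who want to see what such an estimate looks like in practice, here is a minimal Python sketch with invented data: an inverse-variance weighted mean and confidence interval computed under a fixed-effect model and again under a random-effects model, with the between-study variance estimated by the DerSimonian-Laird moment method.

    import numpy as np
    from scipy import stats

    # Hypothetical study effect sizes and sampling variances.
    d = np.array([0.31, 0.12, 0.55, 0.20, 0.42, 0.05])
    v = np.array([0.02, 0.05, 0.04, 0.03, 0.06, 0.02])

    def mean_ci(d, v, tau2=0.0, level=0.95):
        """Weighted mean and CI; tau2 > 0 gives the random-effects case."""
        w = 1.0 / (v + tau2)
        m = np.sum(w * d) / np.sum(w)
        half = stats.norm.isf((1 - level) / 2) / np.sqrt(np.sum(w))
        return m, m - half, m + half

    # DerSimonian-Laird moment estimate of the between-study variance.
    w = 1.0 / v
    m_fe = np.sum(w * d) / np.sum(w)
    Q = np.sum(w * (d - m_fe) ** 2)
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (Q - (len(d) - 1)) / c)

    print("fixed effect:   %.3f (%.3f, %.3f)" % mean_ci(d, v))
    print("random effects: %.3f (%.3f, %.3f)" % mean_ci(d, v, tau2))

Reporting both intervals, as the text recommends, makes visible how much the conclusion depends on the statistical model assumed.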
These are only a few of the ways that the precision and reliability of current syntheses outstrip those of their predecessors. The increases in precision and reliability that a synthesist can achieve when a body of literature grows from two to twenty to two hundred studies were squandered with yesterday's methods.

29.2.2 Testing Generalization and Moderating Influences

Current methods not only capitalize on accumulating evidence by making estimates more precise, but also permit the testing of hypotheses that may have never been tested in primary studies. This advantage is most evident when the objective of the synthesist is to establish the generality or specificity of results. The ability of a synthesist to test whether a finding applies across settings, times, people, and researchers surpasses by far the ability of most primary studies. However, for yesterday's synthesist, the variation among studies was a nuisance. It clouded interpretation. The old methods of synthesis were handicapped severely when it came to judgments of whether a set of studies revealed consistent results and, if not, what might account for the inconsistency. Yesterday's synthesist often chose the wrong datum (significance tests rather than effect sizes) and the wrong evaluation criteria (implicit cognitive algebra rather than formal statistical tests) on which to base these judgments. For today's synthesist, variety is the spice of life. The methods described in this handbook make such analyses routine. When the outcomes of studies prove too discrepant to support a claim that one estimate of a relationship underlies them all, current techniques provide the synthesist with a way of systematically searching for moderating influences, using consistent and explicit rules of evidence.

29.2.3 Asking Questions About the Scientific Enterprise

Finally, the practice of research synthesis permits scientists to study their own enterprise. Questions can be asked about trends in research over time, both trends in results and trends in how research is carried out. Kenneth Ottenbacher and Harris Cooper, for example, examined the effects of special versus regular class placement on the social adjustment of children with mental handicaps (1984). They found that studies of social adjustment conducted after 1968, when mainstreaming became the law, favored regular class placement, whereas studies before 1968 favored special classes.

Synthesists often uncover results that raise interesting questions about the relationship between research and the general societal conditions in which it takes place. By facilitating this form of self-analysis, current synthesis methods can be used to advance not only particular problem areas but also entire disciplines, by addressing more general questions in the psychology and sociology of science.

29.3 LIMITATIONS OF RESEARCH SYNTHESIS

29.3.1 Correlational Nature of Synthesis Evidence

Research syntheses have limitations as well. Among the most critical is the correlational nature of synthesis-generated evidence. As detailed in chapter 2, study-generated evidence appears when individual studies contain results that directly test the relation under consideration. Synthesis-generated evidence appears when the results of studies using different procedures to test the same hypothesis are compared. As Harris Cooper notes in chapter 2, only study-generated evidence can be used to support claims about causal relations, and only when experimental designs are used in them. Specific confounds can be controlled statistically at the level of synthesis-generated evidence, but the result can never lead to the same confidence in inferences produced by study-generated evidence from investigations employing random assignment. This is an inherent limitation of research syntheses. It also provides the synthesist with a fruitful source of suggestions concerning directions for future primary research.

29.3.2 Post Hoc Nature of Synthesis Tests

Researchers interested in conducting syntheses almost always set upon the process with some, if not considerable, knowledge of the empirical data base they are about to summarize. They begin with a good notion of what the literature says about which interventions work, which moderators operate, and which explanatory variables are helpful. This being so, the synthesist cannot state a hypothesis so derived and then also conclude that the same evidence supports the hypothesis. As Kenneth Wachter and Miron Straf pointed out, once data have been used to develop a theory they cannot be used to test it (1990). Synthesists who wish to test specific, a priori hypotheses in meta-analysis must go to extra lengths to convince their audience that the generation of the hypothesis and the data used to test it are truly independent.
29.3.3 Need for New Primary Research

Each of these limitations serves to underscore an obvious point: a research synthesis should never be considered a replacement for new primary research. Primary research and research synthesis are complementary parts of a common practice; they are not competing alternatives. The additional precision and reliability brought to syntheses by the procedures described in this text do not make research integrations substitutes for new data-gathering efforts. Instead, they help ensure that the next wave of primary research is sent off in the most illuminating direction. A synthesis that concludes a problem is solved, that no future research is needed, should be viewed with extreme skepticism. Even the best architects can see how the structures they have built can be improved.

29.4 DEVELOPMENTS IN RESEARCH SYNTHESIS METHODOLOGY

29.4.1 Improving Research Synthesis Databases

Two problems that confront research synthesists recurred throughout this text. They concern the comprehensiveness of literature searches, especially as searches are influenced by publication bias, and missing data in research reports. Each of these issues was treated in its own chapter, but many authors felt the need to consider their impact in the context of other tasks the synthesist carried out. The correspondence between the studies that can be accessed by the synthesist and the target population of studies is an issue confronted during data collection, analysis, and interpretation. Missing data is an issue in data coding, analysis, and interpretation.

There is little need to reiterate the frustration synthesists feel when they contemplate the possible biases in their conclusions caused by studies and results they cannot retrieve. Instead, we note that the first edition of this book made a call for improvements in research synthesis procedures related to ways to restructure the scientific information delivery systems so that more of the needed information becomes available to the research synthesist. This has occurred, as have improvements in techniques for estimating and addressing problems in missing data.

Still, the problem is far from solved. In the first edition of this handbook, we suggested that research synthesists should call for an upgrading of standards for reporting of primary research results in professional journals. We also suggested that journals need to make space for authors to report the means and standard deviations for all conditions, for the reliability of measures (and for full texts of measures when they are not previously published), and for range restrictions in samples. Further, as primary statistical analyses become more complex, the requirements of adequate reporting will become more burdensome. Although means and standard deviations are adequate for simple analyses, full correlation matrices (or even separate correlation matrices for each of several groups studied) may also be required for adequate reporting of multivariate analyses. Reporting this detailed information in journals, however, is less feasible because it would require even more journal space. Yet primary analyses are not fully interpretable (and synthesis is certainly not possible) without them. Therefore, meta-analysts need some system whereby authors submit extended results sections along with their manuscripts and these sections are made available on request.
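The payoff from full reporting is easy to demonstrate: when a study reports group means, standard deviations, and sample sizes, the synthesist can recover a standardized mean difference and its variance directly. The sketch below uses one common set of large-sample formulas (the small-sample-corrected standardized mean difference often called Hedges's g); the summary statistics are invented.

    import numpy as np

    def smd(m1, sd1, n1, m2, sd2, n2):
        """Standardized mean difference (with the usual small-sample
        correction) and its approximate sampling variance, computed
        from the summary statistics a fully reported study provides."""
        sp = np.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
        g = (1 - 3 / (4 * (n1 + n2) - 9)) * (m1 - m2) / sp
        var_g = (n1 + n2) / (n1 * n2) + g**2 / (2 * (n1 + n2))
        return g, var_g

    g, var_g = smd(m1=103.4, sd1=14.2, n1=40, m2=98.1, sd2=15.0, n2=42)
    print(round(g, 3), round(var_g, 4))

When any one of these six numbers is omitted from a report, the study either drops out of the synthesis or must be reconstructed from whatever test statistics remain.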
Two recent developments have greatly improved the availability of comprehensive descriptions of primary research, both of which go considerably beyond our proposals. The first involves the establishment of standards for reporting of studies in scientific journal articles. For example, spurred on by a concern regarding the adequacy of reporting of randomized clinical trials in medicine and health care, the Consolidated Standards for Reporting Trials, or CONSORT, statement was developed by a group of international clinical trialists, statisticians, epidemiologists, and biomedical editors. It has since been adopted as a reporting standard for numerous journals in medicine and public health and by a growing number of journals in the social sciences. CONSORT consists of a checklist and flow diagram for reporting studies that employ random assignment of participants to two treatment groups. Data collected by the CONSORT Group do indicate that adoption of the standards improves the reporting of study design and implementation (Moher, Schulz, and Altman 2001).

The CONSORT statement was shortly followed by an effort to improve the reporting of quasi-experiments. Brought together by the Centers for Disease Control, a group of editors of journals that published HIV behavioral intervention studies developed and proposed the use of a checklist they named TREND (Transparent Reporting of Evaluations with Nonrandomized Designs) (Des Jarlais, Lyons, and Crepaz 2004). The TREND checklist consists of twenty-two items. Its impact on reporting practices is yet to be ascertained.
The American Educational Research Association (AERA) has recently provided its own set of standards for reporting on empirical social science research (AERA 2006). These standards are meant to encompass the broadest range of research approaches, from experiments using random assignment to qualitative investigations. Finally, the American Psychological Association offered a set of reporting recommendations for psychology (APA Publications and Communications Board Working Group on Journal Article Reporting Standards 2008).

The second development with enormous potential for resolving the problem of missing data is the establishment of auxiliary web sites by journals on which authors archive information about their research that could not be included in journal reports because of space limitations. For example, the American Psychological Association provides its journals with websites for this purpose. The sites are referred to in the published articles and access is free to anyone who reads the journal.

Journal reporting standards and auxiliary data storage, however, are only measures for addressing the problem of missing data. They do not address the issue of publication bias. A partial solution to publication bias is the continued development of research registries. Because research registers attempt to catalog investigations both when they are initiated and when they are completed, they are a unique opportunity for overcoming publication bias. In addition, they may be able to provide meta-analyses with more thorough reports of study results.

Of course, registers are not a panacea. Registers are helpful only if the researchers who register their studies keep and make available the results of their studies to synthesists who request them. Thus we renew our call in the earlier edition for professional standards that promote data sharing of even unpublished research findings.

Still, forward-looking researchers, publishers, and funding agencies would be well advised to look toward adopting reporting standards and creating auxiliary websites and research registers as some of the most effective ways to foster the accumulation of knowledge. Given the pervasiveness of the publication bias and missing data problems, these efforts seem to hold considerable promise for helping research synthesists find the missing science.

29.4.2 Improving Analytic Procedures

Many aspects of analytic procedures in research synthesis have undergone rapid development in the last two decades. However, there are still many areas in which the development of new statistical methods or the adaptation of methods used elsewhere are needed. Perhaps as important as the addition of new methods is the wise use of the considerable repertoire of analytic methods already available to the research synthesist. We mention here a few analytic issues that arise repeatedly throughout this handbook. The first two relate to the database problems just mentioned.

29.4.2.1 Missing Data  Perhaps the most pervasive practical problem in research synthesis is that of missing data, which obviously influences any statistical analyses. Even though missing data as a general problem was addressed directly in chapter 21, and the special case of missing data attributable to publication bias in chapter 23, the problems created by missing data arose implicitly throughout this handbook. For example, the prevalence of missing data on moderator and mediating variables influences the degree to which the problems investigated by a synthesis can be formulated. Missing data influences the data collection stage of a research synthesis by affecting the potential representativeness of studies in which complete data are available. It also affects coding certainty and reliability in studies with partially complete information. Missing data provides an especially pernicious problem as research syntheses turn to increasingly sophisticated analytic methods to handle dependencies and multivariate causal models (see chapters 19 and 20, this volume).

We believe that one of the most important emerging areas in research synthesis is the development of analytic methods to better handle missing data. Much of this work will likely be in the form of adapting methods developed in other areas of statistics to the special requirements of research synthesis. These methods will produce more accurate analyses when data are not missing completely at random (that is, are not a random sample of all data), but are well enough related to observed study characteristics that they can be predicted reasonably well with a model based on data that are observed (that is, are what Little and Rubin would call missing at random). The methods for model-based inferences with missing data using the EM algorithm are discussed in section 21.5.1 and those based on multiple imputation in section 21.5.2 of this volume (see also Little and Rubin 2002; Rubin 1987).

29.4.2.2 Publication Bias  Publication bias is a form of missing data less easy to address with analytic approaches. In the last decade, progress in the development of analytic models for studying and correcting publication bias has been substantial. As might be expected, publication bias is easier to detect than to correct (at least to correct in a way that inspires great confidence).
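To fix the notation behind these distinctions (a compact restatement of Little and Rubin's taxonomy, added here for reference rather than drawn from the original text), let Y denote the complete data of a synthesis and R the indicator of which values are actually observed. Data are missing completely at random when P(R | Y) = P(R); missing at random when P(R | Y) = P(R | Yobs), that is, when missingness depends only on the observed portion of the data; and nonignorable when P(R | Y) depends on the unobserved values themselves, as it does when publication selection depends on the p-value of an unreported result.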
Also as expected, most methods for either detecting or correcting for publication bias are rather weak unless the number of studies is large. Moreover, because publication bias results in what Rod Little and Donald Rubin would call nonignorable missing data, methods to correct for publication bias necessarily depend on the form of the model for the publication selection process that produces the missing data (2002). Even the most sophisticated and least restrictive of these methods (the nonparametric weight function methods, for example) make the assumption that selection depends only on the p-value (for a discussion of these methodologies, see chapter 23, this volume).

These emerging methods are promising, but even their most enthusiastic advocates suggest that it is better not to have to correct for publication bias in the first place. Thus the best remedy for publication bias is the development of invariant sampling frames that would arise through research registers. This nonanalytic development in methodology could be a better substitute for improved statistical corrections in areas in which it is feasible.

29.4.2.3 Fixed- Versus Random-Effects Models  The contrast between fixed- and random-effects models has been drawn repeatedly throughout this handbook: directly in chapters 14, 15, and 16; indirectly in chapters 19, 20, 21, and 23; and implicitly in chapters 17 and 27. The issues in choosing between fixed- and random-effects models are subtle, and sound choices often involve considerable subjective judgment. In all but the most extreme cases, reasonable and intelligent people might disagree about their choices of modeling strategy. However, there is much agreement about criteria that might be used in these judgments.

The criterion offered in chapter 14 and reiterated in chapter 16 for the choice between fixed- and random-effects models is conceptual. Meta-analysts should choose random-effects models if they think of studies as different from one another in ways too complex to capture by a few single study characteristics. If the object of the analysis is to make inferences about a universe of such diverse studies, the random-effects approach is conceptually sensible. Synthesists should choose fixed-effects models if they believe that the important variation in effects is a consequence of a small number of study characteristics. In the fixed-effects analysis the object is to draw inferences about studies with similar characteristics.

A practical criterion for choosing between fixed- and random-effects models is offered in several chapters in this handbook (14, 15, and 16), namely, the criterion of empirical adequacy of the fixed-effects model. In each of these chapters, the authors refer to tests of homogeneity, model specification, or goodness of fit of the fixed-effects model. These tests essentially compare the observed (residual) variation of effect-size estimates with the amount that would be expected if the fixed-effects model fit the parameter values exactly. All of the authors argued that the use of fixed-effects models was more difficult to justify when tests of model specification showed substantial variation in effect sizes that was not explained by the fixed-effects model as specified. Conversely, fixed-effects models were easier to justify when tests of model specification showed no excess (residual) variation. As noted in section 14.3, this empirical criterion, though important, is secondary to the theoretical consideration of how best to conceive the inference.

Some authors prefer the use of fixed-effects models under almost any circumstance, regarding unexplained variation in effect-size parameters as seriously compromising the certainty of inferences drawn from a synthesis (see Peto 1987; section 15.1, this volume). Other authors advocate essentially a random-effects approach under all circumstances. The random-effects model underlies all the methods discussed in chapter 16 of this volume, and is consistent with the Bayesian methods also discussed. In both cases, the implicit argument was conceptual, namely, that studies constituted exchangeable units from some universe of similar studies. In both cases, the object was to discover the mean and variance of effect-size parameters associated with this universe.

One difference is that the Bayesian interpretation of random-effects models draws on the idea that between-studies variation can be conceived as uncertainty about the effect-size parameter (see section 16.2.2). This interpretation of random-effects models was also used in the treatment of synthetic regression models in chapter 20 (section 20.3.4.4), where the use of a random-effects model was explicitly referred to as adding a between-studies component of uncertainty to the analysis.

Yet another empirical criterion, offered in chapter 16, was based on the number of studies. Random-effects models are advocated when the number of studies is large enough to support such inferences. The reason for choosing random-effects models (when empirically feasible) was conceptual: the belief that study effect sizes would be different across studies and that those differences would be controlled by a large number of unpredictable influences.
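The contrast can be stated compactly in the notation used throughout this volume (an added restatement, with Ti the effect-size estimate from study i and vi its conditional variance). Under the fixed-effects model, homogeneity is assessed by comparing

   Q = Σ wi(Ti − T̄)², where wi = 1/vi and T̄ = Σ wiTi / Σ wi,

to a chi-square distribution with k − 1 degrees of freedom for k studies. Under the random-effects model, the true effects themselves vary, Ti = μ + ui + ei with Var(ui) = τ², and each study is weighted by 1/(vi + τ²); the fixed-effects analysis is the special case in which τ² = 0.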
Thus, though there are differences among the details of the criteria given, the positions of these authors reflect a consensus on criteria. Conceptual criteria would be applied first, with the model chosen according to the goals of the inference. Only after considering the objective of the inference would empirical criteria influence modeling strategy, perhaps by forcing a reformulation of the question to be addressed in the synthesis. The empirical finding of persistent heterogeneity among the effect sizes might force reconsideration of a fixed-effects analysis. Similarly, the ambition to use a random-effects approach to draw inferences about a broad universe of studies might need to be reconsidered if the collection of empirical research results was too small for meaningful inference, to be replaced by fixed-effects comparisons among particular studies.

A particularly interesting area of development is applying random-effects models to the problem of estimating the results of populations of studies that may not be identical in composition to the study sample (see U.S. General Accounting Office 1992; Boruch and Terhanian 1998). For example, in medical research, the patients (or doctors or clinics) selected to participate in clinical trials are typically not representative of the general patient population. Although the studies as conducted may be perfectly adequate for assessing whether a new treatment can produce treatment effects that are different from zero, they may not give a representative sample for assessing the likely effect in the nation if the new treatment became the standard. Simply extrapolating from the results of the studies conducted would not be warranted if the treatment was less effective with older or sicker patients than were typically included in the trials. Efforts to use information about systematic variation in effect sizes to predict the impact of a new medical treatment (or any other innovation) if it were adopted are an exciting application of research synthesis.

29.4.2.4 Bayesian Methods  There has been an increasing interest in Bayesian methods in research synthesis. One reason is the recognition that random-effects analyses depend on the (unknown) effect-size variance component, and that classical analyses are based on a single (estimated) value of that variance component. Yet predicating an analysis on any single estimated value inadequately conveys the uncertainty of the estimated variance component, which is often quite large. Bayesian analyses can incorporate that uncertainty by computing a weighted average over the likely values of the variance component with weight proportional to the likelihood of that value, that is, the posterior probability (see DuMouchel 1990; Hedges 1998).

A second reason for the appeal of Bayesian methods is that research synthesis, like any scientific endeavor, has many subjective aspects. Bayesian statistical analyses, designed to formally incorporate subjective knowledge and its uncertainty, have become increasingly important in many areas of applied statistics. We expect research synthesis to be no exception. As quantitative research syntheses are more widely used, we expect to see a concomitant development of Bayesian methods suited to particular applications. One advantage to Bayesian methods is that they can be used to combine hard data on some aspects of a problem with subjective assessments concerning other aspects of the problem about which there are no hard data. One such situation arises when the inference that is desired depends on several links in a causal chain, but not all of the links have been subject to quantitative empirical research (see Eddy, Hasselblad, and Schacter 1992).

29.4.2.5 Causal Modeling in Research Synthesis  Structural equation modeling is destined to become more important in research synthesis for two reasons. First, the use of multivariate causal models in primary research will necessitate their greater use in research synthesis. Second, as research synthesists strive for greater explanation, it seems likely that they will turn to structural equation models estimated using underlying matrices constructed through the meta-analytic integration of their elements, whether or not they have been used in the underlying primary research literature (see Cook et al. 1992). The basic methodology for the meta-analysis of models is given in chapter 20. We expect the work given there to be the basis of many meta-analyses designed to probe the fit between proposed causal explanations and the observed data (for a discussion of the relationship between synthesis-generated evidence and inferences about causal relationships, see chapter 2).

A different and entirely new aspect of this methodology is that it has been used to examine the generalizability of research findings over facets that constitute differences in study characteristics (Becker 1996). For example, the method has been used to estimate the contribution of between-measures variation to the overall uncertainty (standard error) of the relationship or partial relationship between variables. This application of models in synthesis is exciting because it provides a mechanism for evaluating quantitatively the effect of construct breadth on generalizability of research results.
Because broader constructs (those admitting a more diverse collection of operations) will typically lead to operations variation, syntheses based on broader constructs will typically be associated with greater overall uncertainty. However, broader constructs will have a larger range of application. Quantitative analyses of generalizability provide a way not only of understanding the body of research that is synthesized, but also of planning the measurement strategy of future research studies to ensure adequate generalizability across the constructs of interest.

29.5 CRITERIA FOR JUDGING THE QUALITY OF RESEARCH SYNTHESES

In chapter 28, Georg Matt and Thomas Cook began the process of extracting wisdom from technique. They attempted to help synthesists take the broad view of their work, to hold it up to standards that begin answering the question, What have I learned?

The criteria for quality syntheses can vary with the needs of the reader. Susan Cozzens found that readers using literature reviews to follow developments within their area of expertise valued comprehensiveness and did not consider important the reputation of the author (1981). However, if the synthesis was outside the reader's area, the expertise of the author was important, as was brevity.

Kenneth Strike and George Posner also addressed the quality of knowledge synthesis with a wide-angle lens (1983). They suggested that a valuable knowledge synthesis must have both intellectual quality and practical utility. A synthesis should clarify and resolve issues in a literature rather than obscure them. It should result in a progressive paradigm shift, meaning that it brings to a theory greater explanatory power, to a practical program expanded scope of application, and to future primary research an increased capacity to pursue unsolved problems. Always, a knowledge synthesis should answer the question asked; readers should be provided a sense of closure.

In fact, readers have their own criteria for what makes a good synthesis. Harris Cooper asked post-master's degree graduate students to read meta-analyses and to evaluate five aspects of the papers (1986). The results revealed that readers' judgments were highly intercorrelated: syntheses perceived as superior on one dimension were also perceived as superior on others. Further, the best predictor of the readers' quality judgments was their confidence in their ability to interpret the results. Syntheses seen as more interpretable were given higher quality ratings. Readers were also asked to provide open-ended comments on papers, which were subjected to an informal content analysis. The dimensions that readers mentioned most frequently were organization, writing style, clarity of focus, use of citations, attention to variable definitions, attention to study methodology, and manuscript preparation.

Interestingly, researchers in medicine and public health have focused more attention on the development of consumer-oriented evaluation devices for meta-analysis than social scientists have. Still, the resulting scales and checklists are not numerous and efforts to construct them have been only partly successful. For example, after a thorough search of the literature, a report prepared by the Agency for Healthcare Research and Quality (see West et al. 2002) identified one research synthesis quality scale (Oxman and Guyatt 1991) and ten sources labeled checklists, though two of these sources are essentially textbooks on conducting research syntheses for medical and health researchers (Clarke and Oxman 1999; Khan et al. 2000; see table 29.1).

Cooper also developed an evaluative checklist for consumers of syntheses of social science research (2007). Following the plan of this handbook, the process of research synthesis was divided into six stages:

1. defining the problem,

2. collecting the research evidence,

3. evaluating the correspondence between the methods and implementation of individual studies and the desired inferences of the synthesis,

4. summarizing and integrating the evidence from individual studies,

5. interpreting the cumulative evidence, and

6. presenting the research synthesis methods and results.

Table 29.2 presents the twenty questions Cooper offered as most essential to evaluating social science research syntheses (2007). They are written from the point of view of a synthesis consumer and each is phrased so that an affirmative response means confidence could be placed in that aspect of the synthesis methodology. Although the list is not exhaustive, most of the critical issues discussed in this volume are expressed in the questions, as are the dimensions used in medical and health checklists that seem most essential to work in the social sciences.

It is important to make three additional points about the checklist.
Table 29.1  Scales and Checklists for Evaluating Research Syntheses Located by the Agency for Healthcare Research and Quality

Auperin, A., J. P. Pignon, and T. Poynard. 1997. "Review Article: Critical Review of Meta-Analyses of Randomized Clinical Trials in Hepatogastroenterology." Alimentary Pharmacology and Therapeutics 11(2): 215–25.
Beck, C. T. 1997. "Use of Meta-Analysis as a Teaching Strategy in Nursing Research Courses." Journal of Nursing Education 36(2): 87–90.
Clarke, Mike, and Andy D. Oxman, eds. 1999. Cochrane Reviewers' Handbook 4.0. Oxford: The Cochrane Collaboration.
Harbour, R., and J. Miller. 2001. "A New System for Grading Recommendations in Evidence-Based Guidelines." British Medical Journal 323(7308): 334–36.
Irwig, L., A. N. A. Tosteson, C. Gatsonis, J. Lau, G. Colditz, T. C. Chalmers, and F. Mosteller. 1994. "Guidelines for Meta-Analyses Evaluating Diagnostic Tests." Annals of Internal Medicine 120(8): 667–76.
Khan, K. S., G. Ter Riet, J. Glanville, A. J. Sowden, and J. Kleijnen. 2000. Undertaking Systematic Reviews of Research on Effectiveness: CRD's Guidance for Carrying Out or Commissioning Reviews. York: University of York, NHS Centre for Reviews and Dissemination.
Oxman, Andy D., and Gordon H. Guyatt. 1991. "Validation of an Index of the Quality of Review Articles." Journal of Clinical Epidemiology 44(11): 1271–78.
Sacks, H. S., J. Berrier, D. Reitman, V. A. Ancona-Berk, and T. C. Chalmers. 1987. "Meta-Analyses of Randomized Controlled Trials." New England Journal of Medicine 316(8): 450–55.
Scottish Intercollegiate Guidelines Network. 2001. Sign 50: A Guideline Developer's Handbook. Methodology Checklist 1: Systematic Reviews and Meta-Analyses. Edinburgh: Scottish Intercollegiate Guidelines Network. http://www.sign.ac.uk/guidelines/fulltext/50/checklist1.html
Smith, Andrew F. 1997. "An Analysis of Review Articles Published in Four Anaesthesia Journals." Canadian Journal of Anaesthesia 44(4): 405–9.
West, Suzanne, Valerie King, Timothy S. Carey, Kathleen N. Lohr, Nikki McKoy, Sonya F. Sutton, and Linda Lux. 2002. Systems to Rate the Strength of Scientific Evidence. Evidence Report/Technology Assessment 47. AHRQ Publication No. 02-E016. Rockville, Md.: Agency for Healthcare Research and Quality.
First, it does not use a scaling procedure that could in turn be used to calculate a numerical score for a synthesis, such that higher scores might indicate a more trustworthy synthesis. This approach was rejected because of concerns raised about similar scales used to evaluate the quality of primary research (see Valentine and Cooper 2008; chapter 7, this volume), in particular, that single quality scores are known to generate wildly different conclusions depending on what dimensions of quality are included and on how the different items on the scale are weighted. Also, when summary numbers are generated, studies with very different profiles of strengths and limitations can receive similar scores, making syntheses with very different validity characteristics appear more similar than they actually are.

Second, some questions on the list lead to clearer prescriptions than others do of what constitutes good synthesist practices. This occurs for two reasons. First, the definition will depend at least somewhat on the topic under consideration. For example, the identification of information channels to search to locate studies relevant to a synthesis topic and what terms to use when searching reference databases clearly depend on topic decisions (see chapter 5). The checklist can therefore only suggest that syntheses based on complementary sources and proper and exhaustive database search terms should be considered by consumers as more trustworthy; it cannot specify what these sources and terms might be. Second, as many of the chapters in this book suggest, a consensus has not yet emerged on some judgments about the adequacy of synthesis methods, even among expert synthesists.

Finally, a distinction is made in the checklist between questions that relate to the conduct of research synthesis in general and to meta-analysis in particular. This is because not all research areas will have the needed evidence base to conduct a meta-analysis that produces interpretable results.
Table 29.2  Checklist of Questions for Evaluating Research Syntheses

Defining the problem
1. Are the variables of interest given clear conceptual definitions?
2. Do the operations that empirically define the variables of interest correspond to the variables' conceptual definitions?
3. Is the problem stated so that the research designs and evidence needed to address it can be specified clearly?
4. Is the problem placed in a meaningful theoretical, historical, or practical context?

Collecting the research evidence
5. Were complementary searching strategies used to find relevant studies?
6. Were proper and exhaustive terms used in searches and queries of reference databases and research registries?
7. Were procedures employed to assure the unbiased and reliable application of criteria to determine the substantive relevance of studies, and retrieval of information from study reports?

Evaluating the correspondence between the methods and implementation of individual studies and the desired inferences of the synthesis
8. Were studies categorized so that important distinctions could be made among them regarding their research design and implementation?
9. If studies were excluded from the synthesis because of design and implementation considerations, were these considerations explicitly and operationally defined, and consistently applied to all studies?

Summarizing and integrating the evidence from individual studies
10. Was an appropriate method used to combine and compare results across studies?
11. If effect sizes were calculated, was an appropriate effect size metric used?
12. If a meta-analysis was performed, were average effect sizes and confidence intervals reported, and was an appropriate model used to estimate the independent effects and the error in effect sizes?
13. If a meta-analysis was performed, was the homogeneity of effect sizes tested?
14. Were study design and implementation features (as suggested by question 8), along with other critical features of studies, including historical, theoretical, and practical variables (as suggested by question 4), tested as potential moderators of study outcomes?

Interpreting the cumulative evidence
15. Were analyses carried out that tested whether results were sensitive to statistical assumptions and, if so, were these analyses used to help interpret the evidence?
16. Did the research synthesists discuss the extent of missing data in the evidence base and examine its potential impact on the synthesis findings?
17. Did the research synthesists discuss the generality and limitations of the synthesis findings?
18. Did the synthesists make the appropriate distinction between study-generated and synthesis-generated evidence when interpreting the synthesis results?
19. If a meta-analysis was performed, did the synthesists contrast the magnitude of effects with other related effect sizes and/or present a practical interpretation of the significance of the effects?

Presenting the research synthesis methods and results
20. Were the procedures and results of the research synthesis clearly and completely documented?

source: Cooper 2007. Reprinted with permission from World Education.
This does not mean, however, that other aspects of sound synthesis methodology can be ignored (for example, clear definition of terms, appropriate literature searches).

In discussing several meta-analyses of the desegregation literature, Kenneth Wachter and Miron Straf pointed out that sophisticated literature searching procedures, data quality controls, and statistical techniques can bring a research synthesist only so far (1990). Eventually, Wachter wrote, "there doesn't seem to be a big role in this kind of work for much intelligent statistics, as opposed to much wise thought" (182). These authors were correct in emphasizing the importance of wisdom in research integration. Wisdom is essential to any scientific enterprise, but wisdom starts with sound procedure. Synthesists must consider a wide range of technical issues as they piece together a research domain.
This handbook is meant to help research synthesists build well-supported knowledge structures. But structural integrity is a minimum criterion for synthesis to lead to scientific progress. A more general kind of design wisdom is also needed to build knowledge structures that people want to view, visit, and, ultimately, live within.

29.6 REFERENCES

American Educational Research Association. 2006. Standards for Reporting on Empirical Social Science Research in AERA Publications. Washington, D.C.: American Educational Research Association. http://www.aera.net/uploadedFiles/Opportunities/StandardsforReportingEmpiricalSocialScience_PDF.pdf

American Psychological Association (APA) Publications and Communications Board Working Group on Journal Article Reporting Standards. 2008. "Reporting Standards for Research in Psychology: Why Do We Need Them? What Might They Be?" American Psychologist 63(8): 1–13.

Becker, Betsy J. 1996. "The Generalizability of Empirical Research Results." In From Psychometrics to Giftedness: Papers in Honor of Julian Stanley, edited by Camilla P. Benbow and David Lubinsky. Baltimore, Md.: Johns Hopkins University Press.

Boruch, Robert, and George Terhanian. 1998. "Cross Design Synthesis." In Advances in Educational Productivity, edited by Douglas M. Windham and David W. Chapman. London: JAI Press.

Clarke, Mike, and Andy D. Oxman, eds. 1999. Cochrane Reviewers' Handbook 4.0. Oxford: The Cochrane Collaboration.

Cook, Thomas D., Harris Cooper, David S. Cordray, Heidi Hartmann, Larry V. Hedges, Richard R. Light, Thomas A. Louis, and Frederick Mosteller. 1992. Meta-Analysis for Explanation: A Casebook. New York: Russell Sage Foundation.

Cooper, Harris. 1982. "Scientific Guidelines for Conducting Integrative Research Reviews." Review of Educational Research 52(2, Summer): 291–302.

———. 1986. "On the Social Psychology of Using Integrative Research Reviews: The Case of Desegregation and Black Achievement." In The Social Psychology of Education, edited by Robert Feldman. Cambridge: Cambridge University Press.

———. 2007. Evaluating and Interpreting Research Synthesis in Adult Learning and Literacy. Boston, Mass.: National College Transition Network, New England Literacy Resource Center/World Education, Inc.

Cozzens, Susan E. 1981. Users' Requirements for Scientific Reviews. Grant Report No. IST-7821947. Washington, D.C.: National Science Foundation.

Des Jarlais, Don C., Cynthia Lyons, and Nicole Crepaz. 2004. "Improving the Reporting Quality of Nonrandomized Evaluations of Behavioral and Public Health Interventions: The TREND Statement." American Journal of Public Health 94(3): 361–66.

DuMouchel, Walter H. 1990. "Bayesian Meta-Analysis." In Statistical Methodology in Pharmaceutical Sciences, edited by Donald A. Berry. New York: Marcel Dekker.

Eddy, David M., Vic Hasselblad, and Ross Schacter. 1992. Meta-Analysis by the Confidence Profile Method. San Diego: Academic Press.

Glass, Gene V. 1976. "Primary, Secondary, and Meta-Analysis of Research." Educational Researcher 5(10): 3–8.

Hedges, Larry V. 1998. "Bayesian Meta-Analysis." In Statistical Analysis of Medical Data: New Developments, edited by Brian S. Everitt and Graham Dunn. London: Arnold.

Khan, Khalid S., Gerben Ter Riet, Julie Glanville, Amanda J. Sowden, and Jos Kleijnen. 2000. Undertaking Systematic Reviews of Research on Effectiveness: CRD's Guidance for Carrying Out or Commissioning Reviews. York: University of York, NHS Centre for Reviews and Dissemination.

Little, Roderick J. A., and Donald B. Rubin. 2002. Statistical Analysis with Missing Data, 2nd ed. New York: John Wiley & Sons.

Moher, David, Kenneth F. Schulz, and Douglas G. Altman. 2001. "The CONSORT Statement: Revised Recommendations for Improving the Quality of Reports of Parallel-Group Randomised Trials." Lancet 357(9263): 1191–94.

Ottenbacher, Kenneth, and Harris Cooper. 1984. "The Effects of Class Placement on the Social Adjustment of Mentally Retarded Children." Journal of Research and Development in Education 17: 1–14.

Oxman, Andy D., and Gordon H. Guyatt. 1991. "Validation of an Index of the Quality of Review Articles." Journal of Clinical Epidemiology 44(11): 1271–78.

Peto, Richard. 1987. "Why Do We Need Systematic Overviews of Randomized Trials (with Discussion)." Statistics in Medicine 6(3): 233–44.

Rubin, Donald B. 1987. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.

Strike, Kenneth, and George Posner. 1983. "Types of Syntheses and Their Criteria." In Knowledge Structure and Use: Implications for Synthesis and Interpretation, edited by Spencer Ward and Linda J. Reed. Philadelphia, Pa.: Temple University Press.
U.S. General Accounting Office. 1992. Cross Design Synthesis: A New Strategy for Medical Effectiveness Research. GAO/PEMD No. 92-18. Washington, D.C.: U.S. Government Printing Office.

Valentine, Jeffrey C., and Harris Cooper. 2008. "A Systematic and Transparent Approach for Assessing the Methodological Quality of Intervention Effectiveness Research: The Study Design and Implementation Assessment Device (Study DIAD)." Psychological Methods 13(2): 130–49.

Wachter, Kenneth W., and Miron L. Straf, eds. 1990. The Future of Meta-Analysis. New York: Russell Sage Foundation.

West, Suzanne, Valerie King, Timothy S. Carey, Kathleen N. Lohr, Nikki McKoy, Sonya F. Sutton, and Linda Lux. 2002. Systems to Rate the Strength of Scientific Evidence. Evidence Report/Technology Assessment 47. AHRQ Publication No. 02-E016. Rockville, Md.: Agency for Healthcare Research and Quality.
GLOSSARY
A Posteriori Tests: Tests that are conducted after examination of study results. Used to explore patterns that seem to be emerging in the data. The same as Post Hoc Tests.

A Priori Tests: Tests that are planned before the examination of the results of the studies under analysis. The same as Planned Tests.

Aggregate Analysis: The integration of evidence across studies when a description of the quantitative frequency or level of an event is the focus of a research synthesis.

Agreement Rate: The most widely used index of interrater reliability in research synthesis. Defined as the number of observations agreed on divided by the total number of observations. Also called percentage agreement. (A worked example follows the entries below.)

Apples and Oranges Problem: A metaphor for studies that appear related, but are actually measuring different things. A label sometimes given as a criticism of meta-analysis because meta-analysis combines studies that may have differing methods and operational definitions of the variables involved in the calculation of effect sizes.

Artifact Correction: The modification of an estimate of effect size (usually by an artifact multiplier) to correct for the effects of an artifact. See also Artifact Multiplier.

Artifact Distribution: A distribution of values for a particular artifact multiplier derived from a particular research literature.

Artifact Multiplier: The factor by which a statistical or measurement artifact changes the expected observed value of a statistic.

Artifacts: Statistical and measurement imperfections that cause observed statistics to depart from the population (parameter) values the researcher intends to estimate.

Artificial Dichotomization: The arbitrary division of scores on a measure of a continuous variable into two categories.

Attenuation: The reduction or downward bias in the observed magnitude of an effect size produced by methodological limitations in a study such as measurement error or range restriction.

Available Case Analysis: Also called pairwise deletion, a method for missing data analysis that uses all available data to estimate parameters in a distribution so that, for example, all available pairs of values for two variables are used to estimate a correlation.

Bayes Posterior Coverage: A Bayesian analogue to the confidence interval.

Bayes's Theorem: A method of incorporating data into a prior distribution to obtain a posterior distribution.

Bayesian Analysis: An approach that incorporates a prior distribution to express uncertainty about population parameter values.

Best-Evidence Synthesis: Research syntheses that rely on study results that are the best available, not an a priori or idealized standard of evidence.

Between-Study Moderators: Third variables or sets of variables that affect the direction or magnitude of relations between other variables. Identified on a between-studies basis when some reviewed studies represent one level of the moderator and other studies represent other levels. See Within-Study Moderators.

Between-Study Predictors: Measured characteristics of studies hypothesized to affect true effect sizes.

Between-Study Sample Size: The number of studies in a meta-analysis.
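A worked illustration of the agreement rate defined above (an added example using invented figures, not part of the original glossary): if two coders record 40 observations and agree on 34 of them, the agreement rate is 34/40 = .85.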
Bibliographic Database: A machine-readable file consisting at minimum of entries sufficient to identify documents. Additional information may include subject indexing, abstracts, sponsoring institutions, and so on. Typically can be searched on a variety of fields (for example, author's name, keywords in title, year of publication, journal name). See also Reference Database.

Bibliographic Search: An exploration of literature to find reports relevant to the research topic. A search is typically conducted by consulting sources such as paper indexes, reference lists of relevant documents, contents of relevant journals/books, and electronic bibliographic databases.

Birge Ratio: The ratio of certain chi-square statistics (used in tests for heterogeneity of effects, model fit, or that variance components are zero) to their degrees of freedom. Used as an index of heterogeneity or lack of fit. Has an expected value of one under correct model specification or when the between-studies variance component is zero.

Buck's Method: A method for missing data imputation that replaces missing observations with the predicted value from a regression of the missing variable on the completely observed variable; first suggested by Buck (1960).

Categorical (Grouping) Variable: A variable that can take on a finite number of values used to define groups of studies.

Certainty: The confidence with which the scientific community accepts research conclusions, reflecting the truth value accorded to conclusions.

Citation Search: A literature search in which documents are identified based on their being cited by other documents.

Classical Analysis: An analysis that assumes that parameter values are fixed, unknown constants and that the data contain all the information about these parameters.

Coder: A person who reads and extracts information from research reports.

Coder Reliability: The equivalence with which coders extract information from research reports.

Coder Training: A process for describing and practicing the coding of studies. Modification of coding forms, coding conventions, and code book may occur during the training process.

Coding Conventions: Specific methods used to transform information in research reports into numerical form. A primary criterion for choosing such conventions is to retain as much of the original information as possible.

Coding Forms: The physical records of coded items extracted from a research report. The forms, and items on them, are organized to assist coders and data entry personnel.

Combined Procedure: A procedure that combines effect size estimates based on vote-counting procedures and effect size estimates based on effect size procedures to obtain an overall estimate of the population effect size.

Combining Results: Putting effect sizes on a common metric (for example, r or d) and calculating measures of location (for example, mean, median) or combining tests of significance.

Comparative Study: A study in which two or more groups of individuals are compared. The groups may involve different participants, such as when two groups of patients receive two different treatments, or the same group being tested or administered treatment at two separate times, such as when two diagnostic tests that measure the same parameter are administered at separate times to the same group.

Comparisons: Linear combinations of means (of effect sizes), often used in the investigation of patterns of differences among three or more means (of effect sizes). The same as Contrasts.

Complete Case Analysis: A method for analysis of data with missing observations, which uses only cases that observe all variables in the model under investigation.

Conceptual (or Theoretical) Definition: A description of the qualities of the variable that are independent of time and space but which can be used to distinguish events that are and are not relevant to the concept.

Conditional Distribution: The sampling distribution of a statistic (such as an effect size estimate) when a parameter of interest (such as the effect size parameter) is held constant.

Conditional Exchangeability of Study Effect Size: A property of a set of studies that applies when the investigator has no a priori reason to expect the true effect size of any study to exceed the true effect size of any other study in the set, given the two studies share certain identifiable characteristics.

Conditional Mean Imputation: Another term for Buck's method or regression imputation, a method for missing data imputation that replaces missing observations with the predicted value from a regression of the missing variable on the completely observed variable; first suggested by Buck (1960). See also Buck's Method.

Conditional Variance: Variability due to sampling error that is associated with estimation of an effect size. Same as Estimation Variance.

Confidence Interval (CI): The interval within which a population parameter is expected to lie.

Confidence Ratings: A method for coders to directly rate their confidence level in the accuracy, completeness, and so on, of the data being extracted from primary studies. Can establish a mechanism for discerning high-quality from lesser-quality information in the conduct of subsequent analyses.
Construct Validity: The extent to which generalizations can be made about higher-order constructs on the basis of research operations.

Controlled Vocabulary: The standard terminology defined by an indexing service for a reference database and made available through a Thesaurus.

Correlation Coefficient: An index that gives the magnitude of linear relationship between two quantitative variables. It can range from −1 (perfect negative relationship) to +1 (perfect positive relationship), with 0 indicating no linear relationship.

Correlation Matrix: A matrix (rectangular listing arrangement) of the zero-order correlations among a set of p variables. The diagonal elements of the correlation matrix equal one because they represent the correlation of each variable with itself. Within a single study the off-diagonal elements are denoted rij, for i, j = 1 to p, i ≠ j. The correlation rij is the correlation between variables i and j.

Covariates: Covariates are variables that are predictive of the outcome measure and are incorporated into the analysis of data. In randomized intervention studies, they are sometimes used to improve the precision with which key parameters are estimated and to increase the statistical power of significance tests. In nonrandomized studies, they are used to potentially reduce the bias caused by confounding, that is, the presence of characteristics associated with both the exposure variable and the outcome variable.

Data Coding: The process of obtaining data. In meta-analysis it consists of reading studies and recording relevant information on data collection forms.

Data Coding Protocol: A manual of instruction for data coders, which explains how to code data items, how to handle exceptions, and when to consult the project manager to resolve unexpected situations.

Data Entry: The process of transferring information from data collection forms into a database.

Data Reduction: The process of combining data items in a database to produce a cases-by-variables file for statistical analysis.

Density Function: A mathematical function that defines the probability of occurrence of values of a random variable.

Descriptive Analysis: Descriptive statistics that characterize the results and attributes of a collection of research studies in a synthesis.

Descriptors: Subject headings used to identify important concepts included in a document.

Differences Between Proportions: The sample statistic D, population parameter Δ. Computed as D = p1 − p2, where p is a proportion.

Disattenuation: The process of correcting for the reduction or downward bias in an effect size produced by attenuation. (See the formula given below.)

Effect Size: A generic term that refers to the magnitude of an effect or more generally to the size of the relation between two variables. Special cases include the standardized mean difference, correlation coefficient, odds ratio, or the raw mean difference.

Effect-Size Items (coding): Items related to an effect size. Can include the nature and reliability of outcome and predictor measures, sample size(s), means and standard deviations, and indices of association.

Eligibility Criteria: Conditions that must be met by a primary study in order for it to be included in the research synthesis. Also called Inclusion Criteria.

EM Algorithm: An iterative estimation procedure that cycles between estimation of the posterior expectation (E) of a random variable (or collection of variables) and the maximization (M) of its posterior likelihood. Useful for maximum likelihood estimation with missing data.

Error of Measurement: Random errors in the scores produced by a measuring instrument.

Estimation Variance, vi: The variance of the effect size estimator given a fixed true effect size. The same as Conditional Variance.

Estimator of Effect Size, Ti: The estimator of the true effect size based on collected sample data.

Evaluation Synthesis: A method of synthesizing evidence about programs and policies developed by the U.S. General Accounting Office.

Evidence-Based Practices: Programs, products, and policies that have been shown to be effective using quality research designs and methods.

Exclusion Criteria: See Ineligibility Criteria.

Experiment: An investigation of the effects of manipulating a variable.

Experimental Units: The smallest division of the experimental (or observational) material such that any two units may receive different treatments.

Explicit Truncation: A process by which all scores in a distribution in a certain range (for example, all scores below z = −0.10) are eliminated from the data used in the data analysis. See Range Restriction.
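The correction described under Disattenuation above can be written explicitly (an added illustration using the classical correction for attenuation, not part of the original glossary): rcorrected = rxy / √(rxx ryy), where rxy is the observed correlation and rxx and ryy are the reliabilities of the two measures.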
Exploratory Data Analysis: A descriptive (as opposed to inferential) analysis of data that often supplements numerical summaries with visual displays.

External Validity: The value of a study or set of studies for generalizing to individuals, settings, or procedures. Studies with high external validity can be generalized to a larger number of individuals, settings, or procedures than can studies with lower external validity. Largely determined by the sampling of individuals, settings, or procedures employed in the investigations from which generalizations are to be drawn.

Extrinsic Factors: The characteristics of research other than the phenomenon under investigation or the methods used to study that phenomenon, for example, the date of publication and the gender of the author.

Falsificationist Approach: A framework that stresses how secure knowledge depends on identifying and ruling out plausible alternative interpretations.

File-Drawer Problem: The situation in which study results go unreported (and thus are left in file drawers) when tests of hypotheses do not show statistically significant results. See also Publication Bias.

Fixed Effects: Effects (effect sizes or components of a model for effect size parameters) that are unknown constants. Compare with Random Effects.

Fixed-Effects Model: A model for combining effect sizes that assumes all effect sizes estimate a common population parameter, and that observed effect sizes differ from that parameter only by virtue of sampling error. Compare with Random-Effects Model.

Flat-File: In data management, a file that constitutes a collection of records of the same type that do not contain repeating items. Can be represented by a two-dimensional array of data items, that is, a cases-by-variables data matrix.

Footnote Chasing: A technique for discovering relevant documents by tracing authors' footnotes (more broadly, their citations) to earlier documents. Also known as the Ancestry Approach. See also Forward Citation Searching.

Forward Citation Searching: A way of discovering relevant documents by looking up a known document in a citation index and finding the later documents that have cited it. See also Footnote Chasing.

Fourfold Table: A table that reports data concerning the relationship between two dichotomous factors. See Odds Ratio, Log Odds Ratio, and Rate Difference.

Free-Text Search: A search in which the computer matches the user's search terms against all words in a document (title, abstract, subject indexing, and full text). The search terms may be either controlled vocabulary or natural language, but the latter is usually implied.

Fugitive Literature: Papers or articles produced in small quantities, not widely distributed, and usually not listed in commonly used abstracts and indexes. See also Grey Literature.

Full-Text Database: A machine-readable file that consists of the complete texts of documents as opposed to merely citations and abstracts.

Funnel Plot: A graphic display of sample size plotted against effect size. When many studies come from the same population, each estimating the same underlying parameter, the graph should resemble a funnel with a single spout.

Gaussian Distribution: Data distributed according to the normal bell-shaped curve.

Generalization: Important purpose and promise of a meta-analysis. Refers to empirical knowledge about a general association. It can involve identifying whether an association holds with specific populations of persons, settings, times, and ways of varying the cause or measuring the effect; holds across different populations of people, settings, times, and ways of operationalizing a cause and effect; or can be extrapolated to other populations of people, settings, times, causes, and effects than those that have been studied to date.

Generalized Least Squares (GLS) Estimation: An estimation procedure similar to the more familiar ordinary least squares method. A sum of squared deviations between parameters and data is minimized, but GLS allows the data points for which parameters are being estimated to have unequal population variances and nonzero covariances (that is, to be dependent). (See the formula given below.)

Grey Literature: Literature that is produced on all levels of government, academia, business, and industry in print and electronic formats, but which is not controlled by commercial publishers. See also Fugitive Literature.

Heterogeneity: The extent to which observed effect sizes differ from one another. In meta-analysis, statistical tests allow for the assessment of whether the variability in observed effect sizes is greater than would be expected given chance (that is, sampling error alone). If so, then the observed effects are said to be heterogeneous.

Heterogeneity-Homogeneity of Classes: The range-similarity of classes of persons, treatments, outcomes, settings, and times included in the studies that have been reviewed.

High Inference Codes: Codes of study characteristics that require the synthesist to infer the correct value. See also Low Inference Codes.

Homogeneity: A condition under which the variability in observed effect sizes is not greater than would be expected given sampling error.
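The estimator described under Generalized Least Squares (GLS) Estimation above can be written compactly (an added illustration in standard matrix notation, not part of the original glossary): b = (X′V⁻¹X)⁻¹X′V⁻¹T, where T is the vector of effect-size estimates, X the matrix of study-level predictors, and V the (not necessarily diagonal) covariance matrix of the estimates.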
Homogeneity Test: A test of whether a collection of effect size estimates exhibits greater variability than would be expected if their corresponding effect size parameters were identical.
Hypothesis: A research problem containing a prediction about a particular link between the variables, based on theory or previous observation.
Identification Items (coding): The category of coded items that document the research reports that are synthesized. Author, coder, country, and year and source of publication are typical items.
Ignorable Response Mechanism: When data are either missing completely at random (MCAR) or missing at random (MAR), the mechanism that causes missing data is called ignorable because the response mechanism does not have to be modeled when using maximum likelihood or multiple imputation for missing data.
Inclusion Criteria: See Eligibility Criteria.
Indirect Relation: A relationship between two variables where a third (or more) variable intervenes. A variable can have both direct and indirect relations to an outcome; its indirect relations are achieved by way of one or more intermediate (mediator) variables. See Mediator Variable.
Ineligibility Criteria: Conditions or characteristics that render a primary study ineligible for inclusion in the research synthesis. The same as Exclusion Criteria.
Instrumentalism: A philosophical view that scientific theories or their concepts and measures have no necessary relation to external validity.
Intercoder Correlation: An index of interrater reliability for continuous variables, based on the Pearson correlation coefficient. Used in research synthesis to estimate interrater reliability, analogous to its use in education to estimate test reliabilities when parallel forms are available.
Interim Report: A report written on completion of a portion of a study. A series of interim reports may be generated during the course of a single study.
Internal Validity: The value of a study or set of studies for concluding that a causal relationship exists between variables, that is, that one variable affects another. Studies with high internal validity provide a strong basis for making a causal inference. Largely determined by the control of alternative variables that could explain the relationship found within a study.
Interrater Reliability: The extent to which different raters rating the same studies assign the same rating to the coded variables.
Intraclass Correlation: Computed as the ratio of the variance of interest over the sum of the variance of interest plus error, the intraclass correlation is a family of indices of agreement (or consistency) for continuous variables. The intraclass correlation can be used to isolate different sources of variation in measurement and to estimate their magnitude using the analysis of variance. The intraclass correlation is also used to describe the amount of clustering in a population in clustered or multilevel samples, where it is the ratio of between-cluster variance to total variance.
Invisible College: A geographically dispersed network of scientists or scholars who share information in a particular research area through personal communication.
Kappa: A versatile family of indices of interrater reliability for categorical data, defined as the proportion of the best possible improvement over chance that is actually obtained by the raters. Generally superior to other indices designed to remove the effects of chance agreement, kappa is a true reliability statistic that in large samples is equivalent to the intraclass correlation coefficient.
Large-Sample Approximation: The statistical theory that is valid when each study has a large (within-study) sample size. The exact number of cases required to qualify as large depends on the effect size index used. In meta-analysis, most statistical theory is based on large-sample approximations.
Linked Meta-Analysis: A meta-analysis in which a series of (often bivariate) meta-analyses are used to represent a sequence of relationships.
Listwise Deletion: A method for analyzing data with missing observations by analyzing only those cases with complete data; also termed complete case analysis.
Log Odds: See Logit.
Log Odds Ratio: The sample statistic l, population parameter λ. Computed as the natural logarithm of the odds ratio, ln(o). See Fourfold Table.
Logistic Regression: A statistical model in which the logit of a probability is postulated to be a linear function of a set of independent variables.
Logit: The logarithm of the odds value associated with a probability. If π is a probability, the logit is ln[π/(1 − π)], where ln denotes the natural logarithm.
Logit Transformation: A transformation commonly used with proportions or other numbers that have values between zero and one. The logit is the logarithmic transformation of the ratio p/(1 − p), that is, ln[p/(1 − p)].
Low Inference Codes: Codes of study characteristics that require the synthesists only to locate the needed information in the research report and transfer it to the synthesis database. See also High Inference Codes.
Mantel-Haenszel Statistics: A set of statistics for combining odds ratios and testing the statistical significance of the resulting average odds ratio.
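As a worked illustration of the intraclass correlation entry above, the following sketch (ours, with hypothetical ratings; it is not from any Handbook chapter) computes the one-way intraclass correlation, ICC(1) = (MSB − MSW)/(MSB + (k − 1)MSW), from the between- and within-target mean squares of an analysis of variance:

    # Illustrative one-way intraclass correlation from hypothetical ratings.
    # Rows are rated targets (e.g., studies); columns are raters.
    ratings = [
        [4.0, 5.0, 4.0],
        [2.0, 2.0, 3.0],
        [5.0, 5.0, 5.0],
        [1.0, 2.0, 1.0],
    ]

    n = len(ratings)      # number of targets
    k = len(ratings[0])   # number of raters per target

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Between-targets mean square (df = n - 1)
    msb = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    # Within-targets mean square, pooled across targets (df = n * (k - 1))
    msw = sum((x - m) ** 2 for row, m in zip(ratings, row_means) for x in row) / (n * (k - 1))

    icc = (msb - msw) / (msb + (k - 1) * msw)
    print(round(icc, 3))

Values near one indicate that most of the rating variance lies between targets rather than between raters.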
Marginal Distribution: The distribution of a statistic (such as an effect size estimate) when relevant parameters (such as the effect size parameter) are allowed to vary according to the prior distribution.
Maximum Likelihood: A commonly used method for obtaining an estimate for an unknown parameter from a population distribution.
Maximum Likelihood Methods for Missing Data: Methods for analysis of data with missing observations that provide maximum likelihood estimates of model parameters.
Mean of Random Effects: The mean of the true effect sizes in the random-effects model.
Measurement Error: Random departure of an observed score of an individual from his/her actual (true) score. Sources of measurement error include random response error, transient error, and specific factor error.
Mediating Variable: Third variables or sets of variables that provide a causal account of the mechanisms underlying the relationship between other variables. Transmits a cause-effect relationship in the sense that it is a consequence of a more distal cause and a cause of more distal measured effects. A fundamental property of a mediator is that it is related to both a precursor and an outcome, while the precursor may show only an indirect relation to the outcome via the mediator.
Meta-Analysis: The statistical analysis of a collection of analysis results from individual studies for the purpose of integrating the findings.
Meta-Regression: The use of regression models to assess the influence of variation in contextual, methodological, participant, and program attributes on effect size estimates.
Method of Maximum Likelihood: A method of statistical estimation in which one chooses as the point estimates of a set of parameters the values that maximize the likelihood of observing the data in the sample.
Method of Moments: A method of statistical estimation in which the sample moments are equated to their expectations and the resulting set of equations is solved for the parameters of interest.
Metrics of Measurement: The indices used to quantify variables and the relationships between variables.
Missing at Random: Observations that are missing for reasons related to completely observed variables that are included in the model and not to the value of the missing observation.
Missing Completely at Random: Observations that are missing for reasons unrelated to any variables in the data. Missing observations occur as if they were deleted at random.
Missing Data: Data representing either study outcomes or study characteristics that are unavailable to the synthesist.
Model-Based Meta-Analysis: A meta-analysis that analyzes complex chains of events such as the prediction of behaviors based on a set of variables, and models the intercorrelations among variables.
Moderator Variable: A variable or set of variables that affect the direction or magnitude of relations between other variables. Variables such as gender, type of outcome, and other study features are often identified as moderator variables.
Multinomial Distribution: A probability distribution associated with the independent classification of a sample of subjects into a series of mutually exclusive and exhaustive categories. When the number of categories is two, the distribution is a binomial distribution.
Multiple Imputation: A method for data analysis with missing observations that generates several possible substitutes for each missing data point based on models for the reasons for the missing data.
Multiple Operationalism: The use of several measures that share a conceptual definition but that have different patterns of irrelevant components.
Multiple-Endpoint Studies: Studies that measure the outcome on the same subject at different points in time.
Nonignorable Response Mechanism: A term that describes observations that are missing for reasons related to unobserved variables. Occurs when observations are missing because of their unknown values or because of other unknown variables.
Nonparametric Statistical Procedures: Statistical procedures (tests or estimators) whose distributions do not require knowledge of the functional form of the probability distribution of the data. Also called distribution-free statistics.
Normal Deviate: The location on a standard normal curve given in Z-score form.
Normal Distribution: A bell-shaped curve that is completely described by its mean and standard deviation.
Not Missing at Random (NMAR): A type of missing data where the probability of observing a value depends on the value itself; occurs in censoring mechanisms when, for example, high values on a variable are more likely to be missing than moderate or low values.
Odds Ratio: A comparison of odds values for two groups. When the odds ratio is equal to one, the event is equally likely to occur for both groups.
Odds Value: A probability divided by its complement. If π is a probability, its associated odds value is π/(1 − π).
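The logit, log odds ratio, odds ratio, and odds value entries above reduce to one-line computations. The sketch below is illustrative only; the probabilities are hypothetical and the code is not drawn from any chapter:

    import math

    def odds(p):
        # Odds value: a probability divided by its complement, p / (1 - p).
        return p / (1.0 - p)

    def logit(p):
        # Logit (log odds): ln[p / (1 - p)].
        return math.log(odds(p))

    p_treat, p_control = 0.30, 0.20   # hypothetical event probabilities
    o = odds(p_treat) / odds(p_control)   # odds ratio
    l = math.log(o)                       # log odds ratio, ln(o)
    print(round(o, 3), round(l, 3), round(logit(p_treat), 3))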
Omnibus Tests of Significance: Significance tests that address unfocused questions, as in F tests with more than 1 degree of freedom (df) for the numerator or in chi-square tests with more than 1 df.
Operational Definition: A definition of a concept that relates the concept to observable events.
Operational Effect Size: The effect size actually estimated in the study as conducted; for example, the (attenuated) effect size computed using an unreliable outcome measure. Compare with Theoretical Effect Size.
Overdispersion: A larger than expected variance.
p-Value: The probability associated with a statistical hypothesis test. The p-value is the probability of obtaining a sample result at least as large as the one obtained given a true null hypothesis (that is, given that sampling error is the only reason that the observed result deviates from the hypothesized parameter value).
Pairwise Deletion: See Available Case Analysis.
Parallel Randomized Experiments: A set of randomized experiments testing the same hypothesis about the relationship between treatments (independent variables) and outcomes (dependent variables).
Parameter: The population value of a statistic.
Partial Relations: Relationships between two variables where a third (or more) variable has been statistically controlled, such as in a multiple regression equation.
Partial Truncation: A process by which the frequencies of scores in certain parts of the range are reduced without completely eliminating scores in these ranges. For example, if one limits the data to high school graduates and above, the frequency of people with IQs below 85 will be reduced relative to the entire population. See Range Restriction.
Pearson Correlation: See Correlation Coefficient.
Peer Review: The review of research by an investigator's peers, usually conducted at the request of a journal editor prior to publication of the research and for purposes of evaluation to ensure that the research meets generally accepted standards of the research community.
Phi Coefficient (φ): A measure of association between two binary variables that are cross-classified against each other. The phi coefficient is calculated as the product moment correlation coefficient between a pair of binary random variables.
Placebo: A pseudo-treatment administered to subjects in a control group.
Planned Tests: Tests that are planned before the examination of the results of the data under analysis. The same as A Priori Tests.
Policy-Driven Synthesis: A synthesis that is requested by members of a specific policy-shaping community.
Policy-Oriented Quantitative Synthesis: Research syntheses that use aspects of meta-analysis to answer questions about programs, products, and policies.
Policy-Shaping Community: The full array of individuals and organizations with a vested interest in a policy issue.
Pooling Results: Using data or results from several primary studies to compute a combined significance test or estimate.
Post Hoc Tests: Tests that are conducted after examination of the data to explore patterns that seem to be emerging in the data. The same as A Posteriori Tests.
Posterior Distribution: An updated prior distribution that incorporates observed data (also posterior mean, posterior variance, etc.).
Power: See Statistical Power.
Power Analysis: See Statistical Power Analysis.
Precision: A measure used in evaluating literature searches. The ratio of documents retrieved and deemed relevant to total documents retrieved. See Recall.
Precision of Estimation: Computationally, the inverse of the variance of an estimator. Conceptually, larger samples estimate population parameters more precisely than smaller samples. Results from larger samples will have smaller confidence intervals compared to results from smaller samples, all else being equal.
Pretest-Posttest Design: A study in which participants are tested on the outcome variable, exposed to an intervention, and tested again on the outcome variable. Often this design leaves so many validity threats plausible that it is difficult to interpret the results.
Primary Study: A report of original research, usually published in a technical journal or appearing as a thesis or dissertation; refers to the original research reports collected for a research synthesis.
Prior Distribution: An expression of the uncertainty of parameter values as a statistical distribution before observing the data (also prior mean, prior variance, etc.).
Prior Odds: The relative prior likelihood of pairs of parameter values.
Prospective Registration: The compilation or collation of initiated research projects while they are still ongoing. See Retrospective Registration.
Prospective Research Synthesis: A synthesis in which eligible studies for a meta-analysis are identified and evaluated before their findings are even known.
Proximal Similarity: The similarity in manifest characteristics (for example, prototypical attributes) between samples of persons, treatments, outcomes, settings, and times and their corresponding universes.
Proxy Variables: Variables, such as year of study publication, that serve as surrogates for the true, causal variables, such as developments in research methods or changes in data-reporting practices.
Public Policy: A deliberate course of action that affects the people as a whole, now and in the future.
Publication Bias: The tendency for studies with statistically significant results to have a greater chance of being published than studies with non-statistically significant results. Because of this, research syntheses that fail to include unpublished studies may exaggerate the magnitude of a relationship.
Quasi-Experiment: An experiment in which the method of allocation of observations to study conditions is not done by chance. See Randomized Experiment.
Random Effects: Effects that are assumed to be sampled from a distribution of effects. Compare with Fixed Effects.
Random Effects in a Regression Model for Predicting Effect Size: The deviation of study i's true effect size from the value predicted on the basis of the regression model.
Random-Effects Model: A model for combining effect sizes under which observed effect sizes may differ from each other because of both sampling error and true variability in population parameters.
Random-Effects Variance Component: A variance of the true effect size viewed as either the variance of true effect sizes in a population of studies from which the synthesized studies constitute a random sample or the degree of the investigator's uncertainty about the proposition that the true effect size of a particular study is near a predicted value.
Random Assignment: The allocation of individuals, communities, groups, or other units of study to an experimental intervention, using a method that assures that assignment to any particular group is by chance.
Random Sampling: The allocation of individuals, communities, groups, or other units in a population to a sample, using a method that assures that assignment to the sample is by chance.
Randomized Experiment: An experiment that employs random assignment of observations to study conditions. See Quasi-Experiment.
Range Restriction: A situation in a data set in which a preselection process causes certain scores or ranges of scores to be missing from the data. For example, all people with scores below the mean may have been excluded from the data. Range restriction is produced by both Explicit Truncation and Partial Truncation.
Rank Correlation Test: A nonparametric statistical test in which the correlation between two factors is assessed by comparing the ranking of the two samples.
Rate Difference: See Difference Between Proportions and Fourfold Table.
Rate Ratio (Risk Ratio): The ratio of two probabilities, RR, is a popular measure of the effect of an intervention. The names given to this measure in the health sciences, the risk ratio or relative risk, reflect the fact that the measure is not symmetric, but is the ratio of the first group's probability of an undesirable outcome to the second's probability, recognizing that the ratio could also be expressed in terms of the favorable outcome, at least in theory. In fact, in the health sciences, the term rate is often reserved for quantities involving the use of person-time, that is, the product of the number of individuals and their length of follow-up.
Raw Mean Difference: The difference between the means of the intervention and comparison groups.
Recall: A measure used in evaluating literature searches. The ratio of relevant documents retrieved to total documents deemed relevant in a collection. See Precision.
Reduction of Random Effects Variance: The difference between the baseline and the residual random effects variance.
Reference Database (aka Bibliographic Database): A repository of citations to documents, for example, PsycINFO.
Regression Discontinuity Design: The allocation of individuals, communities, groups, or other units of study to intervention conditions, using a specific cutoff score on an assignment variable.
Regression Imputation: Also called Buck's (1960) method; a method for missing data imputation that replaces missing observations with the predicted value from a regression of the missing variable on the completely observed variables; first suggested by Buck (1960).
Relative Risk: See Rate Ratio.
Relevance: The use of construct and external validity as entry criteria for selecting studies for a synthesis.
Reliability: Generally, the repeatability of measurement. More specifically, the proportion of observed variance of scores on a measuring scale that is not due to the variance of random measurement errors; or the proportion of observed variance of scores that is due to variance of true scores. Higher values indicate less measurement error and vice versa.
Replication: The repetition of a previously conducted study.
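The precision and recall measures defined in the preceding entries are simple ratios; the following sketch (illustrative only, with hypothetical search counts) computes both for a single literature search:

    retrieved = 200                 # documents the search returned
    relevant_retrieved = 40         # of those, how many were judged relevant
    relevant_in_collection = 100    # relevant documents assumed to exist overall

    precision = relevant_retrieved / retrieved               # 40/200 = 0.20
    recall = relevant_retrieved / relevant_in_collection     # 40/100 = 0.40
    print(precision, recall)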
Research Problem: A statement of what variables are to be related to one another and an implication of how the relationship can be tested empirically.
Research Quality: Factors such as study design or implementation features that affect the validity of a study or set of studies.
Research Register: A database of research studies (planned, active, and/or completed), usually oriented around a common feature of the studies such as subject matter, funding source, or design.
Research Synthesis: A review of a clearly formulated question that uses systematic and explicit methods to identify, select, and critically appraise relevant research, and to collect and analyze data from the studies that are included in the synthesis. Statistical methods (meta-analysis) may or may not be used to analyze and summarize the results of the included studies. Systematic review is a synonym of research synthesis.
Residual Random-Effects Variance: The residual variance of the true random effects after taking into account a set of between-studies predictors.
Response Mechanism: Term used to describe the hypothetical reasons for missing data.
Retrieved Studies: Primary studies identified in the literature search phase of a research synthesis and whose research report is obtained in full by the research synthesis investigators.
Review-Generated Evidence: In a meta-analysis, evidence that arises from studies that do not directly test the relation being considered. Compare with Study-Generated Evidence.
Robust Methods: Techniques that are not overly sensitive to the choice of a prior or sampling distribution or to outliers in the data.
Robustness: Consistency in the direction and/or magnitude of effect size for studies involving different classes of persons, treatments, settings, outcomes, or times.
Samples: Constituents or members of a universe, class, population, category, type, or entity.
Sampling Error: The difference between an effect size statistic and the population parameter it estimates, which occurs due to employing a sample instead of the entire population.
Search Strategy: The formal query that retrieves documents through term-matching. Consists of controlled vocabulary or natural language or both, and may make use of logical operators such as OR, AND, and NOT. Natural-language searches may further include word proximity operators, phrase-binding, word-stemming devices, wild-card characters, and so on.
Search Terms: Words used to identify the key aspects of a subject for use in searching for relevant documents.
Segmentation: Subclassifying study results to reflect important methodological or policy-relevant issues.
Selection Bias: An error produced by systematic differences between individuals, entities, or studies selected for analysis and those not selected.
Sensitivity Analysis: An analysis used to determine whether and how sensitive the conclusions of the analysis are to violations of assumptions or decision rules used by the synthesist. For example, a sensitivity analysis may involve eliminating dependent effect sizes to see if the average effect size changes.
Sign Test: A test of whether the probability of getting a positive result is greater than chance (that is, 50/50).
Significance Level: See p-Value.
Significance Test: See Statistical Significance Test.
Single-Case Research: Research that studies how individual units change over time, and what causes this change.
Single-Value Imputation: A method for missing data analysis that replaces missing observations with one value such as the complete case mean or the predicted value from a regression of variables with missing data on variables completely observed.
Square Mean Root Sample Size: An average sample size for a group of studies with unequal sample sizes. It is not as influenced by extreme values as the arithmetic mean is.
Standardized Mean Difference: An effect size that expresses the difference (in standard deviation units) between the means of two groups. The d-index is one standardized mean difference effect size.
Statistical Power: The probability that a statistical test will correctly reject a false null hypothesis.
Statistical Power Analysis: A statistical analysis that estimates the likelihood of obtaining a statistically significant relation between two variables, given assumptions about the sample size employed, the desired type I error rate, and the true magnitude of the relationship.
Statistical Significance Test: A statistical test that gives evidence about the plausibility of the hypothesis of no relationship between variables (that is, the null hypothesis).
Stem-and-Leaf Display: A histogram in which the data values themselves are used to characterize the distribution of a variable. Useful for describing the center, spread, and shape of a distribution.
Study Descriptors: Coded variables that describe the characteristics of research studies other than their results or outcomes. For example, the nature of the research design and procedures used, attributes of the subject sample, and features of the setting are all study descriptors.
Study Protocol: A written document that defines the concepts used in the research and that describes study procedures in as much detail as possible before the research begins.
Study-Generated Evidence: In a meta-analysis, evidence that arises from studies that directly test the relation being considered. For example, several studies might compare the relative effects of attending summer school (compared to control students) on boys versus girls. A meta-analysis of these comparisons is considered study-generated evidence. Compare with Synthesis-Generated Evidence.
Subject Subsamples: Groupings of a subject sample by some characteristic, for example, gender or research site.
Substantive Factors: Characteristics of research studies that represent aspects of the phenomenon under investigation, for example, the nature of the treatment conditions applied and the personal characteristics of the subjects treated.
Substantively Irrelevant Study Characteristic: Characteristics of studies that are irrelevant to the major hypotheses under study, but are inevitably incorporated into research operations to make them feasible. The analyst's goal is to show that tests of the hypothesis are not confounded with such substantive irrelevancies.
Sufficient Statistics: Term used for the statistics needed to estimate a parameter; the sum of the values of a variable is the sufficient statistic for computing the mean of a distribution.
Synthesis-Generated Evidence: In a meta-analysis, evidence that arises from studies that do not directly test the relation being considered. For example, some studies might examine the effect of attending summer school for low-SES students, while others might examine this relation for middle-SES students. A meta-analysis of the differential effectiveness of summer school for low- vs. middle-SES students is considered synthesis-generated evidence because the studies do not directly test this relation. Compare with Study-Generated Evidence.
Synthetic Correlation Matrix: A matrix of correlations, each synthesized from one or more studies.
Systematic Artifacts: Statistical and measurement imperfections that bias observed statistics (statistical estimates) in a systematic direction (either upward or downward). See Artifacts.
Test of Significance: See Statistical Significance Test.
Theoretical Effect Size: The effect size that would be obtained in a study that differs in some theoretically defined way from one that is actually conducted. Typically, theoretical effect sizes are estimated from operational effect sizes with the aid of auxiliary information about study designs or measurement methods. For example, the (disattenuated) correlation coefficient corrected for the effects of measurement unreliability is a theoretical effect size. Compare to Operational Effect Size.
Theory Testing: The empirical validation of hypothesized relationships, which may involve the validation of a single relationship between two variables or a network of relations between several variables. Often validation depends on observing the consequences implied by, but far removed from, the basic elements of the theory.
Thesaurus: A list of subject terms in the controlled vocabulary authorized by subject experts to index the subject content of citations in a database.
Threat to Validity: A likely source of bias, alternative explanation, or plausible rival hypothesis for the research findings.
Time Series Design: Research design in which units are tested at different times, typically equal intervals, during the course of a study.
Total Variance of an Effect Size Estimate: The sum of the random effects variance and the estimation (or fixed effects) variance. The same as Unconditional Variance.
Transformation: The application of some arithmetic principle to a set of observations to convert the scale of measurement to one with more desirable characteristics, for example, normality, homogeneity, or linearity.
Treatment-by-Studies Interaction: The varying effect of a treatment across different studies; equivalently, the effect of moderators on the true effect size in a set of studies of treatment efficacy.
Trial Register: A database of research studies, planned, active, and/or completed, usually oriented around a common feature of the studies such as the subject matter, funding source, or design.
Trim and Fill Method: A funnel-plot-based method of testing and adjusting for publication bias in meta-analysis.
True Effect Size: The effect magnitude that an investigator seeks to estimate when planning a study.
True Score: Conceptually, the score an individual would obtain in the complete absence of measurement error. A true score is defined as the average score that a particular individual would obtain if he or she could be independently measured with an infinite number of equivalent measures,
with random measurement error being eliminated by the averaging process.
Unconditional Mean Imputation: Also called mean imputation, a method for missing data analysis that replaces missing observations with the complete case mean for that variable.
Unconditional Variance: An estimate of the total variability of a set of observed effect sizes, presumed to be due to both within-study and between-study variance.
Unit of Analysis: The study unit that contributes to the calculation of an effect size and its standard error within a study. This unit may be a single individual or a collection of individuals, like a classroom or dyad.
Universe: The population (or hyperpopulation) of studies to which one wishes to generalize.
Unpublished Literature: Papers or articles not made available to the public through the action of a publisher.
Unsystematic Artifacts: Statistical imperfections, such as sampling error, which cause statistical estimates to depart randomly from the population parameters that a researcher intends to estimate. See Artifacts.
Utility: The extent to which a synthesis addresses a full array of policy questions.
Validity Threat: See Threat to Validity.
Variance Components: The sources into which the variance of effect sizes is partitioned. Also, the random variables whose variances are the summands for the total variance.
Variance-Covariance Matrix: A matrix (rectangular listing arrangement) that contains the variances and covariances for a set of p variables. Within a single study, the diagonal elements of the covariance matrix are the variances s_ii^2 for variables 1 to p. The covariances s_ij^2, for i, j = 1 to p, i ≠ j, are the off-diagonal elements of the matrix.
Vote-Counting Procedure: A procedure in which one simply tabulates the number of studies with significant positive results, the number with significant negative results, and the number with nonsignificant results. The category with the most votes presumably provides the best guess about the direction of the population effect size. More modern vote-counting procedures provide an effect size estimate and confidence interval.
Weighted Least Squares Regression in Research Synthesis: A technique for estimating the parameters of a multiple regression model wherein each study's contribution to the sum of products of the measured variables is weighted by the precision of that study's effect size estimator.
Weighting: The process of allowing observations with certain desirable characteristics to contribute more to the estimation of a parameter. In meta-analysis, each study's contribution is often weighted by the precision of that study's effect size estimator. Occasionally study results are weighted by different factors, such as study quality.
Within-Study Moderators: Moderators are identified on a within-study basis when two or more levels of the moderator are present in individual reviewed studies. See Between-Studies Moderators and Moderator.
Within-Study Sample Size: The number of units within a primary study.
Z-Transformed Correlation Coefficient: The sample statistic z, population parameter ζ. The Fisher z transformation of a correlation r is .5 ln[(1 + r)/(1 − r)].
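As a concrete illustration of the simple vote-counting procedure defined above, the following sketch (ours; the study results are hypothetical) tabulates significant positive, significant negative, and nonsignificant findings at the .05 level:

    # Illustrative vote count over hypothetical study results, each recorded
    # as (direction of effect, two-sided p-value).
    studies = [("+", 0.01), ("+", 0.03), ("+", 0.20), ("-", 0.04), ("+", 0.02)]

    alpha = 0.05
    votes = {"significant positive": 0, "significant negative": 0, "nonsignificant": 0}
    for direction, p in studies:
        if p >= alpha:
            votes["nonsignificant"] += 1
        elif direction == "+":
            votes["significant positive"] += 1
        else:
            votes["significant negative"] += 1

    # The modal category is the conventional vote-count "winner."
    print(max(votes, key=votes.get), votes)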
APPENDIX A

DATA SETS
Three data sets have been used repeatedly throughout this handbook. This appendix provides brief descriptions of the data sets, references for the research studies that reported the primary data analyses, and integrative research reviews that examined the studies.

1. DATA SET I: STUDIES OF GENDER DIFFERENCES IN CONFORMITY USING THE FICTITIOUS NORM GROUP PARADIGM

Alice Eagly and Linda Carli (1981) conducted a comprehensive review of experimental studies of gender differences in conformity. The studies in this data set are a portion of the studies that Eagly and Carli examined, the studies that they called "other conformity studies." Because all of these studies use an experimental paradigm involving a nonexistent norm group, we call them fictitious norm group studies.

These studies measure conformity by examining the effect of knowledge of other people's responses on an individual's response. Typically, an experimental subject is presented with an opportunity to respond to a question of opinion. Before responding, the individual is shown some data on the responses of other individuals. The data are manipulated by the experimenters, and the "other individuals" are the fictitious norm group. For example, the subject might be asked for an opinion on a work of art and told that 75 percent of Harvard undergraduates liked the work "a great deal."

Several methodological characteristics of studies that might mediate the gender difference in conformity were examined by Eagly and Carli (1981) and by Betsy J. Becker (1986). Two of them are reported here: the percentage of male authors and the number of items on the instrument used to measure conformity. Eagly and Carli reasoned that male experimenters might engage in sex-typed communication that transmitted a subtle message to female subjects to conform. Consequently, studies with a large percentage of male experimenters would obtain a higher level of conformity from female subjects. Becker reasoned that the number of items on the measure of conformity might be related to effect size for two reasons: Because each item is a stimulus to conform, instruments with a larger number of items produce a greater press to conform; instruments with a larger number of items would also tend to be more reliable, and hence they would be subject to less attenuation due to measurement unreliability.

A summary of the ten studies and the effect size estimates calculated by Eagly and Carli, and in some cases recalculated by Becker, is given in table A.1. The effect size is the standardized difference between the mean for males and that for females, and a positive value implies that females are more conforming than males. The total sample size (the sum of the male and female sample sizes), the percentage of male authors, and the number of items on the outcome measure are also given.
Table A.1 Data Set I: Studies of Gender Differences in Conformity

Study                             Total Sample Size   Effect Size Estimate   Male Authors (%)   Items (n)
King 1959                               254                  .35                  100              38
Wyer 1966                                80                  .37                  100               5
Wyer 1967                               125                  .06                  100               5
Sampson and Hancock 1967                191                  .30                   50               2
Sistrunk 1971                            64                  .69                  100              30
Sistrunk and McDavid 1967 [1]            90                  .40                  100              45
Sistrunk and McDavid 1967 [4]            60                  .47                  100              45
Sistrunk 1972                            20                  .81                  100              45
Feldman-Summers et al. 1977 [1]         141                  .33                   25               2
Feldman-Summers et al. 1977 [2]         119                  .07                   25               2

note: These data are from Eagly and Carli 1981 and Becker 1986.
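For reanalyses of table A.1, the conditional variance of each standardized mean difference can be approximated from the sample sizes. The sketch below is illustrative (it is not part of the original analyses) and, because the table reports only total sample sizes, it assumes equal group sizes, n1 = n2 = N/2:

    # Large-sample variance of a standardized mean difference d:
    # v = (n1 + n2)/(n1 * n2) + d**2 / (2 * (n1 + n2)),
    # here under the equal-group-size assumption n1 = n2 = N / 2.
    def d_variance(d, total_n):
        n1 = n2 = total_n / 2.0
        return (n1 + n2) / (n1 * n2) + d ** 2 / (2.0 * (n1 + n2))

    # For example, King 1959 (N = 254, d = .35 as printed in table A.1):
    print(round(d_variance(0.35, 254), 4))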
Table A.2 Data Set II: Studies of the Effects of Teacher Expectancy on Pupil IQ

                                 Estimated Weeks of Teacher-Student
Study                            Contact Prior to Expectancy Induction     d     Standard Error
Rosenthal et al. 1974                             2                       .03        .125
Conn et al. 1968                                 21                       .12        .147
Jose and Cody 1971                               19                      -.14        .167
Pellegrini and Hicks 1972 [1]                     0                      1.18        .373
Pellegrini and Hicks 1972 [2]                     0                       .26        .369
Evans and Rosenthal 1968                          3                      -.06        .103
Fielder et al. 1971                              17                      -.02        .103
Claiborn 1969                                    24                      -.32        .220
Kester 1969                                       0                       .27        .164
Maxwell 1970                                      1                       .80        .251
Carter 1970                                       0                       .54        .302
Flowers 1966                                      0                       .18        .223
Keshock 1970                                      1                      -.02        .289
Henrikson 1970                                    2                       .23        .290
Fine 1972                                        17                      -.18        .159
Greiger 1970                                      5                      -.06        .167
Rosenthal and Jacobson 1968                       1                       .30        .139
Fleming and Anttonen 1971                         2                       .07        .094
Ginsburg 1970                                     7                      -.07        .174

note: These data are from Raudenbush and Bryk 1985; see also Raudenbush 1984.
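The d and standard error columns of table A.2 are all that is needed for a simple fixed-effects summary. The following sketch is illustrative (it does not reproduce any published analysis): it computes the inverse-variance weighted mean, with weights w = 1/se^2, and the homogeneity statistic Q = Σ w (d − mean)^2, which is referred to a chi-square distribution with k − 1 degrees of freedom:

    # d values and standard errors, in the row order of table A.2.
    d  = [0.03, 0.12, -0.14, 1.18, 0.26, -0.06, -0.02, -0.32, 0.27, 0.80,
          0.54, 0.18, -0.02, 0.23, -0.18, -0.06, 0.30, 0.07, -0.07]
    se = [0.125, 0.147, 0.167, 0.373, 0.369, 0.103, 0.103, 0.220, 0.164, 0.251,
          0.302, 0.223, 0.289, 0.290, 0.159, 0.167, 0.139, 0.094, 0.174]

    w = [1.0 / s ** 2 for s in se]                           # precision weights
    mean = sum(wi * di for wi, di in zip(w, d)) / sum(w)     # weighted mean
    q = sum(wi * (di - mean) ** 2 for wi, di in zip(w, d))   # homogeneity Q
    print(round(mean, 3), round(q, 2), "df =", len(d) - 1)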
2. DATA SET II: STUDIES OF THE EFFECTS OF TEACHER EXPECTANCY ON PUPIL IQ

Stephen Raudenbush (1984) reviewed randomized experiments on the effects of teacher expectancy on pupil IQ (see also Raudenbush and Bryk 1985). The experiments usually involve the researcher administering a test to a sample of students. A randomly selected portion of the students (the treatment group) are identified to their teachers as likely to experience substantial intellectual growth.
Table A.3 Data Set III: Validity Studies Correlating Student Ratings of the Instructor with Student Achievement

Study                            Sample                           Ability Control                                n^a      r
Bolton et al. 1979               General psychology               Random assignment                               10     .68
Bryson 1974                      College algebra                  CIAT pretest                                    20     .56
Centra 1977 [1]                  General biology                  Final exam for prior semester of same course    13     .23
Centra 1977 [2]                  General psychology               Final exam for prior semester of same course    22     .64
Crooks and Smock 1974            General physics                  Math pretest                                    28     .49
Doyle and Crichton 1978          Introductory communications      PSAT verbal score                               12    -.04
Doyle and Whitely 1974           Beginning French                 Minnesota Scholastic Aptitude Test              12     .49
Elliott 1950                     General chemistry                College entrance exam scores                    36     .33
Ellis and Rickard 1977           General psychology               Random assignment                               19     .58
Frey, Leonard, and Beatty 1975   Introductory calculus            SAT math scores                                 12     .18
Greenwood et al. 1976            Analytic geometry and calculus   Achievement pretest                             36    -.11
Hoffman 1978                     Introductory math                SAT math scores                                 75     .27
McKeachie et al. 1971            General psychology               Intelligence test scores                        33     .26
Morsh et al. 1956                Aircraft mechanics               Hydraulics pretest                             121     .40
Remmers et al. 1949              General chemistry                Orientation test scores                         37     .49
Sullivan and Skanes 1974 [1]     First-year science               Random assignment                               14     .51
Sullivan and Skanes 1974 [2]     Introductory psychology          Random assignment                               40     .40
Sullivan and Skanes 1974 [3]     First-year math                  Random assignment                               16     .34
Sullivan and Skanes 1974 [4]     First-year biology               Random assignment                               14     .42
Wherry 1952                      Introductory psychology          Ability scores                                  20     .16

note: These data are from Cohen 1983; studies with sample sizes less than ten were excluded.
a. Number of sections.
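A natural first analysis of table A.3 averages the correlations on the Fisher z scale (see table B.3 in appendix B), weighting each study by n − 3, the inverse of z's large-sample variance. The sketch below is illustrative rather than a reproduction of Cohen's analysis:

    import math

    # Correlations and section counts, in the row order of table A.3.
    r = [0.68, 0.56, 0.23, 0.64, 0.49, -0.04, 0.49, 0.33, 0.58, 0.18,
         -0.11, 0.27, 0.26, 0.40, 0.49, 0.51, 0.40, 0.34, 0.42, 0.16]
    n = [10, 20, 13, 22, 28, 12, 12, 36, 19, 12,
         36, 75, 33, 121, 37, 14, 40, 16, 14, 20]

    z = [0.5 * math.log((1 + ri) / (1 - ri)) for ri in r]   # Fisher z
    w = [ni - 3 for ni in n]                                # 1 / Var(z) = n - 3
    z_bar = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
    r_bar = math.tanh(z_bar)                                # back-transform to r
    print(round(r_bar, 3))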
All students are tested again, and the standardized difference between the mean of the treatment group and that of the other students is the effect size. The theory underlying the treatment is that the identification of students as showing promise of exceptional growth changes teacher expectations about those students, which may in turn affect outcomes such as intelligence (possibly through intermediate variables). Raudenbush proposed that an important characteristic of studies that was likely to be related to effect size was the amount of contact that teachers had with the students prior to the attempt to change expectations (1984). He reasoned that teachers who had more contact would have already formed more stable expectations of students, expectations that would prove more difficult to change via the experimental treatment. A summary of the studies and their effect sizes is given in table A.2. The effect size is the standardized difference between the means of the expectancy group and the control group on an intelligence test. Positive values of the effect size imply a higher mean score for the experimental (high-expectancy) group.

3. DATA SET III: VALIDITY STUDIES CORRELATING STUDENT RATINGS OF THE INSTRUCTOR WITH STUDENT ACHIEVEMENT

Peter Cohen reviewed studies of the validity of student ratings of instructors (1981, 1983). The studies were usually conducted in multisection courses in which the sections had different instructors but all sections used a common final examination. The index of validity was a correlation coefficient (a partial correlation coefficient, controlling for a measure of student ability) between the section mean instructor ratings and the section mean examination score. Thus, the effective sample size was the number of sections. Table A.3 gives the effect size (product moment correlation coefficient) and a summary of each study with a sample size of ten or greater.

4. REFERENCES

Becker, Betsy J. 1986. "Influence Again." In The Psychology of Gender: Progress Through Meta-Analysis, edited by Janet S. Hyde and Marcia L. Linn. Baltimore: Johns Hopkins University Press.
Cohen, Peter A. 1981. "Student Ratings of Instruction and Student Achievement: A Meta-Analysis of Multisection Validity Studies." Review of Educational Research 51(3): 281–309.
———. 1983. "Comment on 'A Selective Review of the Validity of Student Ratings of Teaching.'" Journal of Higher Education 54: 449–58.
Eagly, Alice H., and Linda L. Carli. 1981. "Sex of Researchers and Sex-Typed Communication as Determinants of Sex Differences in Influenceability: A Meta-Analysis of Social Influence Studies." Psychological Bulletin 90: 1–20.
Raudenbush, Stephen W. 1984. "Magnitude of Teacher Expectancy Effects on Pupil IQ as a Function of the Credibility of Expectancy Induction: A Synthesis of Findings from 18 Experiments." Journal of Educational Psychology 76(1): 85–97.
Raudenbush, Stephen W., and Anthony S. Bryk. 1985. "Empirical Bayes Meta-Analysis." Journal of Educational Statistics 10(2): 75–98.

4.1 Data Set I: Studies Reporting Primary Results

Feldman-Summers, Shirley, Daniel E. Montano, Danuta Kasprzyk, and Beverly Wagner. 1977. "Influence When Competing Views Are Gender Related: Sex as Credibility." Unpublished manuscript, University of Washington.
King, B. T. 1959. "Relationships Between Susceptibility to Opinion Change and Child Rearing Practices." In Personality and Persuasibility, edited by Carl I. Hovland and Irving L. Janis. New Haven: Yale University Press.
Sampson, Edward E., and Francena T. Hancock. 1967. "An Examination of the Relationship Between Ordinal Position, Personality, and Conformity." Journal of Personality and Social Psychology 5(4): 398–407.
Sistrunk, Frank. 1971. "Negro-White Comparisons in Social Conformity." Journal of Social Psychology 85: 77–85.
———. 1972. "Masculinity-Femininity and Conformity." Journal of Social Psychology 87(1): 161–62.
Sistrunk, Frank, and John W. McDavid. 1971. "Sex Variable in Conformity Behavior." Journal of Personality and Social Psychology 17(2): 200–207.
Wyer, Robert S., Jr. 1966. "Effects of Incentive to Perform Well, Group Attraction, and Group Acceptance on Conformity in a Judgmental Task." Journal of Personality and Social Psychology 4(1): 21–26.
———. 1967. "Behavioral Correlates of Academic Achievement: Conformity under Achievement-Affiliation Incentive Conditions." Journal of Personality and Social Psychology 6(3): 255–63.

4.2 Data Set II: Studies Reporting Primary Results

Carter, D. L. 1971. "The Effect of Teacher Expectations on the Self-Esteem and Academic Performance of Seventh Grade Students." Dissertation Abstracts International 31 (University Microfilms No. 7107612): 4539A.
Claiborn, William. 1969. "Expectancy Effects in the Classroom: A Failure to Replicate." Journal of Educational Psychology 60(5): 377–83.
Conn, L. K., C. N. Edwards, Robert Rosenthal, and D. Crowne. 1968. "Perception of Emotion and Response to Teachers' Expectancy by Elementary School Children." Psychological Reports 22(1): 27–34.
Evans, Judith, and Robert Rosenthal. 1969. "Interpersonal Self-Fulfilling Prophecies: Further Extrapolations from the Laboratory to the Classroom." Proceedings of the 77th Annual Convention of the American Psychological Association 4: 371–72.
Fielder, William R., R. D. Cohen, and S. Feeney. 1971. "An Attempt to Replicate the Teacher Expectancy Effect." Psychological Reports 29(3, pt. 2): 1223–28.
Fine, L. 1972. "The Effects of Positive Teacher Expectancy on the Reading Achievement of Pupils in Grade Two." Dissertation Abstracts International 33 (University Microfilms No. 7227180): 1510A.
Fleming, Elyse, and Ralph Anttonen. 1971. "Teacher Expectancy or My Fair Lady." American Educational Research Journal 8(2): 241–52.
Flowers, C. E. 1966. "Effects of an Arbitrary Accelerated Group Placement on the Tested Academic Achievement of Educationally Disadvantaged Students." Dissertation Abstracts International 27 (University Microfilms No. 6610288): 991A.
Ginsburg, R. E. 1971. "An Examination of the Relationship Between Teacher Expectations and Student Performance on a Test of Intellectual Functioning." Dissertation Abstracts International 31 (University Microfilms No. 710922): 3337A.
Grieger, R. M., II. 1971. "The Effects of Teacher Expectancies on the Intelligence of Students and the Behaviors of Teachers." Dissertation Abstracts International 31 (University Microfilms No. 710922): 3338A.
Henrikson, H. A. 1971. "An Investigation of the Influence of Teacher Expectation upon the Intellectual and Academic Performance of Disadvantaged Children." Dissertation Abstracts International 31 (University Microfilms No. 7114791): 6278A.
Jose, Jean, and John Cody. 1971. "Teacher-Pupil Interaction as It Relates to Attempted Changes in Teacher Expectancy of Academic Ability Achievement." American Educational Research Journal 8: 39–49.
Keshock, J. D. 1971. "An Investigation of the Effects of the Expectancy Phenomenon upon the Intelligence, Achievement, and Motivation of Inner-City Elementary School Children." Dissertation Abstracts International 32 (University Microfilms No. 7119010): 243A.
Kester, S. W. 1969. "The Communication of Teacher Expectations and Their Effects on the Achievement and Attitudes of Secondary School Pupils." Dissertation Abstracts International 30 (University Microfilms No. 6917653): 1434A.
Maxwell, M. L. 1971. "A Study of the Effects of Teachers' Expectations on the IQ and Academic Performance of Children." Dissertation Abstracts International 31 (University Microfilms No. 7101725): 3345A.
Pellegrini, Robert, and Robert Hicks. 1972. "Prophecy Effects and Tutorial Instruction for the Disadvantaged Child." American Educational Research Journal 9(3): 413–19.
Rosenthal, Robert, S. Baratz, and C. M. Hall. 1974. "Teacher Behavior, Teacher Expectations, and Gains in Pupils' Rated Creativity." Journal of Genetic Psychology 124(1): 115–21.
Rosenthal, Robert, and Lenore Jacobson. 1968. Pygmalion in the Classroom. New York: Holt, Rinehart and Winston.

4.3 Data Set III: Studies Reporting Primary Results

Bolton, Brian, Dennis Bonge, and John Man. 1979. "Ratings of Instruction, Examination Performance, and Subsequent Enrollment in Psychology Courses." Teaching of Psychology 6(2): 82–85.
Bryson, Rebecca. 1974. "Teacher Evaluations and Student Learning: A Reexamination." Journal of Educational Research 68(1): 12–14.
Centra, John A. 1977. "Student Ratings of Instruction and Their Relationship to Student Learning." American Educational Research Journal 14(1): 17–24.
Crooks, T. J., and H. R. Smock. 1974. "Student Ratings of Instructors Related to Student Achievement." Urbana: University of Illinois, Office of Instructional Resources.
Doyle, Kenneth O., and Leslie I. Crichton. 1978. "Student, Peer, and Self-Evaluations of College Instructors." Journal of Educational Psychology 70(5): 815–26.
Doyle, Kenneth O., and Susan Whitely. 1974. "Student Ratings as Criteria for Effective Teaching." American Educational Research Journal 11(3): 259–74.
Elliott, D. N. 1950. "Characteristics and Relationships of Various Criteria of College and University Teaching." Purdue University Studies in Higher Education 70: 5–61.
Ellis, Norman R., and Henry C. Rickard. 1977. "Evaluating the Teaching of Introductory Psychology." Teaching of Psychology 4(3): 128–32.
Frey, Peter W., Dale W. Leonard, and William Beatty. 1975. "Student Ratings of Instruction: Validation Research." American Educational Research Journal 12(4): 435–47.
Greenwood, G. E., A. Hazelton, A. B. Smith, and W. B. Ware. 1976. "A Study of the Validity of Four Types of Student Ratings of College Teaching Assessed on a Criterion of Student Achievement Gains." Research in Higher Education 5(2): 171–78.
Hoffman, R. Gene. 1978. "Variables Affecting University Student Ratings of Instructor Behavior." American Educational Research Journal 15(2): 287–99.
McKeachie, W. J., Y. G. Lin, and W. Mann. 1971. "Student Ratings of Teacher Effectiveness: Validity Studies." American Educational Research Journal 8(3): 435–45.
Morsh, Joseph E., George G. Burgess, and Paul N. Smith. 1956. "Student Achievement as a Measure of Instructor Effectiveness." Journal of Educational Psychology 47: 79–88.
Remmers, H. H., F. D. Martin, and D. N. Elliott. 1949. "Are Students' Ratings of Instructors Related to Their Grades?" Purdue University Studies in Higher Education 66: 17–26.
Sullivan, Arthur M., and Graham R. Skanes. 1974. "Validity of Student Evaluation of Teaching and the Characteristics of Successful Instructors." Journal of Educational Psychology 66(4): 584–90.
Wherry, R. J. 1952. Control of Bias in Ratings. PRS Reports Nos. 914, 915, 919, 920, and 921. Washington, D.C.: Department of the Army, Adjutant General's Office.
APPENDIX B

TABLES
Table B.1 Probability that a Standard Normal Deviate Exceeds Z

Second Digit of Z
Z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
.0 .5000 .4960 .4920 .4880 .4840 .4801 .4761 .4721 .4681 .4641
.1 .4602 .4562 .4522 .4483 .4443 .4404 .4364 .4325 .4286 .4247
.2 .4207 .4168 .4129 .4090 .4052 .4013 .3974 .3936 .3897 .3859
.3 .3821 .3783 .3745 .3707 .3669 .3632 .3594 .3557 .3520 .3483
.4 .3446 .3409 .3372 .3336 .3300 .3264 .3228 .3192 .3156 .3121
.5 .3085 .3050 .3015 .2981 .2946 .2912 .2877 .2843 .2810 .2776
.6 .2743 .2709 .2676 .2643 .2611 .2578 .2546 .2514 .2483 .2451
.7 .2420 .2389 .2358 .2327 .2296 .2266 .2236 .2206 .2177 .2148
.8 .2119 .2090 .2061 .2033 .2005 .1977 .1949 .1922 .1894 .1867
.9 .1841 .1814 .1788 .1762 .1736 .1711 .1685 .1660 .1635 .1611
1.0 .1587 .1562 .1539 .1515 .1492 .1469 .1446 .1423 .1401 .1379
1.1 .1357 .1335 .1314 .1292 .1271 .1251 .1230 .1210 .1190 .1170
1.2 .1151 .1131 .1112 .1093 .1075 .1056 .1038 .1020 .1003 .0985
1.3 .0968 .0951 .0934 .0918 .0901 .0885 .0869 .0853 .0838 .0823
1.4 .0808 .0793 .0778 .0764 .0749 .0735 .0721 .0708 .0694 .0681
1.5 .0668 .0655 .0643 .0630 .0618 .0606 .0594 .0582 .0571 .0559
1.6 .0548 .0537 .0526 .0516 .0505 .0495 .0485 .0475 .0465 .0455
1.7 .0446 .0436 .0427 .0418 .0409 .0401 .0392 .0384 .0375 .0367
1.8 .0359 .0351 .0344 .0336 .0329 .0322 .0314 .0307 .0301 .0294
1.9 .0287 .0281 .0274 .0268 .0262 .0256 .0250 .0244 .0239 .0233
2.0 .0228 .0222 .0217 .0212 .0207 .0202 .0197 .0192 .0188 .0183
2.1 .0179 .0174 .0170 .0166 .0162 .0158 .0154 .0150 .0146 .0143
2.2 .0139 .0136 .0132 .0129 .0125 .0122 .0119 .0116 .0113 .0110
2.3 .0107 .0104 .0102 .0099 .0096 .0094 .0091 .0089 .0087 .0084
2.4 .0082 .0080 .0078 .0075 .0073 .0071 .0069 .0068 .0066 .0064
2.5 .0062 .0060 .0059 .0057 .0055 .0054 .0052 .0051 .0049 .0048
2.6 .0047 .0045 .0044 .0043 .0041 .0040 .0039 .0038 .0037 .0036
2.7 .0035 .0034 .0033 .0032 .0031 .0030 .0029 .0028 .0027 .0026
2.8 .0026 .0025 .0024 .0023 .0023 .0022 .0021 .0021 .0020 .0019
2.9 .0019 .0018 .0018 .0017 .0016 .0016 .0015 .0015 .0014 .0014
3.0 .0013 .0013 .0013 .0012 .0012 .0011 .0011 .0011 .0010 .0010
3.1 .0010 .0009 .0009 .0009 .0008 .0008 .0008 .0008 .0007 .0007
3.2 .0007
3.3 .0005
3.4 .0003
3.5 .00023
3.6 .00016
3.7 .00011
3.8 .00007
3.9 .00005
4.0 .00003

source: Reproduced from S. Siegel, Nonparametric Statistics (New York: McGraw-Hill, 1956), p. 247, with the permission of the publisher.
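Table B.1 can be reproduced to four decimal places from the complementary error function, since P(Z > z) = erfc(z/√2)/2. The following sketch is illustrative:

    import math

    def upper_tail(z):
        # Probability that a standard normal deviate exceeds z.
        return 0.5 * math.erfc(z / math.sqrt(2.0))

    print(round(upper_tail(1.96), 4))   # .0250, matching the table entry for z = 1.96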
Table B.2 Critical Values of the Chi-Square Distribution

Left Tail Area
Degrees of Freedom 0.50 0.75 0.90 0.95 0.975 0.99 0.995
1 0.4549 1.323 2.706 3.841 5.025 6.635 7.879
2 1.386 2.773 4.605 5.991 7.378 9.210 10.60
3 2.366 4.108 6.251 7.815 9.348 11.34 12.84
4 3.357 5.385 7.779 9.488 11.14 13.28 14.86
5 4.351 6.626 9.236 11.07 12.83 15.09 16.75
6 5.348 7.841 10.64 12.59 14.45 16.81 18.55
7 6.346 9.037 12.02 14.07 16.01 18.48 20.28
8 7.344 10.22 13.36 15.51 17.53 20.09 21.95
9 8.343 11.39 14.68 16.92 19.02 21.67 23.59
10 9.342 12.55 15.99 18.31 20.48 23.21 25.19
11 10.34 13.70 17.28 19.68 21.92 24.72 26.76
12 11.34 14.85 18.55 21.03 23.34 26.22 28.30
13 12.34 15.98 19.81 22.36 24.74 27.69 29.82
14 13.34 17.12 21.06 23.68 26.12 29.14 31.32
15 14.34 18.25 22.31 25.00 27.49 30.58 32.80
16 15.34 19.37 23.54 26.30 28.85 32.00 34.27
17 16.34 20.49 24.77 27.59 30.19 33.41 35.72
18 17.34 21.60 25.99 28.87 31.53 34.81 37.16
19 18.34 22.72 27.20 30.14 32.85 36.19 38.58
20 19.34 23.83 28.41 31.41 34.17 37.57 40.00
21 20.34 24.93 29.62 32.67 35.48 38.93 41.40
22 21.34 26.04 30.81 33.92 36.78 40.29 42.80
23 22.34 27.14 32.01 35.17 38.08 41.64 44.18
24 23.34 28.24 33.20 36.42 39.36 42.98 45.56
25 24.34 29.34 34.38 37.65 40.65 44.31 46.93
26 25.34 30.43 35.56 38.89 41.92 45.64 48.29
27 26.34 31.53 36.74 40.11 43.19 46.96 49.64
28 27.34 32.62 37.92 41.34 44.46 48.28 50.99
29 28.34 33.71 39.09 42.56 45.72 49.59 52.34
30 29.34 34.80 40.26 43.77 46.98 50.89 53.67
31 30.34 35.89 41.42 44.99 48.23 52.19 55.00
32 31.34 36.97 42.58 46.19 49.48 53.49 56.33
33 32.34 38.06 43.75 47.40 50.73 54.78 57.65
34 33.34 39.14 44.90 48.60 51.97 56.06 58.96
35 34.34 40.22 46.06 49.80 53.20 57.34 60.27
36 35.34 41.30 47.21 51.00 54.44 58.62 61.58
37 36.34 42.38 48.36 52.19 55.67 59.89 62.88
38 37.34 43.46 49.51 53.38 56.90 61.16 64.18
39 38.34 44.54 50.66 54.57 58.12 62.43 65.48
40 39.34 45.62 51.81 55.76 59.34 63.69 66.77
50 49.33 56.33 63.17 67.50 71.42 76.15 79.49
60 59.33 66.98 74.40 79.08 83.30 88.38 91.95
70 69.33 77.58 85.53 90.53 95.02 100.4 104.2
80 79.33 88.13 96.58 101.9 106.6 112.3 116.3
90 89.33 98.65 107.6 113.1 118.1 124.1 128.3
100 99.33 109.1 118.5 124.3 129.6 135.8 140.2

(Continued)

Table B.2 (Continued)

Left Tail Area
Degrees of Freedom 0.50 0.75 0.90 0.95 0.975 0.99 0.995
110 109.3 119.6 129.4 135.5 140.9 147.4 151.9
120 119.3 130.1 140.2 146.6 152.2 159.0 163.6
130 129.3 140.5 151.0 157.6 163.5 170.4 175.3
140 139.3 150.9 161.8 168.6 174.6 181.8 186.8
150 149.3 161.3 172.6 179.6 185.8 193.2 198.4
160 159.3 171.7 183.3 190.5 196.9 204.5 209.8
170 169.3 182.0 194.0 201.4 208.0 215.8 221.2
180 179.3 192.4 204.7 212.3 219.0 227.1 232.6
190 189.3 202.8 215.4 223.2 230.1 238.3 244.0
200 199.3 213.1 226.0 234.0 241.1 249.4 255.3
source: Reprinted from Hedges and Olkin with permission of the publisher.
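Entries of table B.2 can be checked against a chi-square quantile function; the sketch below is illustrative and assumes the scipy library is available:

    from scipy.stats import chi2

    # The tabled value for a given df and left-tail area p is the quantile chi2.ppf(p, df).
    print(round(chi2.ppf(0.95, 1), 3))    # 3.841, as in the first row
    print(round(chi2.ppf(0.99, 10), 2))   # 23.21, as in the df = 10 row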
Table B.3 Fisher's z Transformation of r

Second Digit of r
r .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
.0 .000 .010 .020 .030 .040 .050 .060 .070 .080 .090
.1 .100 .110 .121 .131 .141 .151 .161 .172 .182 .192
.2 .203 .213 .224 .234 .245 .255 .266 .277 .288 .299
.3 .310 .321 .332 .343 .354 .365 .377 .388 .400 .412
.4 .424 .436 .448 .460 .472 .485 .497 .510 .523 .536
.5 .549 .563 .576 .590 .604 .618 .633 .648 .662 .678
.6 .693 .709 .725 .741 .758 .775 .793 .811 .829 .848
.7 .867 .887 .908 .929 .950 .973 .996 1.020 1.045 1.071
.8 1.099 1.127 1.157 1.188 1.221 1.256 1.293 1.333 1.376 1.422
Third Digit of r
r .000 .001 .002 .003 .004 .005 .006 .007 .008 .009
.90 1.472 1.478 1.483 1.488 1.494 1.499 1.505 1.510 1.516 1.522
.91 1.528 1.533 1.539 1.545 1.551 1.557 1.564 1.570 1.576 1.583
.92 1.589 1.596 1.602 1.609 1.616 1.623 1.630 1.637 1.644 1.651
.93 1.658 1.666 1.673 1.681 1.689 1.697 1.705 1.713 1.721 1.730
.94 1.738 1.747 1.756 1.764 1.774 1.783 1.792 1.802 1.812 1.822
.95 1.832 1.842 1.853 1.863 1.874 1.886 1.897 1.909 1.921 1.933
.96 1.946 1.959 1.972 1.986 2.000 2.014 2.029 2.044 2.060 2.076
.97 2.092 2.109 2.127 2.146 2.165 2.185 2.205 2.227 2.249 2.273
.98 2.298 2.323 2.351 2.380 2.410 2.443 2.477 2.515 2.555 2.599
.99 2.646 2.700 2.759 2.826 2.903 2.994 3.106 3.250 3.453 3.800
source: Reprinted from Snedecor and Cochran with permission of the publisher.
note: z is obtained as z = .5 ln[(1 + r)/(1 − r)].
INDEX

Numbers in boldface refer to figures and tables.
abstract databases, 76
Aby, Stephen H., 78
Ackoff, Russell, 52
Acton, Gayle, 163
ad hoc imputation, 406–8, 413–14
AERA (American Educational Research Association), 565
AgeInfo, 115
Ageline database, 95
Agency for Healthcare Research and Quality (AHRQ), 475–76, 568
aggregate analysis, 26–27
agreement rate (AR), 187, 191–92, 192, 195
Ahn, Soyeon, 380
AHRQ (Agency for Healthcare Research and Quality), 475–76, 568
Albarracín, Dolores, 501–2
Alcohol Concern, 113
Allison, Paul, 404–5, 413
Aloe, Ariel, 394
alternative candidate selection model, 444
Amemiya, Takeshi, 409
American Educational Research Association (AERA), 565
American Faculty Directory, 61
American Psychological Association (APA), 183, 565
Amouyel, Philippe, 458
analysis of covariance (ANCOVA), 228–30, 230
analysis of data: aggregate analysis, 26–27; artifact handling, 324–31; descriptors, 152–54; exploratory, 420–23; opportunities in coding, 152–57; overview, 13–14, 44–46. See also interpretation; meta-analysis; missing data
analysis of variance (ANOVA), 39, 190, 191, 280–88. See also fixed-effects models
ANCOVA (analysis of covariance), 228–30, 230
Aos, Steve, 482, 484
APA (American Psychological Association), 183, 565
a posteriori assessment. See post hoc assessment
a priori hypotheses: biases in, 42–43, 308; and homogeneity in sample, 39; vs. post hoc accounts of effect size relations, 462; and quality of primary research, 132–34, 135–39
AR (agreement rate), 187, 191–92, 192, 195
arc sine square root transformations of proportions, 364–65
Armitage, Peter, 246
artifact multiplier (a), 319–21, 324, 328–29, 330–31
artifacts: coding, 199; correlation analysis methods, 324–27; descriptors as clues to, 151; distribution analysis methods, 327–31; introduction, 318; and odds ratio, 250; summary, 331–32; systematic, 319–24; and theoretical vs. operational effect sizes, 41, 42; unsystematic, 318–19
association, 25, 27–29, 30, 391, 543–49. See also causal relationship; correlations
Association of College and Research Libraries, 117
attenuation artifacts, 319–24, 325, 328–31
attitudinal information, 26, 27
attrition, sample, 134, 136, 138–39, 140, 549
audience consideration: for literature review, 5, 6, 53; policy-oriented research synthesis, 484; and quality of research synthesis, 568; reporting format, 522, 525
augmentation of data, 412
Australian Policy Online, 115
author search, 89
availability bias, 105
available-case analysis, missing data, 404–6, 407, 408, 409
averaged ratings, 183–84
Balay, Robert, 78
bare-bones meta-analysis, 318, 328, 330
Bates, Marcia, 64
Baxter, Pam M., 73–101
Bayesian analyses: and dependencies in effect sizes, 546–47; and multiple imputation method for missing data, 412; and odds ratio, 244; for publication bias adjustment, 446; and random-effects models, 298, 305–6, 307, 308–12; and results all in the same direction, 219; and sensitivity analysis, 426–28; and uncertainty about variances, 298, 307, 567
Becker, Betsy J., 377–95, 412, 554
Beery, John, 380
Begg, Colin, 501
Beretvas, Natasha, 388
Berlin, Jesse, 237–53, 266
Bernier, Charles, 52
best-evidence synthesis approach, 486, 487, 512
beta (β), 246
between-coders mean squares (CMS), 190, 191
between-group differences, computing, 224–25, 226–27, 228, 282
between-studies comparisons: Birge ratio and variance explanation, 292; and causal links, 554; meta-regression adjustment approach, 487–88; and model-based meta-analysis, 388, 390–91; moderator variable effects, 463, 464; and policy-oriented research synthesis, 484–85; prior distributions specification, 427–28; in random-effects models, 270–75; and statistical power of meta-analysis, 548
between-studies mean squares (BMS), 190, 191
bias-corrected standardized mean difference (Hedges' g), 226–28
biases: in a priori hypotheses, 42–43, 308; attenuation, 318–24, 325, 328–31; from attrition, 549; availability bias, 105; from avoiding homogeneity test result, 553–54; citation bias, 105; coding, 179, 180–81, 197, 199, 551; in confounding of moderator variables, 552–53; from constant correction factor, 244; and consultation as search resource, 62; cost, 105; descriptors as clues to, 151; dissemination bias, 106; duplication bias, 105; effect-size estimation, 135, 175, 446, 545, 546; exclusion rules to remove, 135–36; familiarity bias, 105, 106; in footnote chasing, 60; in generalized inferences, 550–51; hindsight, 155; language bias in search sources, 67, 92–93, 105, 106, 400; masking of coders to prevent, 174–75, 181, 199, 551–52; and nonrandom assignment, 132–33, 139, 260; North American bias in search sources, 66–67; pipeline bias effect, 442; in policy-oriented research syntheses, 476, 478–80; prestige factor in, 436; reactivity, 551–52; and reporting, 400, 402, 436, 437, 448, 461, 527–28; retrieval bias, 428; sampling, 545, 549–50; sources of research synthesis, 105–6; time-lag bias, 105, 448; transformation bias, 546; visual representation of, 501–2. See also artifacts; missing data; publication bias; selection bias
bibliographic databases. See databases, reference
Biological Abstracts, 113
BiomedCentral, 111, 116
Biondi-Zoccai, Giuseppe, 88
Birge, Raymond, 7
Birge (RB) ratio, 292–93
Bishop, Yvonne, 242
bivariate analysis, 237–53, 244, 265–69, 380–84. See also effect-size parameters
Blitstein, Jonathan, 264, 354
blocking, 553
BMS (between-studies mean squares), 190, 191
Bond, Charles, 459
Bonferroni method, 287, 288, 390, 546
books, database searches of, 78, 92
Boole, George, 84
boolean logic, 84–85
Boole-Bonferroni inequality, 368
bootstrap variance estimation method, 242
Borenstein, Michael, 221–35, 400
Borman, Geoffrey, 419, 485, 487, 488, 497–519
Bornstein, Brian, 155
box and whisker plot, 503–4
box (schematic) plots, 421–23, 502, 503
Breslow, Norman, 246
British Library, 67, 111, 113
Brown, Charles, 245
Brown, Sharon, 163, 379
browsing as search method, 59, 65
Bryk, Anthony, 263–64, 302–3, 305, 307, 308, 341
Buck, S. F., 407–8
Bullock, R. J., 180, 183
Burdett, Sarah, 119
Burke, Michael, 327, 329
Burton, Nancy, 189
Bush, Robert, 7
Bushman, Brad J., 207–20
Byar, David, 245

Campbell Collaboration (C2): C2-RIPE, 95; C2-SPECTR, 96, 108, 116, 485; and evidence-based practices, 481, 483–84; and forest plot, 505; level-of-evidence approach at, 486; as major literature search source, 58, 87; overview, 10; policy role of, 477; profile of, 55–56
Campbell Library, 95
capitalizing-on-chance problem, 545
Carli, Linda, 154, 197
Carlin, John, 426
Carlson, Kevin, 326–27
Carlson, Michael, 33
case-control studies in epidemiology, 486
categorical data: classification codes, 63–64; coding protocol development, 163–66, 167–68, 168–71; effect sizes for, 156–57, 237–53; and fixed-effects models, 280; and heterogeneity, 263; interrater reliability indices, 186–89; in model-based meta-analysis, 392–93; visual display of, 502–4
categorical vs. continuous dependent variables, 280
causal relationship: between-studies comparisons, 554; building explanations, 542–43; and design of study, 27–29, 549; group-level associations, 27–29, 30; and internal validity, 134; mediators as mechanism of, 457, 465–67; and model-driven synthesis, 392; and moderator variables, 463, 542, 552–53; primary research vs. research synthesis on, 548–49, 563; problem formulation, 25, 26, 27–29; and random assignment criterion, 140, 549; robustness of, 550–51; and standards of evidence, 486; and threats to validity, 539; treatment and outcome class associations, 548–49
censored data situation, 409
CENTRAL (Cochrane Central Register of Controlled Trials), 96, 108, 116, 480–81
certainty-uncertainty: alternative estimation approaches, 312–14; and Bayesian analysis, 298, 307, 567; coder uncertainty, 178; and competitive theory testing, 467–69; conclusion, 470–71; defined, 456; directing of subsequent research through, 469–70; introduction, 456–57; mediator evaluation, 465–67; moderator variable evaluation, 462–65; and sample size, 260; variable relations estimation, 457–62; visual presentation of, 500. See also fixed-effects models; random-effects models
CERUK (Current Educational Research in the UK), 117
Chalmers, Thomas, 132, 181
Chan, An-Wen, 400–401
Chan, Wai, 389–90, 392, 393
change within unit or across units, 25–26, 29–30
Chacón-Moscoso, Salvador, 546
characteristics, study. See descriptive data
Charlton, Kelly, 108–9
Cheung, Mike, 389–90, 392, 393
ChildData, 115
Chinese Social Sciences Citation Index, 67
Chinn, Susan, 354
chi-square test of homogeneity, 389–90, 391
chunking method, 81, 82
CINAHL, 113
citation bias, 105
citation counts as quality measure, 130
citations, historical increase in, 10
citation searches, 58, 59, 65–67, 89–90
Clarke, Mike, 94, 115, 521–34
Clarke, Teresa, 115
classical analysis with random-effects models, 297–98
classification codes, 63–64
clinical trials. See randomized controlled trials (RCTs)
ClinicalTrials.gov, 116
cluster randomized designs. See nested designs
CMS (between-coders mean squares), 190, 191
CNSTAT (Committee on National Statistics), 476
Coalition for Evidence-Based Policy, 482
Cochran, William G., 7
Cochrane, Archie, 53
Cochrane Central Register of Controlled Trials (CENTRAL), 96, 108, 116, 480–81
Cochrane Centre, 10
Cochrane Collaboration: eligibility criteria requirements, 163; and evidence-based practices, 480–81, 483–84; as major literature resource, 58, 87–88; overview, 10; policy role of, 477; profile of, 53–55; on recordkeeping and disclosure, 94
Cochrane Database of Abstracts of Reviews of Effects (DARE), 96–97
Cochrane Database of Systematic Reviews, 96
Cochrane Handbook for Systematic Reviews of Interventions, 54, 88, 89, 94, 522
Cochrane Highly Sensitive Search Strategy, 107
Cochrane Library, 96, 115, 505, 517n5
coder drift, 192–93, 551
coders: consensus method, 184–85, 190; inconsistency among different, 545; inference levels for, 164; masking of, 174–75, 181, 199, 551–52; multiple, 189, 551; observational errors, 178–79, 181; specialized, 174, 183; training of, 173–75, 182, 551. See also interrater reliability (IRR)
coding: and biases, 179, 180–81, 197, 199, 551; classification codes, 63–64; double-coding, 171, 174, 199; effect sizes, 166, 167–68, 180; eligibility criteria, 161–63; error-control strategies, 181–97; errors in, 175, 178–79, 181; further research recommendations, 197–99; improvement in, 562; inference levels, 33, 164, 180, 181; introduction, 160; mechanics of, 163–66, 167–68, 168–71; and mediators, 466–67; and missing data, 544; and model-based meta-analysis, 385–87; on multiple measures, 165–66; overview, 12–13; protocol development, 163–66, 167–68, 168–71, 182–83, 485, 530; quality of, 175, 178–91, 197–99; reliability of, 164, 174, 185, 544–45; replicability of, 160–61; reporting on methods used, 529; sample items, 170; software tools, 171–73; time and effort challenges of, 175, 199; transparency, 160–61; variable identification and analysis, 147–58. See also quality of primary research

coding forms, 170–73
coding manual, 170–71
coefficient of log odds difference (β), 246
cogency, sensitivity analysis, 418
Cogprints, 113
Cohen, Jacob, 225, 515
Cohen, Peter, 508–9
Cohen's d, 226–28, 232–33
Cohen's κ (kappa), 187–89, 191–92, 192, 196
Colditz, Graham, 132
collection of data. See searching the literature
collinearity, 291–92
Collins, David, 382–83
combining effect sizes: artifacts, 317–33; avoiding dependencies, 546–47; computer programs, 275; conclusions, 275; fixed-effects models, 258, 261–70, 279–93; interpretive power of, 539; introduction, 258–61; random-effects models, 258, 270–75, 295–315; reporting considerations, 530–31; and sensitivity analysis, 423–28; vote counting procedures, 214–16
The Commercial and Political Atlas (Playfair), 501
Committee on National Statistics (CNSTAT), 476
Committee on Scientific and Technical Communication, 52
Community Preventative Services, 132–33
Competitive Sport Anxiety Index (CSAI), 381
competitive theory testing, 457, 467–69
complete-case analysis, missing data, 403–4, 405–7
Comprehensive Meta-Analysis software, 508, 517n8
comprehensiveness. See exhaustiveness
Comprehensive School Reform (CSR), 484, 485, 514
computer programs. See software tools
Computer Retrieval of Information on Scientific Projects (CRISP), 113
concept-to-operation-fit for variables, 22–23
conceptual redundancy rule, 180
conceptual variable definition, 20–21, 22–24
concordance of multiple raters, 189
conditional inference model, 38–39, 41, 280. See also fixed-effects models
conditional means, imputing missing data with, 407–8, 409
conditional posterior mean study effect, 427
conditional probability (p) for population correlation coefficient (ρ), 217–18
conditional shrinkage estimation, 306
conditional variance in fixed- and random-effects models, 262–73
conference papers and proceedings, 79, 93, 461
confidence interval (forest) plot, 437, 504, 505–10, 507–9
confidence intervals: vs. credibility intervals, 327; equal effect size assumption, 367–68; fixed-effects models, 285–86, 287; and improvements in research synthesis, 562; log odds ratio method, 363–64; nested designs, 344–45, 352; 95 percent posterior confidence interval, 304–6, 363, 368; random-effects models, 272–73; and regression models in multiple-endpoint studies, 371–73; and uncertainty estimation approaches, 312–14; and vote counting, 210–13, 214, 215
confidence ratings, 166, 193–96, 197
conflicts of interest for investigators, 484, 533
confounding, defined, 245
consensus, coder, 184–85, 190
Consolidated Standards of Reporting Trials (CONSORT), 183, 564
constructs, 134, 148–49
construct validity: and causal relationships, 539, 540–43; coding for, 143; and competitive theory testing, 467–68; defined, 131, 134; heterogeneity as threat to, 459–60; and restriction to experimental evidence, 29
consultation as search method, 59, 60–62, 109–10, 181–82, 400, 544
context of research: and coding protocol development, 164; and interpretation of effect sizes, 234; through narrative-quantitative integration, 513, 516; as variable of interest, 151
continuous data, 221–35, 280, 289–92, 504–5
controlled studies, rate ratio in, 240–41. See also randomized controlled trials (RCTs)
controlled vocabulary, 63–65, 66, 86, 92, 93
Cook, Deborah, 121
Cook, Thomas D., 512, 537–60
Coon, Heather, 459–60, 470
Cooper, Harris M., 3–16, 19–35, 58, 59, 62, 108–9, 479, 480, 488, 511, 512–14, 515, 561–72

Copas, John, 431
Cordray, David, 166, 180, 185, 186, 193–94, 195, 196, 197, 199, 402, 473–93, 550
CORK: A Database of Substance Abuse Information, 97
corrected vs. uncorrected correlations, 325
correlational nature of synthesis evidence, 563
correlation (covariance) matrices: coding issues, 156; and dependent effect sizes, 360–62, 364; and missing data handling, 412; in model-based meta-analysis, 379, 385–93; need for in complex syntheses, 564
correlations: artifact analysis methods, 324–27; and competitive theory testing, 469; Fisher z-transformation, 46, 264–65, 273–74, 388–91; fixed-effects models, 264–65; meta-analysis of corrected, 324–29, 330–31; and missing data handling, 406–8, 412, 415; phi coefficient, 241–43; rank correlation test, 440, 501; Spearman rank order, 196; tetrachoric, 258. See also intraclass correlations; Pearson correlation coefficient (r)
cost bias, 105
Cottingham, Phoebe, 484, 485, 489
covariance matrices. See correlation (covariance) matrices
covariates, 245–48, 250–51, 280
coverage in literature review, 5, 5–6, 77–78, 93
Cox and Probit-based effect-size indices, 546
Cozzens, Susan, 568
Craft, Lynette, 381, 383, 384, 388, 392
Crane, Diane, 61
credibility of evidence issue, 327, 485, 486–88
crime and justice studies, 81–89, 120, 153–54, 155, 467
Criminal Justice Abstracts, 113
Criminal Justice Periodical Index, 83–84, 85, 97
CRISP (Computer Retrieval of Information on Scientific Projects), 113
Cronbach, Lee, 297, 475, 556
Cronbach's generalizability (G) theory, 190–91, 193
crossover issue and internal validity, 134
cross-product ratio, 266
cross-sectional studies, 243–44, 469
CSAI (Competitive Sport Anxiety Index), 381
CSR (Comprehensive School Reform), 484, 485, 514
Current Contents alerting service, 60
Current Controlled Trials metaRegister of Controlled Trials, 116
Current Educational Research in the UK (CERUK), 117
DAI (Dissertation Abstracts International), 79
Dallongeville, Jean, 458
Daniels, Michael, 428
data analysis. See analysis of data
data augmentation, 412
Database of Abstracts of Reviews of Effects (DARE), 96–97, 112
Database of Promoting Health Effectiveness Reviews (DoPHER), 116
databases, coding into electronic, 171–72, 173
databases, reference: abstract, 76; author search, 89; book searching, 78, 92; citation searching, 89–90; cost of subscribing, 79, 94; documentation of search, 94; grey literature, 76, 78–79, 92–93, 107–9, 110, 111–17; and high recall, 58–59; improvements in, 564–65; inconsistent coverage over time, 93; introduction, 74–75; manual vs. electronic searches, 91–92; multiple searches, 90–91; non-U.S. publications, 93–94; overview, 12; registries of studies, 436, 545–46, 565, 566; reporting on search methods, 528–29; scope of, 76–77; search strategy, 80; selected databases list, 95–100; selecting, 76–80; steps in search process, 80–89; stopping search process, 94–95; subject indexing, 59, 62–65, 83–84, 85; underrepresented formats, 92–93; vendors, 79–80; vocabulary limitations, 92; Web of Science, 8, 54–55, 58, 66, 477, 490n4. See also Campbell Collaboration (C2); Cochrane Collaboration; MEDLINE database
data collection. See searching the literature
data errors, 318–19
data evaluation, 12–13, 89. See also coding; quality of primary research
date limiting of search project, 78
Dauchet, Luc, 458
Davey, George, 400
Davidson, David, 67
DEFF (Denmark's Electronic Research Library), 117
Deffenbacher, Kenneth, 155
delta (d, δ) (standardized mean difference). See standardized mean difference (d, δ)
DeNeve, Kristina, 108–9
dependencies, statistical: and bias in meta-analytic results, 175; categorical vs. continuous variables, 280; and coding protocol development, 165–66; effect sizes, 44, 358–76, 546–47; multiple measures challenge, 149
DerSimonian, Rebecca, 266, 308
descriptive data: as analysis opportunity, 152–54; in citation index searches, 66; conceptual framework for, 150–51; and effect sizes, 152, 154–55; identifying variables of interest, 148; missing, 401–2, 403, 406–8, 411; and problem formulation, 25, 26–27; in reference databases, 88–89; in subject indexing, 64. See also design, research; methodology
design, research: analysis of covariance, 228–30, 230; and causality issue, 27–29, 549; and competitive theory testing, 469; criteria for analysis of, 162; cross-sectional studies, 243–44, 469; frameworks for describing, 150–51; independent groups, 224–25, 226–28, 228–29, 282, 554; matched groups and mean differences, 225, 227–28, 229; nested designs, 338–55, 462; and odds ratio usage, 243; and problem type, 25; prospective vs. retrospective, 540; quality of primary research, 131–32; reporting on changes in, 530; retrospective, 241, 249, 540; time-series, 25–26, 29–30; two-stage sampling design, 297–98; as variable for research synthesis, 148, 230–31; weighting of data by, 260. See also experimental studies; nonrandomized studies; randomized controlled trials (RCTs)

de Smidt, G. A., 119
Dialog Information Services, 58
dichotomization, 211–12, 546
dichotomous data, 237–53, 244, 265–69. See also effect-size parameters
Dickersin, Kay, 105
Dickerson, Sally, 504
Diener, Ed, 469
difference among proportions, 266–67, 274, 358–62
difference between two probabilities, 238, 239
DIMDI service (Deutsches Institut für Medizinische Dokumentation und Information), 113
Directory Database of Research and Development Activities (READ), 115
disaggregated groups problem, 554
discriminant validity, 541
dissemination bias, 106. See also publication bias
Dissertation Abstracts International (DAI), 79
Dissertation and Theses Database, 111
dissertations, accessing, 79
documentation of search, 94
documents: database search types, 78–79; delivery of searched, 67–68; published literature, 61–62, 477, 545; relevance judgment, 67, 130–31, 141, 144; subject indexing types, 64. See also journals; reporting; unpublished literature
Donner, Allan, 338
DoPHER (Database of Promoting Health Effectiveness Reviews), 116
Doss, Amanda, 460–61
double-coding, 171, 174, 199
DrugScope, 113
DuBois, David, 131
DuMouchel, William, 427–28
duplication bias, 105
Durantini, Marta, 465
Duval, Susan, 429
EAGLE (European Association for Grey Literature Exploitation), 104
Eagly, Alice, 154, 197, 455–72
Ebel, Robert, 190
EBPs (evidence-based practices), 480–82, 484–89
Eccles, Jacquelynne, 384
EconLIT, 76, 97, 114
Educational Resources Information Center (ERIC), 79, 88, 97, 114, 120
education field: bias in policy-oriented research synthesis, 479–80; grey literature, 120; meta-analysis in, 7; No Child Left Behind, 475, 486; random-effects analysis example, 299–300; substantive variable analysis, 155
Education Line, 113
effect-size parameters: analysis of relationships among, 155–57, 456; a priori hypotheses vs. post hoc accounts of, 462; biases in estimation, 135, 175, 446, 545, 546; categorical data, 156–57, 237–53; coding development, 166, 167–68, 180; continuous data, 221–35; conversion among metrics, 231–34; correlations as estimates of, 231; dependence issue, 44, 358–76, 546–47; descriptive analysis of, 152, 154–55; dichotomous data, 237–53; fixed-effects models, 38, 40, 41, 258, 261–70, 279–93, 457–58, 460, 547; and funnel plot, 437, 438–39; identifying variables of interest, 148–50; improvements in deriving estimates, 562; indexes, 44–45, 46, 223; interpretation of, 234, 515–16; methodological impacts on, 165; missing, 400–401, 408–9, 443–44; multivariate normal distribution assumption, 411, 412–13; nested designs, 338–55, 462; overview, 13–14, 224; potential for bias in, 545; in primary research, 223–24, 400–401, 548; and publication bias, 437; vs. p-values, 223; random-effects models, 38, 40, 41, 258, 270–75, 295–315, 458, 460, 547; reducing unreliability, 185; reporting considerations, 135, 489; resources, 234; selection modeling for, 444–47; and statistical conclusion validity, 135; theoretical vs. operational, 41–42; translation into meaningful indicators, 488–89; true vs. estimated, 296, 303–4, 307, 425; unity of statistical methods, 45; variance calculation overview, 223–24; visual display of, 498–511; and vote counting, 208, 214–16, 217, 219. See also combining effect sizes; mediating effects; moderator effects; variance
Egger, Matthias, 118, 121, 400, 440–41, 501, 502
Ehri, Linnea, 502–3
Elbaum, Batya, 120
ELDIS, 115
electronic publishing and availability of information, 436
electronic vs. manual searching, 91–92
Elementary and Secondary Education Act, 475, 486
eligibility criteria: and best-evidence synthesis, 512; bias assessment, 501; and coding strategy, 161–63, 180, 384, 545; random assignment as, 140–41; reporting on, 136, 527–28, 529; and sensitivity analysis, 418; and size of study, 447

Ellis, S. H., 505
EM (expectation-maximization) algorithm, 409–12, 411
EMBASE, 90, 97
empirical assumption on quality of primary research, 139–40
empirical research: extrapolation, 539, 541–42, 552, 555, 556; interpolation, 541–42; research synthesis's contribution to, 6, 38, 456. See also experimental studies
EMS (error mean squares), 190, 191
Environmental Protection Agency (EPA), 478–79, 485, 486
epidemiological studies, 241, 266, 486
equal effect sizes and confidence interval, 367–68
equal sample sizes in nested designs, 348–51
equinormality assumption, 135
Erez, Miriam, 465
ERIC (Educational Resources Information Center), 79, 88, 97, 114, 120
error (residual) mean squares (EMS), 190, 191
errors: and chance in meta-analysis, 545; coding, 175, 178–97; data, 318–19; in linear regression test of publication bias, 441; measurement, 322–23; sampling, 318–19, 326. See also artifacts; standard error-deviation
ESRC Society Today, 111
estimation variance, defined, 242, 270, 298. See also effect-size parameters; variance
European Association for Grey Literature Exploitation (EAGLE), 104
evaluation of data, 12–13, 89. See also coding; quality of primary research
Evaluation Synthesis methodology, GAO report on, 476
evidence-based practices (EBPs), 480–82, 484–89
Evidence Network, 111
exact stratified method for odds ratio estimate, 247
exchangeability assumption and Bayesian analysis, 298
exclusion rules for primary research studies, 135–37
exhaustiveness: grey literature sources, 436; importance of, 106, 485; and leave-one-out method, 424, 426, 516; and representativeness, 43–44; and sensitivity analysis, 420; and stopping literature search, 94–95
expectation-maximization (EM) algorithm, 409–12, 411
experimental studies: and causality, 28–29, 549; and coding protocol development, 165; and eligibility criteria for coding, 162; nested designs, 345; and problem type, 25, 26. See also randomized controlled trials (RCTs)
explanation in problem formulation, 25
exploratory data analysis, 420–23
external literature source for information, 182
external validity, 29, 130–31, 134–35, 460
extrapolation, 539, 541–42, 552, 555, 556
Eysenbach, Gunther, 110
factor analytic models, 380
Fahrbach, Kyle, 388, 389–90, 391, 404, 411–12, 413
fail-safe null method, 442–43
fail-safe sample size approach, 431
familiarity bias, 105, 106
Feldman, Kenneth, 7
FEMLE (fixed-effects maximum likelihood estimator), 309
Fidel, Raya, 64
Fielding, Lori, 78
Fienberg, Stephen, 242
file-drawer problem. See publication bias
findings, research. See effect-size parameters
Finn's r, 191
Fisher, Ronald A., 7, 41, 264, 275, 539
Fisher's exact test, 247
Fisher's z scale, 231
Fisher z-transformed correlation, 46, 264–65, 273–74, 388–91
fixed-effects maximum likelihood estimator (FEMLE), 309
fixed-effects models: combining effect sizes, 258, 261–70, 279–93; and conditional inference model, 38–39; effect size estimate role, 460; and homogeneity, 38, 39, 44, 280, 282–83, 291, 457–58, 553–54; and model-based meta-analysis, 379, 388–89, 391; need for justification in use of, 547–48; in primary research, 39; Q test, 262–63, 265, 267, 292; vs. random-effects models, 41, 270, 280, 566–67; and sensitivity analysis, 423–24, 425, 425–26; and trim-and-fill method, 444; and universe of generalization, 38; variance in, 39, 262–70, 292–93
flat file approach to data structure, 168
Fleiss, Joseph, 237–53, 267, 274
Fleming, Alexander, 258
FMLE (full maximum likelihood estimator), 308–10, 309, 310, 312
footnote chasing, 58, 59, 59–60, 62
forest plot, 437, 504, 505–10, 507–9
formulation of problem. See problem formulation
forward searches, 58, 63, 66
Foster, Craig, 468
foundation of accepted knowledge, research synthesis contribution to, 470
fourfold tables, 244, 265–69, 274
Fowler, Floyd, 26
free-text searching, 92
Freiman, J. A., 505
frequency of occurrence in descriptive research synthesis, 26, 27

Friedman, Lynn, 274
fugitive literature, 104. See also grey literature
full maximum likelihood estimator (FMLE), 308–10, 309, 310, 312
full-text databases, 76
funnel plot: and publication bias, 428–29, 430, 437–40; as visual display of data, 501–2, 504, 505
Furlow, Carolyn, 388
The Future of Meta-Analysis (Wachter and Straf), 476
future research, research synthesis contribution to, 6, 459, 460–61, 469–70, 563
Galbraith, Rex, 505
Galbraith's radial plot, 440, 505
Gale Directory of Databases, 77, 106
GAO (Government Accountability Office), 476, 482–83, 485, 486, 490n1, 8
Garfield, Eugene, 52
Gart, John, 247
generalizability, 418, 460–62, 540–41, 556, 563, 567–68
generalizability (G) theory, 190–91, 193
generalized inferences: and association of treatments and outcomes, 543–49; attrition problem, 549; cautions for reporting, 532; conclusions, 555–57; elements of, 540–43; from empirical research, 6; extrapolation misspecification, 555; failure to test for homogeneity, 553–54; mediating relationship misspecifications, 554–55; moderator variable confounding, 552–53; mono-method bias, 551; mono-operation bias, 551; and random-effects models, 297, 306; rater drift, 551; reactivity effects, 551–52; and research synthesis purpose, 538–40; restricted heterogeneity problem, 541, 550–51, 552; sampling biases, 549–50; statistical power problems, 554; underrepresentation of prototypical attributes, 550. See also universe of generalization
generalized least squares (GLS) method, 387, 389
generative syntheses and subsequent research direction, 469–70
geography of study, 162, 164
Gershoff, Elizabeth, 500
Glanville, Julie, 88
Glass, Gene, 6, 7, 8, 119, 139, 178, 180, 185–86, 193–94, 196, 197, 225, 378, 476, 513, 514, 515, 545, 553, 556, 562
Gleser, Leon J., 357–76, 546
GLS (generalized least squares) method, 387, 389
Goldsmith, Morris, 81
Gomersall, Alan, 118, 120–21
goodness of fit tests, 393
Google Scholar, 66, 67, 94
Google search engine system, 63
Gorey, Kevin, 119
Gottfredson, Denise, 156
Government Accountability Office (GAO), 476, 482–83, 485, 486, 490n1, 8
government agencies and policy-oriented research syntheses, 478–80, 481–82
Graham, John, 400
Gray-Little, Bernadette, 504
Grayson, Lesley, 118, 120–21
Green, Bert, 57, 60, 180
Greenhalgh, Trisha, 59, 107, 108, 109
Greenhouse, Joel B., 58, 417–33
Greenland, Sander, 138, 246
Greenwald, Anthony, 545
grey literature: consultation as resource for, 61–62; definition of terms, 104–5; and exhaustiveness of search, 436; health-care field, 115–16, 118–19; impact on literature search, 104, 105–6, 118–20; information sources, 76, 78–79, 92–93, 107–9, 110, 111–17; introduction, 104; methods, 110–15; neglect of as threat to validity, 461; problems with, 120–21; recommendations, 121–22; resources for, 67–68; search terms and strategies, 106–7
Grey Literature Network Service (GreyNet), 104, 117
Griffith, Belver, 61
Grigg, Jeffrey, 419, 497–519
group-level associations, 27–29, 30
group randomized designs. See nested designs
Grove, William, 192
guessing conventions for coding, 178, 179
Guide to Reference Books (Balay), 78
Guilbault, Rebecca, 155
Gulliford, Martin, 354
Guyatt, Gordon, 121
Haddock, C. Keith, 257–77, 489
Haenszel, William, 246
Hafdahl, Adam, 380, 504
Hahn, Seokyung, 401, 411
Hall, Judith, 57, 60, 180
Halvorsen, Katherine, 522
Handbook of Research Synthesis, 8, 10–11
Handbook of Social Psychology (Mosteller and Bush), 7
Hannum, James, 501
Hare, Dwight, 25
Harris, Alex, 458
Harris, Jeffrey, 428
Harris, Monica, 554
Hartung, Joachim, 272, 307, 312–13
Hawley, Kristin, 460–61
health care field: and certainty evaluation, 458, 466; combining effect sizes, 258–59, 266, 269–70, 275; consumer-oriented evaluation devices, 568; dependent effect sizes, 358; difference among proportions, 359–61; EPA's secondhand smoke research synthesis, 478–79; forest plots' popularity in, 505; grey literature, 108, 115–16, 118–19; Peto method, 244–45; policy-oriented research syntheses, 475–76; publication bias identification, 437–38, 448; regression tests in, 441; reporting standards in, 564; and search methods, 67, 88; threats to validity, 134; treatment effects and effect sizes, 222–23, 366–67. See also Cochrane Collaboration; MEDLINE database

Health Management Information Consortium (HMIC) Database, 113
Health Technology Assessment (HTA) database, 112, 117
Heckman, James, 409
Hedberg, Eric, 354
Hedges, Larry V., 3–16, 37–46, 185, 216, 263, 270, 279–93, 300, 325, 337–55, 379, 418–19, 424, 426, 430, 462, 546, 547, 561–72
Hedges' g, 226–28
Heller, Daniel, 468–69
Helmer, Diane, 108
heterogeneity: and combining effect sizes, 258–59, 264, 275; data analysis considerations, 44–45; difference among proportions, 266–67; difference between two probabilities, 239; and fixed-effects models, 282–83, 547–48; and funnel plot, 439; importance of seeking in literature review, 62; and interpretive power of research synthesis, 540; and narrative-quantitative integration, 514; partitioning of, 283; and phi coefficient problems, 242; and publication bias testing, 441; and random-effects models, 38, 39–41, 270, 272, 296, 302–3, 547–48; restriction on universes of data, 541, 550–51, 552; and retrieval of literature, 419; and robustness, 556; threats to validity from, 459–60, 539; and weighting of data, 262–63
Hewlett Foundation, 482
Hicks, Diana, 118
hierarchical approaches, 168–70, 175, 186, 347, 546–47. See also Bayesian analyses; random-effects models
Higgins, Julian, 263, 302
high-inference codes, 33
Hill, Carolyn, 515
hindsight bias, 155
histograms, 503, 504
HMIC (Health Management Information Consortium) Database, 113
Hofstede, Geert, 459–60
Holland, Paul, 242
homogeneity: and a priori hypotheses for sample, 39; chi-square test of, 389–90, 391; failure to test for, 553–54; and fixed-effects model, 38, 39, 44, 280, 282–83, 291, 457–58, 553–54; and fixed- vs. random-effects model choice, 547–48; and inference power of effect size, 457–59; and narrative-quantitative integration, 514; and retrieval of literature, 419. See also Q test
homogeneous effects model, 301, 301–2
Hopewell, Sally, 103–25
Horowitz, Ralph, 181, 198
How Science Takes Stock: The Story of Meta-analysis (Hunt), 8, 476
HTA (Health Technology Assessment) database, 112, 117
Huber-White robust estimator, 307, 313–14
Hull Family Foundation, 482
Hunt, Morton, 8, 476
Hunter, John E., 7, 8, 308, 321, 325, 327, 329, 554
Hyde, Janet, 181–82
hypergeometric distribution, 249
hypotheses, 20, 42–43, 312–14, 544–45, 563. See also a priori hypotheses; homogeneity; problem formulation
I² (proportion of total variation due to heterogeneity), 263
identification code, study report, 164
identifiers in subject indexing, 64
IDOX Information Service (formerly Planex), 115
Ilies, Remus, 468–69
ILL (interlibrary loan) system, 67, 78, 79, 80, 92
imputation strategies, 406–8, 409, 412–14, 448, 544
inclusion rules. See eligibility criteria
independent groups: between-group differences, 224–25, 226–27, 228, 282; disaggregated groups problem, 554; matched group mean differences, 225, 227–28, 229
independent vs. dependent study statistics, 358. See also bivariate analysis; stochastically dependent effect sizes
index databases, 76
indexers and controlled vocabulary, 63–64
Index to Theses, 111
indicator variables, model-based meta-analysis, 386
indirect effects in model-based meta-analysis, 381–83, 393
indirect range restriction, 322, 323
individuals, applying effect-size analysis to, 344–45, 352–53
inferences. See correlations; effect-size parameters; generalized inferences; hypotheses; regression analyses
Inside Web, 111
Institute of Education Sciences (IES), 481
institutional review boards (IRBs), 109
Institute for Scientific Information (ISI), 54–55, 58, 60, 66–67
integrative research review, 6, 13. See also research synthesis
interactions focus, problem formulation, 30–33
intercoder correlation. See interrater agreement
interlibrary loan (ILL) system, 67, 78, 79, 80, 92
internal validity, 130, 134, 142
The International Bibliography of the Social Sciences, 76, 78, 98
International Clinical Trials Registry Platform, 110
international materials, searching, 66–67, 78, 110
International Transport Research Documentation (ITRD), 114

Internet as literature resource, 53, 94, 110, 436
interpolation, 541–42
interpretation: of effect sizes, 234, 515–16; missing data handling, 399–416; model-based meta-analysis, 387–93; narrative, 498, 511–16; overview, 14; and power of research synthesis, 539, 540; sensitivity analysis, 417–33; uncertainty sources, 463. See also publication bias
interquartile range (IQR), 421
interrater agreement, 180, 185–86, 192
interrater reliability (IRR): across the board vs. item-by-item agreement, 185–86; coder drift, 192–93; vs. confidence ratings, 194–95; hierarchical approaches, 186; indices of, 186–91; intraclass correlations, 190–92, 192, 302–3; overview, 174; rationale, 184–85; and sensitivity analysis, 197, 198
intraclass correlations: in interrater reliability (r1), 190–92, 192, 302–3; in nested designs (ρ), 340, 347, 354
intrarater reliability, 174
INTUTE: Health and Life Sciences, 117
INTUTE: Medicine, 117
invariant sampling frames, 566
investigator-initiated syntheses, 483–89
invisible college, 61, 62
IQR (interquartile range), 421
IRR (interrater reliability). See interrater reliability (IRR)
irrelevant variables, ruling out, 541
ISI (Institute for Scientific Information), 54–55, 58, 60, 66–67
ITRD (International Transport Research Documentation), 114
Iyengar, Satish, 417–33
jackknife variance estimation method, 242
Jackson, Gregg, 7–8, 54, 57
Jacobson, Linda, 299
Jadad, Alejandro, 175
Janda, Kenneth, 193, 195–96
Jennions, Michael, 463
Johnson, Knowlton, 382–83
Jones, Allan, 189, 191–92
journals: in database search, 78; missing data from publication practices, 402, 447; name search of literature, 65; and upgrading of reporting requirements, 564–65
Judd, Charles, 383
judgment process: accounting for mediators in, 466–67; for coding, 179–80, 198; and interpretation of effect sizes, 234
Juffer, Femmie, 458–59
Jüni, Peter, 137–38, 400
Kaizar, E. E., 427
Kalaian, Hripsime, 387
kappa (κ), 187–89, 191–92, 192, 196
Karau, Steven, 464
Keel, Pamela, 120
Kelley, George, 544
Kelley, Krisi, 544
Kemeny, Margaret, 504
Kemmelmeier, Marcus, 459–60, 470
Kendall rank correlation, 440
Kenny, David, 383
keywords in subject searching, 63–65
King, Laura, 469
Klar, Neil, 338
Klump, Kelly, 120
Knapp, Guido, 272, 307, 312–13
Konstantopoulos, Spyros, 279–93, 354
Koriat, Asher, 81
Korienek, Gene, 81
Kumkale, Tarcan, 502
Kvam, Paul, 430
L'Abbé plot, 517n3
Labordoc, 114
Laird, Nan, 266, 308
Lambert, Paul, 246–47
Landenberger, Nana, 153
language: and bias in search sources, 67, 92–93, 105, 106, 400; and eligibility criteria for meta-analysis, 162
large-sample approximations, 45–46, 213, 214, 306–7
latent variables, 380, 431
Latham, Gary, 465
Le, Huy, 317–33
Le, Than, 354
least-effort strategies in searching, 57, 58
leave-one-out method, 424, 426, 516
Lee, Ju-Young, 33
level-1 random-effects models, 298–99
level-2 random-effects model, 299
level of evidence approach, 486–87
Levin, Bruce, 245
Leviton, Laura, 512
Lewis, Beth, 466
Lewis, J. A., 505
Lewis, Ruth, 67
Lexis-Nexis, subject indexing, 83–84, 85
librarians as resources, 61, 62–63, 67, 106
library database resources, 79
Library of Congress Subject Headings, 64
Library Use: Handbook for Psychology (Reed and Baxter), 74–75
Light, Richard J., 7, 8, 42, 182, 208–9, 428, 476, 498–99, 503, 510, 511–12
likelihood function in vote counting, 210–11, 212, 214. See also maximum likelihood estimator (MLE); maximum likelihood methods
linear models: and hierarchical approaches, 546–47; logistic regression, 245–46; loglinear model, 249; mixed-effects, 299, 301, 304–6; and model-based meta-analysis, 391–92; for publication bias testing, 440–41, 501. See also random-effects models
linked meta-analysis, 378–79
Lipsey, Mark W., 119, 147–58, 162, 459, 485, 487, 488–89, 515, 549
listwise deletion, 403–4, 408
literature review, defined, 4–7. See also research synthesis; searching the literature

Little, Roderick, 402, 406, 407–8, 409, 410, 566
Localio, Russell, 250
Locke, Edwin, 465
Loftus, Elizabeth, 81
log-binomial models, 250
logical structure for searching, 84–86
logistic regression analysis, 245–46, 247, 250
log-likelihood function, 210–11, 212, 214
loglinear models, 249
log-logistic distribution, 428
log odds ratio, 232–33, 246–47, 266, 267, 362–64
log ratio, simple, 364
Lösel, Friedrich, 120, 155
low-inference coding, 33, 180, 181
Lu, Congxiao, 401
Lyubomirsky, Sonja, 469
MacKinnon, David, 383
McAuley, Laura, 118–19
McCarty, Carolyn, 120
McGaw, Barry, 8, 515
McGuire, Joanmarie, 180
McHugh, Cathleen, 136
McKinnie, Tim, 104
McLeod, Bryce, 120
McManus, R. J., 61
McPadden, Katherine, 108, 109
macrolevel reporting, 402
main effects: and competitive theory testing, 468–69; and evaluation of research analysis, 456–62; mediators of, 465–67; moderators of, 462–63; problem formulation, 26–30. See also effect-size parameters
Mallett, Susan, 115–16, 118
Mann, Thomas, 58, 59, 62–63, 74
Mantel, Nathan, 245, 246
Mantel-Haenszel method, 246, 247, 248, 267–69, 274
manual search, 91–92, 108
MAR (missing at random), 403, 404, 408–9, 410, 447–48
Markov chain Monte Carlo techniques, 412
MARS (meta-analysis reporting standards), 523–24
masking of coders, 174–75, 181, 199, 551–52
matched groups, mean differences, 225, 227–28, 229
matching, controlling covariates by, 247–48
matrix layout for coding protocol, 171
Matt, Georg E., 180, 197–98, 537–60
maximum likelihood estimator (MLE): between-studies variance, 271; and Cohen's kappa, 189; and model-based meta-analysis, 389; odds ratio for dichotomous data, 244; and publication bias, 430–31; random-effects models, 307, 308–12; types, 309, 310, 310–11, 312; and vote counting, 210–11, 214–15, 217, 219
maximum likelihood methods for missing data, 409–12, 413, 415
Maxwell's RE, 191
Maynard, Rebecca, 484
MCAR (missing completely at random), 402–3, 404, 405–6, 409, 410, 447
mean-effect comparisons, 148, 285, 286–88. See also standardized mean difference (d, δ)
means, imputing missing data with, 406–8, 409
measurement: citation counts as quality measure, 130; impact of errors in, 322–23; interquartile range (IQR), 421; multiple-endpoint studies, 368–74; multiple measures, 148–49, 165–66, 197; and reactivity effects, 551–52; time factor in, 149–50
mediating effects: and causal explanations, 542; and certainty-uncertainty, 465–67; and evaluation of research synthesis, 457; misspecification of, 554–55; model-based testing of, 382–83; post hoc discovery of, 462–63
medical sciences field. See health care field
Medical Subject Headings, 64
MEDLINE database: availability of, 79; coverage of, 77; as grey literature source, 107; and multiple database search, 90; subject indexing, 83–84, 85; summary of, 98
meta-analysis: artifact distortions, 317–33; coining of, 378; continuous data effect sizes, 221–35; criticism of, 53, 517; definitional issues, 6–7; dependent effect sizes, 44, 358–76, 546–47; dichotomous data effect sizes, 237–53; historical perspective, 7–8, 258, 476; improving statistical procedures, 565–68; literature growth of, 55, 56, 57; and need for literature reviews, 52–53; nested designs, 338–55, 462; reporting on statistical methods used, 529–30; statistical considerations, 37–47; vote counting procedures, 207–20. See also combining effect sizes; model-based meta-analysis; sampling for meta-analysis
Meta-Analysis Blinding Study Group, 175
Meta-Analysis for Explanation, 476
meta-analysis reporting standards (MARS), 523–24
metagraph, 517n5
meta-regression analysis, 289, 486, 487–88
method of moments (MOM), 309, 310, 311–12, 390
The Method of Statistics (Tippett), 7
methodology: challenges of deficient reporting, 178–79, 183; and coding protocol development, 165; criteria for analysis of, 162; developments in, 564–68; intercoder judgments on quality of, 180; reporting on, 528–30; research synthesis analysis of, 461–62. See also measurement
method variables as descriptors, 88–89, 148, 151, 154–55
microlevel reporting quality, 402
Miller, James, 132
Miller, Norman, 33

Miller, Thomas, 178, 180, 185–86, 193–94, 196, 197, 545
missing at random (MAR), 403, 404, 408–9, 410, 447–48
missing completely at random (MCAR), 402–3, 404, 405–6, 409, 410, 447
missing data: a priori exclusion of data, 138–39; available-case analysis, 404–6, 407, 408, 409; bias from, 105–6; complete-case analysis, 403–4, 405–7; effect sizes, 400–401, 408–9, 443–44; and eligibility criteria for inclusion, 162; and improvements in databases, 564–65; imputation strategies, 406–8, 409, 412–14, 544; introduction, 400; maximum likelihood methods, 409–12, 413, 415; and mediation effects, 555; model-based methods, 384, 409–13; in primary research, 140, 544; primary researcher as source for, 181–82; and publication bias, 44, 386, 400, 447–49; recommendations for handling, 413–15; reporting on handling of, 529; simple methods, 403–9; statistical improvements in handling, 565; trim-and-fill method for adjusting, 443; types of, 400–402; vote counting procedures, 219
missing studies, 400, 424, 443, 485, 530
misspecification, 44, 307–8, 554–55
mixed-effects linear model, 299, 301, 304–6. See also random-effects models
model-based meta-analysis: advantages of, 380–84; analysis and interpretation, 387–93; average correlation matrix, 379, 384–85; data collection, 384–85; data management, 385–87; limitations of, 384; missing data, 409–13; overview, 378–80; potential for, 567–68; problem formulation, 384; summary, 393–94
moderator effects: and causal explanations, 463, 542, 552–53; and certainty-uncertainty, 462–65; and coding protocol development, 164; and competitive theory testing, 468; confounding of, 552–53; and effect-size estimation, 460; and evaluation of research synthesis, 457; and evidence-based practices, 488; improvements in handling, 563; method variables, 154–55; in model-based meta-analysis, 383, 387, 392–93; post hoc discovery of, 462–63; and selection modeling, 444; as threat to validity, 552–53; visual display of effects, 504
Møller, Anders, 463
MOM (method of moments), 309, 310, 311–12, 390
mono-method bias, 197, 551
Morphy, Paul, 473–93
Mosteller, Frederick, 7, 132
Moyer, Christopher, 501
Mullen, Edward, 481
multifactor analysis of variance, 288–89
multilevel structural equation modeling programs, 392, 393
multiple artifacts, 321–22
multiple coders, 189, 551
multiple effect sizes and random-effects models, 308
multiple-endpoint studies, 368–74
multiple imputation of missing data, 412–13, 414
multiple measures, 148–49, 165–66, 197
multiple operationism, 22–23
multiple raters, concordance of, 189
multiple ratings in coding, 197
multiple regression analysis: for dependent effect size analysis, 359–62, 365–67, 371–73, 374–75; and fixed-effects models, 280, 289–92; in model-based meta-analysis, 379–80; standard error-deviation in, 290, 380
multiple searches, 90–91
multiple-treatment studies, 358–68
multiplier, artifact (a), 319–21, 324, 328–29, 330–31
multivariate approach: for dependent effect size analysis, 359–62, 365–67, 371–73, 374–75; and heterogeneity, 263; and model-based meta-analysis, 379–82, 388, 567; and moderator effect, 553; and weighting of data, 260
multivariate normal distribution, 411, 412–13
Mumford, Emily, 269
Munafo, Marcus, 154–55
Murray, David, 338, 354
Myers, Greg, 52
Najaka, Stacy, 156
Nalen, James, 78
narrative interpretation, 498, 511–16
narrative research, 24–25
National Assessment of Educational Progress (NAEP), 354
National Criminal Justice Reference Service (NCJRS), 98, 114
National Institute of Education, 479
National Library of Medicine (NLM) Gateway, 111
National Reading Panel (NRP), 479–80, 485
National Research Council, 476
National Technical Information Service (NTIS), 78, 98, 107, 111
natural-language terms, searching with, 63, 64
natural units for effect size reporting, 489
NCJRS (National Criminal Justice Reference Service), 98, 114
nested designs, 338–55, 462
Newton-Raphson method, 410
NHS Economic Evaluation Database (NHS EED), 112
95 percent posterior confidence interval, 304–6, 363, 368
NLH, 117
NLM (National Library of Medicine) Gateway, 111

NMAR (not missing at random), 403, 404, 408–9, 448
no association test, model-based meta-analysis, 391
Noblit, George, 25
No Child Left Behind, 475, 486
NOD, 117
noncentral hypergeometric distribution, 249
noncentrality parameter, 263
nonempirical information from searches, 87–88
nonparametric tests for publication bias, 429, 440, 443–44
nonrandom assignment, and biases, 132–33, 139, 260
nonrandomized studies: and a priori exclusion of data, 138; bias potential, 139; covariates in, 245; excluding studies using, 135–36, 140–41; and internal validity, 134; quasi-experimental studies, 25, 260, 345, 549, 564; weighting of data by, 260–61
nonsignificant results: lack of reporting on, 105–6, 400, 401, 402, 403, 408–9, 436; publication bias on, 219, 386
non-U.S. publication, 93–94
normal quantile plot, 502
Normand, Sharon-Lise, 427
NORM freeware, 411, 413
not missing at random (NMAR), 403, 404, 408–9, 448
NRP (National Reading Panel), 479–80, 485
NTIS (National Technical Information Service), 107
NY Academy of Medicine: Grey Literature page, 117
Nye, Barbara, 354
Nye, Chad, 486, 488
observational errors by coders, 178–79, 181
OCLC (Online Computer Library Center), 67, 78, 79, 80, 92
odds ratio (ω): conversion to-from other effect sizes, 231–34; and dependent effect sizes, 362–64; and dichotomous data, 238, 243–51, 266; log odds ratio, 232–33, 246–47, 266, 267, 362–64
(O − E)-V method, 244–45
Ogilvie, David, 107
Oh, In-Sue, 317–33
Oliver, Laurel, 186
Olkin, Ingram, 7, 8, 185, 216, 325, 357–76, 387, 389, 418–19, 424, 426, 462, 546, 547
Olson, Elizabeth, 81
OLS (ordinary least squares) regression and MOM, 311, 312
omega (ω) (odds ratio). See odds ratio (ω)
omnibus tests, fixed-effects models, 290–91
one-factor model for analysis of variance, 280–88
Online Computer Library Center (OCLC), 67, 78, 79, 80, 92
online vs. manual searching, 91–92
openness to information, 67
open source electronic publishing, 76
operational effect size parameters, 41–42
operational variable definition, 20–24
ordinary least squares (OLS) regression and MOM, 311, 312
O'Rourke, Keith, 138
Orwin, Robert G., 166, 177–203, 402, 550
Osburn, Hobart, 327, 329
Osteem, William, 479
Ottawa Statement, 110
Ottenbacher, Kenneth, 563
Ouellette, Judith, 468
outcomes, synthesis. See effect-size parameters
outliers, 318–19, 421, 422, 423, 458
The Oxford Guide to Library Research (Mann), 74
Oyserman, Daphne, 459–60, 470
pairwise analysis, 404–6, 407, 408, 409
PAIS (Public Affairs Information Service), 114
Pansky, Ainat, 81
paper forms for coding protocol, 171
Papers First, 79, 98
parallel schematic plot, 503–4
parameters. See effect-size parameters
parametric selection modeling, 444–47
Parker, Christopher, 156
partial effects in model-based meta-analysis, 380–81
participants, study. See sampling, primary research
partitioning of heterogeneity, 283
Patall, Erika, 488
Paterson, Barbara, 25
path models, 380, 385, 391–92. See also linear models
Peacock, Richard, 59, 107, 108, 109
Pearson, Karl, 7, 41, 258–59, 260, 476
Pearson correlation coefficient (r): and artifact analysis, 324–27; combining effect sizes, 265, 273–74; computing, 231, 233; converting to-from standard mean difference, 233–34; as intercoder correlation, 189–90, 192; in model-based meta-analysis, 387–88. See also population correlation coefficient (ρ)
peer review as sign of quality, 121, 484
Penrod, Steven, 155, 467
perspective of literature review, 5, 5, 6
Peto method, 244–45, 247
Petrosino, Anthony, 118, 486
Petticrew, Mark, 8
phi (φ) coefficient, 241–43
Pigott, Theresa, 263, 399–416
Pillemer, David, 8, 42, 182, 428, 476, 501, 511–12
pilot testing of coding protocol, 166, 168, 182–83
pipeline bias effect, 442
plausible rival hypotheses. See threats to validity

plausible value intervals, 302
Playfair, William, 501, 510
Poelhuis, Caroline, 458–59
point estimation approaches, 305–6, 307, 308–12
Poisson regression models, 250
policy, applying research synthesis to: best practices overview, 483–89; competing agendas and perspectives, 476; conclusion, 489; definitional issues, 474–75; evidence-based practices, 480–82; historical perspective, 8, 476–78; identifying policy makers, 475–76; introduction, 474; investigator-initiated syntheses, 483–89; policy-oriented syntheses, 482–83; potential biases, 478–80; reporting format considerations, 522
policy-driven research syntheses, 482–83
Policy File, 114
policy makers, identifying, 475–76, 489
policy-shaping community, 475–76, 482
politics and policy-oriented research synthesis, 475, 476, 478–79
pooled log odds ratio, 246–47
pooled variance, 269
population, finite vs. infinite, 314n2. See also sampling, primary research
population correlation coefficient (ρ): and attenuation artifacts, 319–24, 325, 328–31; conditional probability for, 217–18; and large-sample approximation, 46, 214; in model-based meta-analysis, 388–89; in vote counting, 212–13, 214, 215–16
population standardized mean difference. See standardized mean difference (d, δ)
positive results, vote counting procedures, 208, 209. See also significant results
Posner, George, 568
posterior vs. prior distributions and Bayesian analyses, 427–28
post hoc assessment, 132–34, 139–40, 287, 462–63, 563–64
post hoc hypotheses, 43
Potsdam consultation on meta-analysis, 533
power: of effect size, 457–59, 539, 540; primary research elements affecting, 548; and quality of primary research, 139; as threat to validity, 554. See also certainty-uncertainty; robustness
precision: and effect size, 223; research synthesis advantage in, 562–63; in searching, 56–57, 60–62, 63, 65, 66; and weighting of data, 547
predictor reliability, 323–24
Premack, Steven, 544, 554
pre-post scores, 225, 227–28
prestige factor in reporting bias, 436
Price, Derek, 6
primary research: causal inference in, 548–49, 563; clustering factor neglect, 340, 347, 354; complexity vs. research synthesis, 562; defining features of interest, 161–63; effect-size treatment in, 223–24, 400–401, 548; exclusion rules for, 135–37; extrapolation capabilities, 542; fixed-effects models, 39; generalizability in, 540–41, 556; importance of detail in for effective synthesis, 31–32; lack of artifact information, 319, 327; mediating process analysis, 467; need for new, 564; nonsignificant results neglect, 105–6, 386, 400, 401, 402, 403, 408–9, 436; operational definitions in, 21–22; and random assignment, 314n1; reporting issues, 178–79, 183, 564–65; threats to validity in, 540, 543–44. See also coding; missing data; quality of primary research; sampling, primary research; searching the literature
printed search tools, 66, 75, 77–78, 91–92
prior vs. posterior distributions and Bayesian analyses, 427–28
private sector and policy-oriented research syntheses, 475, 482
probability sampling, 556
probits, 546
problem formulation: causal relationships, 25, 26, 27–29; defining variables, 20–24; descriptive research, 25, 26–27; group-level associations, 27–29; inference codes, 33; interactions focus, 30–33; introduction, 20; main effect focus, 26–30; model-based meta-analysis, 384; overview, 11–12; questions to ask, 24–26; statistical considerations, 38–43; study- vs. synthesis-generated evidence, 30–33; time-series studies, 29–30; and validity issue, 34
procedures. See methodology
Proceedings First, 79, 98
professional associations, database resources, 79
proportionate difference: arc sine square root transformations, 364–65; difference among proportions, 266–67, 274, 358–62; odds ratio-log odds ratio, 362–64; simple ratio-log ratio, 364; total variation due to heterogeneity, 263
ProQuest, 79, 83–84, 85, 97, 99
Prospective Evaluation Synthesis methodology, 483
prospective registration of studies, 436
prospective vs. retrospective study design, 540
protocol, coding: development of, 163–66, 167–68, 168–71, 182–83, 485; pilot testing of, 166, 168, 182–83; reporting on changes in, 530; revision of, 183
prototypical attributes, underrepresentation of, 550
proximal similarity, 541
proxy variables, 463, 464
PsycARTICLES database, 76, 92–93
PsycEXTRA database, 78
psychology-psychotherapy field: and certainty evaluation of research synthesis, 458–59, 461–63, 466; effect-size relationships, 156; eligibility criteria selection, 162; grey literature resources, 120; meta-analysis in, 7; method-variable analysis, 154–55; model-based meta-analysis, 380–82, 383–84, 385–87, 392–93; resources for, 74–75

PsycINFO database: basic facts, 114; book-length resources, 78; capabilities of, 76, 77; methodology descriptors, 88; and reliance on electronic sources, 75; searching example, 83–84, 86–87; for social science research synthesis, 90; subject indexing, 83–84, 85; summary, 99
PSYNDEX, 114
Public Affairs Information Service (PAIS), 114
publication bias: conclusions, 448–49; as continuing issue, 565; data collection considerations, 44; defined, 105; fail-safe null, 442–43; graphical representation of, 428–29, 430, 437–40, 501–2; introduction, 436–37; mechanisms, 437; and missing data, 44, 386, 400, 447–49; and nonsignificant results, 219, 386; outcome reporting, 448; selection modeling, 444–47; and sensitivity analysis, 196–97, 428–31; statistical tests, 440–42, 565–66; as threat to validity, 540, 545–46; time-lag bias, 105, 448; trim-and-fill method to adjust for, 443–44; vote counting procedure, 216–17, 218
public policy. See policy
published literature, 61–62, 477, 545. See also journals; reporting
p-values vs. effect sizes, 223. See also significant results
Q test: fixed-effects models, 262–63, 265, 267, 292; model-based meta-analysis, 389–90; Pearson on, 275; random-effects models, 270, 271
qualitative research, 24–25, 53, 401, 418, 512
quality of primary research: a priori strategies, 132–34, 135–39; defined, 130; and design of study, 131–32; grey literature problems, 120–21; introduction, 130–32; and journal name search, 65; post hoc strategies, 132–34, 139–40; recommendations, 140–41, 142–43, 144; vs. relevance, 130–31; reporting deficiencies, 178–79, 183; study vs. reporting, 132; tools for assessing, 132–34; validity framework, 130–31, 134–35; weighting of data by, 260–61, 262
quality of research synthesis, 175, 178–91, 197–99, 568–71. See also validity
quality scales, 137–38, 139–40, 261, 568–69
quantitative methods, 7–8, 24–25, 418, 498, 511–16. See also meta-analysis
quasi-experimental studies, 25, 260, 345, 549, 564
quasi-F statistic test, 312–13
question specification and policy-oriented research syntheses, 484–85
Quinn, Jeffrey, 464
QUOROM (Quality of Reporting of Meta-analysis), 522, 533
r (Pearson correlation coefficient). See Pearson correlation coefficient (r)
radial plot, 440, 504–5, 506
Raju, Nambury, 326, 327, 328, 329
random (unsystematic) artifacts, 318–19
random assignment: and causal inference validity, 140, 549; and covariates, 245; as eligibility criterion, 140–41; and external validity, 134–35; and primary research, 314n1; and quality of research, 131–32; and random-effects models, 297; and rate ratio, 241; weighting of data by, 260–61
random-effects models: advantages of, 306–7; application of, 300–306; and Bayesian analyses, 305–6, 307, 308–12; Bayesian view, 298; classical view, 297–98; combining effect sizes, 258, 270–75, 295–315; corrected correlations analysis, 324–27; effect size estimate role, 460; vs. fixed-effects models, 41, 270, 280, 547–48, 566–67; and heterogeneity, 262–63; and homogeneity, 270, 271, 458; level-1 model, 298–99; level-2 model, 299; mixed model, 299, 301, 304–6; and model-based meta-analysis, 379, 388, 390–91; point estimation approaches, 308–12; rationale, 297; and sensitivity analysis, 307, 424–26, 425; simple, 301, 302–3; threats to validity, 307–8; and trim-and-fill method, 444; and unconditional inference model, 38, 40–41; uncertainty estimation approaches, 312–14; variance in, 262–64, 270–74, 297–98, 300–306, 307. See also between-studies comparisons
random effects variance, defined, 297–98. See also between-studies comparisons
randomized controlled trials (RCTs): Cochrane CENTRAL resource, 96, 108, 116, 480–81; and evidence-based practices, 480–81, 482, 486; outcome reporting biases, 448
range restriction, 322, 323, 543–44
rank correlation test, 440, 501
rate ratio (ratio of two probabilities) (RR), 238, 239–41, 248–49, 250–51
rater drift, 551
raters. See coders
Raudenbush, Stephen, 263–64, 295–315, 387
raw (unstandardized) mean difference (D), 224–25, 274–75
RB (Birge ratio), 292–93
RCTs (randomized controlled trials). See randomized controlled trials (RCTs)

reactivity effects, 551–52
READ (Directory Database of Research and Development Activities), 115
recall factor in searching, 56–60, 63, 66
Reed, Jeffrey G., 73–101
reference databases. See databases, reference
registries of studies, 110, 436, 545–46, 565, 566
regression analyses: and conditional inference model, 39; and covariates, 245–46; imputation of missing data, 408, 409; logistic, 245–46, 247, 250; meta-regression, 289, 486, 487–88; and method of moments, 311, 312; in model-based meta-analysis, 385, 393–94; in multiple-endpoint studies, 371–73; for publication bias, 441; and sensitivity analysis, 428. See also multiple regression analysis
relative risk (rate ratio), 238, 239–41, 248–49, 250–51
relevance issue, 67, 130–31, 141, 144
reliability: and averaged ratings for data, 183–84; coding, 164, 174, 185, 544–45; and construct validity, 134, 137; and exclusion rules, 136–37; overview, 6; predictor, 323–24; research synthesis advantage in, 562–63; and weighting schemes, 260–61, 325. See also interrater reliability (IRR)
replicability, 4, 105, 160–61
reporting: biases, 400, 402, 436, 437, 448, 461, 527–28; conclusion, 516–17; of database source information, 78; deficiencies in primary research, 178–79, 183; of effect-size parameters, 135, 489; on eligibility criteria, 136, 527–28, 529; evidence-based practices, 488–89; on exclusion rules, 136; format guidance, 521–34; introduction, 498; missing descriptor variables, 401–2; as most common form of scientific communication, 52; narrative and meta-analytic integration, 511–16; overview, 6, 14–15; quality of, 132; standards for primary research, 564–65; as variable of interest, 151; visual presentation, 498–511. See also publication bias
reports and papers, searching challenges, 92
representativeness, 43–44, 92–93, 550
Research Centers Directory, 61
researchers: characteristics, 151; consultation of, 59, 60–62, 109–10, 181–82, 400, 544
research institutions as database sources, 79
Research Papers in Economics (RePEc), 114
research registries, 110, 436, 545–46, 565, 566
research synthesis: and causal relationship, 548–49, 563; competitive theory testing role, 467–69; criteria for quality assessment, 568–71; defined, 6–7; as empirical research, 38; evaluation checklist, 570; and generalizability, 538–41, 555–57; handbook rationale, 10–11; historical perspective, 7–8, 10; initial skepticism about, 418; introduction, 4; limitations, 563–64; literature review definition, 4–7; methodology developments, 564–68; need for critical evaluation of, 470–71; potentials, 561–63; as research process, 7, 9; stages overview, 11–15; subsequent research direction role, 469–70; uncertainty in contribution of, 456–57. See also meta-analysis; policy; searching the literature
residual (error) mean squares (MSE), 190, 191
restricted maximum likelihood estimator (RMLE), 309, 310, 310–11, 312
restriction of range, 322, 323, 543–44
results, study: reporting guidelines for, 530–31; vote counting procedures, 208, 217, 219. See also effect-size parameters; nonsignificant results; significant results
retrieval bias, 428
retrospective study design, 241, 249, 540
review articles, difficulty in obtaining, 52–53
Review of Educational Research, 7
rho (ρ) (population correlation coefficient). See population correlation coefficient (ρ)
Richard, F. D., 459
Ridsdale, Susan, 81
risk ratio (rate ratio), 238, 239–41, 248–49, 250–51
Ritter, Barbara, 464
RMLE (restricted maximum likelihood estimator), 309, 310, 310–11, 312
Roberts, Helen, 8
Robins, James, 246
Robinson, Jorgianne, 488
Robinson, Leslie, 401
robustness: causal association, 550–51; and heterogeneity, 556; and homogeneity testing, 553–54; Huber-White estimator, 307, 313–14; and moderator effects, 552; sensitivity analysis, 418
Rockefeller Foundation, 482
Rohrbeck, Cynthia, 155
Rolinson, Janet, 67
Rooney, Brenda, 338
Rosenthal, Robert, 7, 8, 26, 106, 196, 299, 428, 431, 545, 553, 554
Rossiter, Lucille, 462–63
Rothstein, Hannah R., 103–25, 400
Rounds, James, 501
RR (rate ratio), 238, 239–41, 248–49, 250–51
Rubin, Donald, 7, 307, 308, 402, 406, 407–8, 409, 414, 566
Russell Sage Foundation (RSF), 476–77, 482
Sachse-Lee, Carole, 120
Sacks, Harold, 181, 196
sampling, primary research: attrition in, 134, 136, 138–39, 140, 549;
bias in, 437, 549–50; and coding protocol development, 164–65; and conditional inference model, 39; eligibility criteria for, 162; error considerations, 318–19, 326; exclusion of studies based on size, 337; and external validity, 134–35; and model-based meta-analysis, 384, 386; multivariate normal distribution assumption for, 411, 412–13; and phi coefficient problems, 242–43; and precision of meta-analysis, 547; and p-values vs. effect sizes, 223; research synthesis power in multiplying, 539; subsample results, 149; threats to validity from, 549–50; and unconditional inference model, 39–41; and variance estimation for phi, 242; vote counting and size of sample, 214; weighting by size, 259–60, 325; within-study effects, 259–60, 264, 267–69, 292. See also descriptive data; population correlation coefficient (ρ); searching the literature
sampling for meta-analysis: artifact effects on, 319–24; and bias in effect sizes, 545; classical view of, 297–98; coding of items, 170; and data collection process, 43–44; equal vs. unequal sizes in nested designs, 348–52; fail-safe sample size approach, 431; and fixed- vs. random-effects models, 258–59, 280; funnel plot assessment, 437, 438–39; invariant sampling frames, 566; large-sample theory in, 45–46, 213, 214, 306–7; probability sampling, 556; Q test vs. I², 263; restrictions on, 539, 540–41; subgroups, 401, 448, 554; threats to validity from, 460, 549–50
Sampson, Margaret, 90
Sánchez-Meca, Julio, 546
SAS (Statistical Analysis System) software, 414
Sato, Tosio, 244
scatterplot. See funnel plot
Schafer, Joseph, 400, 411, 413, 414
Scheffé method, 287, 288
schematic (box) plots, 421–23, 502, 503
Schmidt, Frank L., 7, 8, 308, 317–33
Schmucker, Martin, 120, 155
Schneider, Barbara, 341
Schneider, Jeffrey, 479
School of Health and Related Research (ScHARR), 117
Schram, Christine M., 384, 387
Schulman, Kevin, 250
SCIE (Social Care Institute for Excellence), 133–34
Science Citation Index (SCI), 54–55, 56, 90, 99, 112
Science Citation Index Expanded, 8
SCOPUS, 67, 109, 112
score reliability and construct validity, 134, 137
Scott's π, 191
screening of studies, resources for, 485–86
search engines, 75, 94
searching the literature: availability of online sources, 562; browsing, 59, 65; citation searches, 58, 59, 65–67, 89–90; concept and operation considerations, 23–24; consultation method, 59, 60–62, 109–10, 181–82, 400, 544; document delivery issue, 67–68; exhaustiveness issue, 436, 485; footnote chasing, 59, 59–60; historical perspective, 75; knowing when to stop, 556; language bias in, 67, 92–93, 105, 106, 400; model-based meta-analysis, 384–87; overview, 6, 12; precision factor, 56–57, 60–62, 63, 65, 66; printed search tools, 66, 75, 77–78, 91–92; recall factor, 56–60, 67; relevance issue, 67; reporting on methods, 528–29; reviewers' progress, 52–53; and sensitivity analysis, 419–20; as source of threats to validity, 461; statistical considerations, 43–44; subject indexing, 59, 62–65; Web and Internet sources, 53, 94, 110, 436. See also Campbell Collaboration (C2); Cochrane Collaboration; databases, reference; grey literature
search terms: controlled vocabulary, 63–65, 66, 86, 92, 93; identifying, 81–84, 106–7; natural-language, 63, 64
segmentation scheme for evidence credibility, 487
selection bias: in effect-size estimation, 135; and politics of descriptor reporting, 402; publication bias, 436, 444–47, 448; significant vs. nonsignificant results, 105–6, 400, 401, 402, 403, 408–9, 436, 448, 461
selection principles in data collection, 43–44
sensitivity analysis: alternative candidate selection model, 444; appendix, 431; and coding, 196–97; and combining effect sizes, 423–28; exploratory data analysis, 420–23; and fixed-effects models, 423–24, 425, 425–26; introduction, 418–19; and publication bias, 196–97, 428–31; and random-effects models, 307, 424–26, 425; searching the literature, 419–20; for subgroup reporting biases, 448; summary, 431; trim-and-fill procedure as, 501
setting, study. See context of research
Shadish, William R., 130, 257–77, 401, 456, 489, 540, 554, 556
Shannon, Claude, 84
Shapiro, David, 180, 544
Shapiro, Diana, 180, 544
Shapiro, Peter, 467
Sharpe, Donald, 462–63
Sheeran, Paschal, 120
Shepherdson, David J., 153
Shi, J. Q., 431
shrinkage estimator, 304–6, 307, 428
Shulz, Kenneth, 134
SIGLE (System for Information on Grey Literature in Europe), 104, 111
signal-to-noise ratio, 302
significant results: bias toward reporting, 400, 402, 436, 448, 461; and publication bias, 437; testing in fixed-effects models, 287, 289–90; vote counting procedures, 208, 209
sign test, 209–10
simple random-effects model, 301, 302–3
simple ratio-log ratio, 364
simulated pseudo-data approach to publication bias, 446–47
simultaneous tests, fixed-effects models, 287
Singer, Judith, 498–99
Singer, Robert, 81
Singh, Harshinder, 430
single artifacts, 319–21
single-case research, 29–30
single-value imputation, 406–8, 413–14
Siotani, Minoru, 387
Sirin, Selcuk, 402
skewness, 422
Slavin, Robert, 486, 487, 512, 513
slope coefficients in multiple regression analysis, 380
Smeets, Rob, 54
Smith, Mary Lee, 7, 8, 139, 178, 180, 185–86, 193–94, 196, 197, 515, 545
Smith, Paul, 7, 208–9, 476
Smith-Richardson Foundation, 482
snowballing, 59
Social Care Institute for Excellence (SCIE), 133–34
Social Care Online, 114
social psychology, meta-analysis in, 7
Social Science Research Network (SSRN), 112
Social Sciences Citation Index (SSCI), 8, 54–55, 55–57, 90, 99, 112
social sciences field: and certainty evaluation, 457, 459–61, 463, 465, 468–69; consumer-oriented evaluation devices, 568; crime and justice, 81–89, 120, 153–54, 155, 467; database resources, 54–55, 76, 78, 89–90, 95–100; effect size relationship analysis, 156; and forest plot, 505–10; grey literature, 108–9, 118, 119–20; missing data handling, 402, 403; model-based meta-analysis, 378; multiple-endpoint studies, 370–71; nested designs, 338, 345; and policy-oriented research analysis, 479–80, 484–85; publication bias, 439–40, 441–42, 444, 446–47; regression tests in, 441; reporting standards in, 564; research synthesis impact on, 8; sensitivity analysis, 418–19, 422–23; threats to validity, 134; treatment effects and effect sizes, 222–23. See also Campbell Collaboration (C2); education field
Social Sciences Literature Information System (SOLIS), 112
Social Services Abstracts & InfoNet, 114
Social Work Abstracts, 88–89, 99–100
Sociological Abstracts, 100, 115
Sociology: A Guide to Reference and Information Sources (Aby, Nalen, and Fielding), 78
software tools: coding, 171–73; Comprehensive Meta-Analysis, 508, 517n8; listwise deletion default, 403–4; missing data handling, 403–4, 405; multilevel structural equation modeling, 392, 393; multiple regression analysis, 289; NORM freeware, 411, 413; publication bias adjustment, 449; SAS, 414; Schmidt and Le artifact distribution, 327; SPSS Missing Value Analysis, 408, 410, 414; Stata, 507, 517n7; StatXact, 247; WinBUGS, 426
SOLIS (Social Sciences Literature Information System), 112
Song, Fujian, 106
Soydan, Haluk, 155
Spearman rank order correlations, 196
specialized coders, 174, 183
Sporer, Siegfried, 81
spreadsheets as coding forms, 172–73
SPSS Missing Value Analysis software, 408, 410, 414
SSCI (Social Sciences Citation Index), 8, 54–55, 55–57, 90, 99, 112
SSRN (Social Science Research Network), 112
Stagner, Matt, 484
standard deviation: artifact attenuation correction, 329–31
standard error: and available-case estimates, 405; of effect size, 223, 446; means in fixed-effects models, 285; and missing data handling, 409; in multiple regression analysis, 290, 380; and nested designs, 338; and random-effects models, 307; in reporting, 502
standardized mean difference (d, δ): computing, 227–30; conversion to/from other effect sizes, 231–34; fixed-effects models, 264, 285–86; multiple-endpoint studies, 368–74; multiple-treatment studies, 365–68; nested designs, 338, 340–44, 348–54; overview of meta-analysis use, 45; random-effects models, 272–73; vs. raw mean difference, 224; and stem-and-leaf plot, 420–21; in vote counting, 211–12, 214–15
standardized mean difference d and g, 225–31
standards of evidence issue, 485, 486–88
Stanley, Julian, 539
Stata, 507, 517n7
statistical analysis: conclusion, 46; data analysis, 44–46; data collection, 43–44; introduction, 38; as meta-analysis domain, 6–7; problem formulation, 38–43. See also meta-analysis
statistical conclusion validity, 131, 135
Statistical Methods for Research Workers (Fisher), 7
Statistical Procedures for Meta-Analysis (Hedges and Olkin), 8
StatXact software, 247
stem-and-leaf plot, 420–21, 502–3, 510
Sterne, Jonathan, 502
stochastically dependent effect sizes: conclusion, 374–76; introduction, 358; proportions in multiple-treatment studies, 358–65; standardized mean differences in multiple-endpoint studies, 368–74; standardized mean differences in multiple-treatment studies, 365–68; as threat to validity, 546–47
Stock, William, 173, 185–86
Stokes-Zoota, Juli, 459
Straf, Miron, 476, 478, 563, 570–71
stratified methods for covariate estimates, 246, 247
Strike, Kenneth, 568
structural equation modeling, 567–68. See also model-based meta-analysis
studies: number of, 438, 440; vs. reporting in quality, 132. See also primary research
study- vs. synthesis-generated evidence, 30–33
subgroup analyses and reporting, 401, 448, 554
subject coverage, database, 77
subject headings in controlled vocabulary, 64
subject indexing, 59, 62–65, 83–84, 85
subject search, term selection, 81–84
subsamples, 149
subscales, 358
subsequent research, research synthesis contribution to, 6, 459, 460–61, 469–70, 563
substantive expertise, coders with, 174, 183
substantive factors, 150–51, 153, 155. See also variables
substitution, method of, 250
suppression of information problem, 105, 436, 437, 444, 445–46
Sutton, Alexander, 246–47, 400, 435–52
Svyantek, Daniel, 180, 183
Swaisgood, Ronald R., 153
Swanson, Lee, 120
Sweeting, Michael, 244, 246
systematic artifacts, 319–24
systematic review, 6. See also research synthesis
systematic variance. See between-studies comparisons
System for Information on Grey Literature in Europe (SIGLE), 104, 111
Szurmak, Joanna, 104
tables in meta-analytic reports, 498, 499, 500–501
Tanner, Martin, 412
Task Force on Community Preventive Services, 132–33
Taveggia, Thomas, 7
Taylor Series-based methods, 329
Terpstra, David, 180
tetrachoric correlations, 258
Tetzlaff, Jennifer, 121
theoretical considerations: competitive theory testing, 457, 467–69; Cronbach's generalizability (G) theory, 190–91, 193; effect size parameters, 41–42; generalization theory need, 556; large-sample theory, 45–46, 213, 214, 306–7; and past vs. present work, 555; and research synthesis, 6, 456; uncertainty in, 464–65
thesauri, database, 23, 64, 82–83, 84, 92
Thesaurus of ERIC Descriptors, 64
Thesaurus of Psychological Index Terms, 64
Thompson, Simon, 263, 302
Thomson Scientific, 54–55, 58, 60, 66–67
threats to validity: attrition as, 134; crossover, 134; defined, 130; heterogeneity as, 459–60; overview, 8, 540; random-effects models, 307–8; in research synthesis, 8; statistical conclusion validity, 135. See also biases; certainty-uncertainty; generalized inferences
three-level nested designs, 345–53
three-variable relationships and evidence sources, 30–33
time factor: in coding, 175, 199; and eligibility criteria, 162–63; in variable measurement, 149–50; years of coverage for databases, 77–78, 93
time-lag bias, 105, 448
time-series studies, 25–26, 29–30
Tippett, Leonard H. C., 7
training of coders, 173–75, 182, 551
Tran, Zung Vu, 544
transformation bias, 546
transparency, 160–61, 483–84
Transparent Reporting of Evaluations with Nonrandomized Designs (TREND), 564
Transportation Research Information Services (TRIS) Online, 115
Trawling the net, 117
treatments: association with outcome classes, 543–49; and coding protocol development, 165; and effect sizes, 222–23, 366–67, 400–401, 548; multiple-treatment studies, 358–68
TREND (Transparent Reporting of Evaluations with Nonrandomized Designs), 564
TrialsCentral, 116
trim-and-fill method, 429, 443–44, 501
TRIS (Transportation Research Information Services) Online, 115
true vs. estimated effect size, 296, 303–4, 307, 425
Turkheimer, Eric, 503–4
Turner, Herbert, 107
Tweedie, Richard, 429
Twenge, Jean, 465, 468, 469, 544
two-level hierarchical linear model. See random-effects models
two-level nested designs, 338–45, 354
two-stage sampling design, 297–98
two-variable relationships and evidence sources, 30–31, 32
U. S. Government Information: Technical Reports, 117
UK Centre for Reviews and Dissemination, 117
Ukoumunne, Obioha, 354
uncertainty. See certainty-uncertainty
unconditional inference model, 38, 39–41, 280. See also random-effects models
unconditional shrinkage estimation, 304–6
unconditional variance in random-effects models, 271–72
uncorrected vs. corrected correlations, 325
underrepresentation of prototypical attributes, 550
undetermined designation for coded information, 179
unequal sample sizes in nested designs, 351–52
unitizing method, 81, 82
unity of statistical methods, 45
univariate analysis in multiple-endpoint studies, 373–74
universe of generalization: biases in inferences, 550–51; conditional vs. unconditional inference models, 38–41; limited heterogeneity in, 541, 550–51, 552; and random-effects models, 297; representativeness of, 43–44; target vs. novel, 555, 556
unpublished literature, 61–62, 79, 400, 420. See also grey literature
unreliability, 185, 544–45. See also reliability
unstandardized (raw) mean difference (D), 224–25, 274–75
unsystematic (random) artifacts, 318–19
Upchurch, Sandra, 163
Urbadoc, 115
Urquhart, Christine, 67
utility optimization in policy-oriented research syntheses, 484–85, 489, 490n7
Valentine, Jeffrey C., 129–46
Valeri, Sylvia, 120
validity: causal inference, 140, 549; and certainty level, 456; and coding, 33, 164, 178; defined, 130; discriminant, 541; external, 29, 130–31, 134–35, 460; internal, 130, 134, 142; and problem formulation, 34; quality of primary research framework for, 130–31, 134–35; and quality scale problems, 138; statistical conclusion, 131, 135; and variable relationships, 29; and weighting schemes, 260–61. See also construct validity
van Ijzendoorn, Marinus, 458–59, 461
Vargas, Sadako, 479–80
variables: categorical vs. continuous, 280; in coding, 147–58; conceptual, 20–21, 22–24; conditional inference model, 39; continuous, 221–35, 280, 289–92, 504–5; defining, 20–24; identifying for meta-analysis, 148–51; indicator, 386; inference codes, 33; interaction approach, 30; isolation of questionable, 197; latent, 380, 431; operational, 20–24; and problem formulation, 20–30; proxy, 463, 464; relations estimation, 457–62; ruling out irrelevant, 541; significance test, 208; and study- vs. synthesis-generated evidence, 30–33; substantive, 150–51, 153, 155; and uncertainty in research synthesis, 461; and validity question, 34. See also descriptive data; effect-size parameters; mediating effects; moderator effects
variance: ANCOVA, 228–30, 230; ANOVA, 39, 190, 191, 280–88; between-studies, 270–75, 292, 388, 390–91, 427–28, 548; and Birge ratio, 292; corrected correlations, 325–27; corrected standard deviation, 329–30; dependence in data collection, 44; effect size calculation overview, 223–24; estimation, 242, 270, 298; in fixed-effects models, 39, 262–70, 292–93; interquartile range (IQR) measurement, 421; multifactor analysis of, 288–89; one-factor model for, 280–88; and phi coefficient, 242; pooled, 269; in random-effects models, 262–64, 270–74, 297–98, 300–306, 307; and uncertainty in Bayesian analyses, 298, 307, 567; within-group, 282–83, 347. See also correlation (covariance) matrices
variance component. See between-studies comparisons
variance-covariance matrix and missing data, 404–5
variance stabilizing metric, 364–65
variation to signal ratio, 302
Varnell, Sherri, 354
Verma, Vijay, 354
Vevea, Jack L., 41, 177–203, 270, 430
Viechtbauer, Wolfgang, 543–44
Virtual Technical Reports Central, 117
Visher, Christy A., 486
visual presentation, 489, 498–511, 517n1
vocabulary of document retrieval, 62–65, 81, 92. See also search terms
vote counting: combining with effect-size estimates, 214–16; conclusions, 219; conventional procedure, 208–13, 214; integration with narrative description, 513; introduction, 208; missing data problem, 219; publication bias application, 216–17, 218; results using, 208, 217, 219
Wachter, Kenneth, 476, 478, 563, 570–71
Waldron, Mary, 503–4
Wallace, David, 466
Wang, Morgan, 207–20
Washington State Institute for Public Policy, 482
Watson, David, 468–69
Web as literature resource, 53, 94, 110, 436
Webb, Eugene, 22–23
Webb, Thomas, 119–20
Web Citation Index, 66
Web of Knowledge, 109
Web of Science database, 8, 54–55, 58, 66, 477, 490n4
weighted Kappa, 187–89, 191–92, 192
weight function w(z), 430–31, 445–47
weighting of study data: combining effect sizes, 259–61, 269; and dependent effect sizes, 361–62; in fixed-effects models, 261–62, 284, 290–91; and heterogeneity, 262–63; importance for threat avoidance, 547; log odds ratio, 246; multifactor analysis of variance, 288–89; multiple-endpoint studies, 369–70; multiple regression analysis, 289–90; multiple-treatment studies, 361; and quality of primary research, 260–61, 262; in random-effects models, 272; and reliability, 260–61, 325; and selection modeling, 445; and sensitivity analysis, 424, 426, 427; and within-study sample size, 259–60, 267
Weintraub, Irwin, 105
Weisz, John, 120, 460–61
Wells, Brooke, 465
Wells, Gary, 81
What Works Clearinghouse (WWC): and evidence-based practices, 481, 484, 485–86, 487; policy role of, 477; quality of research evaluation, 133; translation of effect sizes, 488
White, Howard D., 51–71
Whitehurst, Grover, 189
Whiteside, Mary F., 384, 385
WIC program, 482, 483, 485, 486
Willett, John, 498–99
Williamson, Paula, 401, 402
William T. Grant Foundation, 482
Wilson, David B., 119, 156, 159–76, 459, 544–45
Wilson, Patrick, 58, 60–61, 62, 65
Wilson, Sandra, 155, 485, 487, 488–89
Wilson Education Index, 100
WinBUGS software, 426
within-group variance, 282–83, 347
within-studies mean squares (WMS), 190, 191
within-study comparisons: and Bayesian analysis, 298; moderator variable effects, 463, 464, 553; and sample in primary research, 259–60, 264, 267–69, 292
WLS (weighted least squares) regression and MOM, 311–12
WMS (within-studies mean squares), 190, 191
Wolf, Frederic, 478
Wong, Wing, 412
Wood, Wendy, 455–72
Woolf, Barnet, 244
Working Group on American Psychological Association Journal Article Reporting Standards (the JARS Group), 522, 533
WorldCat, 92, 100
Wortman, Paul, 186
Wright, Almroth, 258–59, 260
Wu, Meng-Jia, 389, 394, 412
WWC (What Works Clearinghouse). See What Works Clearinghouse (WWC)
Yates, Frank, 7, 41
year of publication, 463, 465
years of coverage, database, 77–78, 93
Yeaton, William, 186
Yerkey, Neil, 52
Yoder, Janice, 464
Yu, Eunice, 181, 198
Yule, 242
Yule's Y, 191
Yurecko, Michele, 479–80
Yusuf, Salim, 244–45, 506
Yusuf-Peto method, 247
Zhao, Ping, 430
z-transformed correlation, 46, 264–65, 273–74, 388–91
Zwick, Rebecca, 189