See discussions, stats, and author profiles for this publication at: http://www.researchgate.net/publication/257591589

4 authors, including: Andreas Ivarsson (Halmstad University) and Magnus Lindwall (University of Gothenburg)
Review
Article history: Received 26 April 2012; Received in revised form 23 July 2012; Accepted 24 July 2012; Available online 1 September 2012

Abstract
Objectives: The main objectives of this article are to: (a) investigate if there are any meaningful differences between adjusted and unadjusted effect sizes, (b) compare the outcomes from parametric and nonparametric effect sizes to determine if the potential differences might influence the interpretation of results, (c) discuss the importance of reporting confidence intervals in research, and (d) discuss how to interpret effect sizes in terms of practical real-world meaning.
Design: Review.
Method: A review of how to estimate and interpret various effect sizes was conducted. Hypothetical
examples were then used to exemplify the issues stated in the objectives.
Results: The results from the hypothetical research designs showed that: (a) there is a substantial difference between adjusted and unadjusted effect sizes, especially in studies with small sample sizes, and (b) there are differences in outcomes between the parametric and non-parametric effect size formulas that may affect interpretations of results.
Conclusions: The different hypothetical examples in this article clearly demonstrate the importance of
treating data in ways that minimize potential biases and the central issues of how to discuss the
meaningfulness of effect sizes in research.
© 2012 Elsevier Ltd. All rights reserved.
Keywords: Adjusted effect size; Practical significance; Statistical interpretation
Pieter, & Stark, 2009; Jacobson & Truax, 1991). Another criticism of NHST is that, for some studies, results may be a reflection of the power of the research design, which can be easily manipulated by changing sample sizes (Kirk, 1996). Henson (2006) presented an example where the p value in a randomized intervention study was .051. By adding just one person (under the conditions that the M and SD stayed the same) into each group, the p value would decrease to .049. In this case, if one judged the intervention effect based on just the p value, then the intervention is effective in the larger study (N = 18) and not effective in the smaller study (N = 16). This issue of sample-size influence on p values in experimental and correlational designs, and how there may be potential biases when discussing the value of research findings, has also been explored in sport and exercise science (e.g., Andersen & Stoové, 1998; Edvardsson, Ivarsson, & Johnson, 2012). That p values are sensitive to sample size also influences review and meta-analytic studies that are based on NHST results. For example, Hedges and Olkin (1980) highlighted vote-counting methods as one problematic approach because they often only count the number of studies that report statistically significant results when comparing means for experimental and control groups.
The bivariate Pearson correlation (r) can be calculated with the raw-score formula:

r = (NΣXY − ΣXΣY) / √[(NΣX² − (ΣX)²)(NΣY² − (ΣY)²)]
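As a quick illustration, the raw-score formula can be computed directly (a minimal Python sketch; the function name is ours, and in practice one would use a library routine such as scipy.stats.pearsonr):

```python
from math import sqrt

def pearson_r(x, y):
    """Bivariate Pearson correlation via the raw-score formula above."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return num / den

# A perfectly linear relationship yields r = 1.0
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # -> 1.0
```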
When discussing the bivariate r effect size, it is also useful to highlight two other effect sizes that can be used in correlational research: the point-biserial correlation (rpb), which should be used if one of the variables in the correlation is dichotomous, and the phi coefficient (φ or rφ), which is used in correlations with two dichotomous variables (Bonett, 2007). For the calculation formulas for these other correlations, see Fritz et al. (2012).
Probably the most common effect size when comparing the standardized mean difference between two groups is Cohen's d for independent means (Thomas, Nelson, & Silverman, 2005), which is the difference between the means of two groups divided by the pooled standard deviation of the two groups (Cohen, 1988; Nakagawa & Cuthill, 2007). This effect-size indicator is used when the aim is to compare the magnitude of difference between two conditions. One formula for calculating Cohen's d, when the distributions of both groups meet the criteria for using parametric tests and group ns are equal (Rosenthal & Rubin, 2003), is:
d = (M1 − M2) / √[(SD1² + SD2²)/2]
In the formula, M1 and M2 are the group means and SD1 and SD2 are the groups' standard deviations. Bivariate r and Cohen's d can be converted into each other with the formulas:

r = d / (d² + 4)^(1/2)

d = 2r / (1 − r²)^(1/2)
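These two conversions are straightforward to script (a minimal sketch; the function names are ours):

```python
from math import sqrt

def d_to_r(d):
    # r = d / sqrt(d^2 + 4); appropriate for equal group ns
    return d / sqrt(d * d + 4)

def r_to_d(r):
    # d = 2r / sqrt(1 - r^2)
    return 2 * r / sqrt(1 - r * r)

d = 0.80
r = d_to_r(d)
print(round(r, 3))          # -> 0.371
print(round(r_to_d(r), 3))  # -> 0.8 (the round trip recovers d)
```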
An additional group of effect sizes that has also been used is based on odds ratios (OR; odds ratio, relative risk, risk difference). The OR group of effect sizes is used to compare the relative likelihood or risk of a specific outcome between two or more groups (Ferguson, 2009). The odds ratio is calculated by the formula AD/BC, where the letters A, B, C, and D represent observed cell frequencies when a study has two groups and two possible outcomes (A = group 1/outcome 1; B = group 1/outcome 2; C = group 2/outcome 1; D = group 2/outcome 2; Nakagawa & Cuthill, 2007). The OR effect size can be converted into r by using Pearson's (1900) formula:

r = cos[π / (1 + OR^(1/2))]
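A small sketch of the OR calculation and Pearson's conversion (the 2×2 cell counts below are hypothetical):

```python
from math import cos, pi, sqrt

def odds_ratio(a, b, c, d):
    """OR = AD/BC from the 2x2 cell frequencies defined above."""
    return (a * d) / (b * c)

def or_to_r(or_value):
    """Pearson's (1900) conversion: r = cos(pi / (1 + sqrt(OR)))."""
    return cos(pi / (1 + sqrt(or_value)))

# Hypothetical 2x2 table: A = 30, B = 10, C = 15, D = 25
o = odds_ratio(30, 10, 15, 25)
print(o)                     # -> 5.0
print(round(or_to_r(o), 2))  # -> 0.56
print(or_to_r(1.0))          # an OR of 1 (no effect) gives r of ~0
```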
Our intention with this article is to highlight and discuss a few practical issues that might occur when using and interpreting effect sizes. The main purposes of this article are to: (a) investigate if there are any meaningful differences between adjusted and unadjusted effect sizes, (b) compare outcomes from parametric and nonparametric effect sizes that may affect interpretations of results, (c) discuss the importance of reporting confidence intervals for effect sizes in research, and (d) explore how to interpret effect sizes in terms of practical real-world meaning.
Adjusted vs. unadjusted effect sizes

One issue that is rarely discussed in the sport and exercise psychology literature is that any effect size can be in one of two forms: adjusted and unadjusted. The difference between these two forms is that in an adjusted effect size, the magnitude of the effect is "corrected" to allow for generalization to the population. The unadjusted effect size is sample specific and tends to be an overestimation of the population effect size (Thompson, 2006).
Thompson (2002a) listed three different design issues that will affect the potential sampling variance: (a) sample size, (b) number of variables measured, and (c) population effect size. In order to adjust for the potential sampling variance, several formulas have been suggested. All formulas have in common that it is the R² value that is adjusted. Probably the most well-known formula for adjusted R² was developed by Ezekiel (1930).¹ Wang and Thompson (2007) found that Ezekiel's formula, in comparison to other suggested formulas (e.g., Claudy's, 1978), provides a better and more reliable result (in most cases). Ezekiel's formula states that an adjusted standardized difference (d*) can be calculated from the unadjusted standardized difference (d). The steps for this calculation are: (a) convert the d into an r using the formula r = d/(d² + 4)^(1/2); (b) square the r; (c) use Ezekiel's formula to calculate the adjusted effect r²*, where r²* = r² − (1 − r²) · v/(N − v − 1) and v is the number of predictor variables; and (d) take the square root of the r²* (i.e., r*) and use this value to calculate d* with the formula d* = 2r*/(1 − r²*)^(1/2) (Thompson, 2002a). In addition, Ezekiel's formula can be used to correct bivariate r effect sizes (Wang & Thompson, 2007).
¹ Ezekiel's (1930) correction formula is used in SPSS to calculate the adjusted R².
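The four steps above can be sketched in Python (a minimal illustration; the function name is ours, with v = 1 for a simple two-group comparison; e.g., d = .50 with N = 20 shrinks to roughly .16):

```python
from math import sqrt

def adjust_d(d, n, v=1):
    """Ezekiel-adjusted standardized difference d* via steps (a)-(d) above.

    n is the total sample size; v is the number of predictor variables
    (v = 1 for a two-group comparison).
    """
    r = d / sqrt(d * d + 4)                       # (a) convert d to r
    r2 = r * r                                    # (b) square the r
    r2_adj = r2 - (1 - r2) * v / (n - v - 1)      # (c) Ezekiel's formula
    r2_adj = max(r2_adj, 0.0)                     # guard tiny effects in small samples
    r_adj = sqrt(r2_adj)                          # (d) r^2* -> r*, then r* -> d*
    return 2 * r_adj / sqrt(1 - r2_adj)

print(round(adjust_d(0.50, 20), 2))   # -> 0.16 (a 68% shrinkage from .50)
print(round(adjust_d(0.50, 160), 2))  # -> 0.47
```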
Unadjusted (d) and adjusted (d*) effect sizes for different sample sizes (N):

N     Unadj. d   Adj. d*   Difference (%)
20    .50        .16       68%
40    .50        .38       24%
80    .50        .44       12%
160   .50        .47       6%
20    .80        .65       18.75%
40    .80        .71       11.25%
80    .80        .76       5%
160   .80        .77       3.75%
Effect sizes are used to answer the question, "How big is it?" (i.e.,
what is the magnitude of the effect?; Nakagawa & Cuthill, 2007).
As stated previously in this article, there are several different types
of effect sizes. Most effect size estimates have the assumption that
the data are reasonably normally distributed. For differences
between two independent groups when nonparametric tests have
been performed, the value of the z distribution could be used to
calculate the effect size (Fritz et al., 2012). To calculate the effect size (in this case the point-biserial correlation, rpb) for some nonparametric tests, the formula rpb = z/√N can be used. In the formula, z is the z value that would be obtained from performing a Mann–Whitney or Wilcoxon test (or it could be calculated by hand), and N is the sample size in the study. To use the effect size estimate rpb to calculate the Cohen's d value, the formula d = 2rpb/(1 − rpb²)^(1/2) is used (Fritz et al., 2012). In looking through the literature, only a few studies have used this formula for nonparametric tests. Considering the low number of articles using the formula for nonparametric tests, there might be significant underestimations of effect sizes and of the subsequent interpretations of their practical significance.
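A minimal sketch of this z-based calculation (the z and N values below are hypothetical; function names are ours):

```python
from math import sqrt

def rpb_from_z(z, n):
    """Point-biserial r from the z of a Mann-Whitney or Wilcoxon test."""
    return z / sqrt(n)

def d_from_rpb(rpb):
    """d = 2*rpb / sqrt(1 - rpb^2) (Fritz et al., 2012)."""
    return 2 * rpb / sqrt(1 - rpb * rpb)

# Hypothetical: z = 2.5 from a Mann-Whitney U test, total N = 40
r = rpb_from_z(2.5, 40)
print(round(r, 2))              # -> 0.4
print(round(d_from_rpb(r), 2))  # -> 0.86
```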
To illustrate with an example, we present a fictional study with the aim of testing a preventive intervention for lowering sport injury occurrence. In the study design, two groups with equal ns exist, one intervention and one control, and the outcome variable is number of injuries per person (e.g., 0, 1, 2). The data are substantially skewed, with more than half the participants in the intervention group receiving scores of "0" injuries. The mean and SD for the intervention group are M = .40 and SD = .737, and for the control group M = .93 and SD = .799. Using these values results in a Cohen's d of .69:
d = (0.93 − 0.40) / √[(.799² + .737²)/2] = .69
If we instead use the z value from a Mann–Whitney U test (with the same data) to calculate the Cohen's d with the formula Fritz et al. (2012) suggested, the effect size will be .79.
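The parametric side of this worked example can be verified in a few lines (a sketch; the function name is ours):

```python
from math import sqrt

def cohens_d(m1, m2, sd1, sd2):
    """Cohen's d for equal-n groups: mean difference over the pooled SD."""
    return (m1 - m2) / sqrt((sd1 ** 2 + sd2 ** 2) / 2)

# The fictional injury study: control M = .93, SD = .799; intervention M = .40, SD = .737
print(round(cohens_d(0.93, 0.40, 0.799, 0.737), 2))  # -> 0.69
```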
A Spearman rs can also be converted into an estimate of Pearson's r with the formula:

r = 2 sin(π · rs / 6)

The result from this calculation, using an rs of .18, is that the estimated Pearson's r is approximately .188. The result shows a difference between the two Pearson's r coefficients, where the coefficient calculated from rs is smaller than the effect size that was directly calculated from the Pearson's r formula. In this case, however, the difference is not that large (.21 − .188 = .021).
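A one-line sketch of this conversion, reproducing the rs = .18 case:

```python
from math import sin, pi

def rs_to_r(rs):
    """Estimate Pearson's r from Spearman's rs: r = 2*sin(pi*rs/6)."""
    return 2 * sin(pi * rs / 6)

print(round(rs_to_r(0.18), 3))  # -> 0.188
```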
Interpretations of effect sizes: what is meaningful?
Even though the reporting of effect sizes has increased in the sport and exercise psychology literature, Andersen et al. (2007) have suggested that real-world interpretation of what effect sizes mean is still not common practice. To supply researchers with conventions for how to interpret effect sizes for differences between groups, Cohen (1988) suggested three categories: small (d = .20, r = .10, OR = 1.50), medium (d = .50, r = .24, OR = 3.50), and large (d = .80, r = .37, OR = 5.10). But Cohen's conventions are just that, only conventions and not hard rules. Kraemer et al. (2003) recommended that researchers should not only use these suggested categories when discussing the practical value of a study but also consider what might be a clinically or meaningfully significant effect in real-world terms. A small effect, by Cohen's conventions, might translate to outcomes that have large effects in terms of costs and benefits for the population in question. In discussing the practical value of an effect size, Vacha-Haase and Thompson (2004) recommended considering what variables are being measured as well as the context of the study.
To help researchers interpret whether a result is meaningful (e.g., clinical significance; see Thompson, 2002a), several different statistics have been developed (Fritz et al., 2012; Kraemer et al., 2003). Three examples are: confidence intervals for effect sizes (CI; Thompson, 2002b), probability of superiority (PS; Fritz et al., 2012; Grissom, 1994) combined with the common language effect size (Dunlap, 1994; McGraw & Wong, 1992), and number needed to treat (NNT; Nuovo, Melnikow, & Chang, 2002).
A CI describes the interval where most (90% or 95%) of the participants in a study are located for a specific variable (Thompson, 2002b) and can be used to interpret the results from one study in relation to results from other studies (Thompson, 2002b; Wilkinson & The Task Force on Statistical Inference, 1999). CIs for effect sizes, such as Cohen's d, can be calculated at the 95% level with the formula 95% CI = ES − 1.96se to ES + 1.96se (Nakagawa & Cuthill, 2007), where se is the asymptotic standard error of the effect size (for the 90% level, change 1.96se to 1.645se). The se value, based on the Cohen's d effect size, can be calculated with the formula:
se(d) = √[((n1 + n2 − 1)/(n1 + n2 − 3)) · (4/(n1 + n2)) · (1 + d²/8)]

For a bivariate r effect size, the asymptotic standard error is:

se(r) = 1 / √(n − 3)
If the 95% or the 90% CI for the ES does not include .0 or a negative number, then one can be fairly confident that some effect has taken place. This interpretation reflects NHST in that the effect is significant at p < .05 (for the 95% confidence interval) or p < .10 (for the 90% confidence interval).
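A minimal sketch of these CI calculations (the group sizes below are hypothetical; the function names are ours):

```python
from math import sqrt

def se_d(d, n1, n2):
    """Asymptotic standard error of Cohen's d (Nakagawa & Cuthill, 2007)."""
    n = n1 + n2
    return sqrt(((n - 1) / (n - 3)) * (4.0 / n) * (1 + d * d / 8.0))

def ci_d(d, n1, n2, z=1.96):
    """CI for d: z = 1.96 gives the 95% level, z = 1.645 the 90% level."""
    se = se_d(d, n1, n2)
    return d - z * se, d + z * se

# Hypothetical: the d = .69 injury example with 15 participants per group
lo, hi = ci_d(0.69, 15, 15)
print(round(lo, 2), round(hi, 2))  # -> -0.07 1.45 (the interval includes 0)
```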
ZCL = (X̄1 − X̄2) / √(S1² + S2²)
The proportion of the area under the normal curve that is below ZCL is the CL statistic that one can use to estimate the PS (Grissom, 1994; Grissom & Kim, 2012). As an example, one could have a study from which the value of the calculated ZCL is 1.0. For the 1.0 value, the proportion of the area under the normal curve is estimated to be .84. If the researcher has the data available, the calculation formula for PS is PS = U/(mn), where U is the Mann–Whitney statistic and m and n are the sample sizes for the two groups.
As one example of how to use the PS, let us say that we have conducted a study with the aim of increasing the participants' general subjective well-being by using a 4-week stress management intervention. The results of the study show an increase in well-being for the intervention group compared to no change for the control group, with a Cohen's d of .75, which is equal to a PS score of approximately .70 (for the formulas for calculating PS from different effect size estimates, see Grissom, 1994; Ruscio, 2008). A PS score of .70 states that if participants were sampled randomly, one from each of the groups, the one from the condition with the higher mean (in this example the intervention group) should have the higher score for 70% of the pairs. Given that subjective well-being is an important variable, and that 70% of the experimental group reported higher levels of well-being than the control group at the end of the study, these results point to the potential practical value of the intervention.
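Both routes to the PS can be sketched as follows (function names are ours; the CL route assumes normally distributed groups with equal SDs, under which ZCL reduces to d/√2):

```python
from math import erf, sqrt

def norm_cdf(z):
    """Area under the standard normal curve below z (via the error function)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def ps_from_d(d):
    """PS for two normal groups with equal SDs: Phi(d / sqrt(2))."""
    return norm_cdf(d / sqrt(2.0))

def ps_from_u(u, m, n):
    """PS = U / (m * n) from the Mann-Whitney U statistic."""
    return u / (m * n)

print(round(norm_cdf(1.0), 2))    # -> 0.84, the Z_CL = 1.0 example
print(round(ps_from_d(0.75), 2))  # -> 0.7, matching the d = .75 example
```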
A third example of an indicator developed to clarify the practical value of effect sizes is number needed to treat (NNT). The NNT score is the number of participants who must be treated to produce one more success (or one less failure) as an outcome of an intervention. The NNT indicator is primarily used in research with one dichotomous outcome variable (Ferguson, 2009). To calculate the NNT indicator, the percentage of failure cases (in decimal form) in the experimental group is subtracted from the percentage of failure cases in the control group. The score from this calculation is referred to as the risk difference (RD). One formula for calculating the NNT, using the RD score, is 1/RD. In this formula, a result of 1 is the best NNT score, indicating that the treatment is perfect (i.e., all participants in the experimental group have improved whereas no participants in the control group have; Kraemer et al., 2003). In order to illustrate the use of NNT, we will
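A minimal sketch of the NNT calculation just described (the failure rates below are hypothetical):

```python
def nnt(failure_rate_control, failure_rate_treatment):
    """NNT = 1 / RD, where RD is the risk difference in decimal form."""
    rd = failure_rate_control - failure_rate_treatment
    return 1 / rd

# Hypothetical: 50% failures in the control group vs. 25% in the treatment group
print(nnt(0.50, 0.25))  # -> 4.0, i.e., four treated per one additional success
```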
Summary
The overall aim of this article was to highlight and discuss some important issues around reporting effect sizes in sport and exercise psychology research. The different fictional research examples in this article clearly demonstrate the importance of treating data in proper ways to minimize potential biases, but also how to discuss the practical value of effect sizes in research. The fictional examples also illustrate three major points. First, our examples suggest that it is important to use adjusted effect sizes, especially for studies with small samples and large effect sizes, to avoid overestimation. Second, using parametric effect size formulas for nonparametric data will often result in possibly misleading effect sizes, with either underestimations of the effects (e.g., for data with one categorical variable) or overestimations of the effects (e.g., for correlational data). Given that our hypothetical examples showed differences both between parametric and non-parametric and between adjusted and unadjusted effect sizes, choosing the proper formula is important for interpreting results (e.g., results from a meta-analysis). For example, meta-analyses are performed to determine the mean effect size across studies (Iaffaldano & Muchinsky, 1985), and the results could be biased if the studies integrated in the analysis had positively (or negatively) biased effect sizes. For researchers conducting meta-analyses, if there is enough information to adjust unadjusted effect sizes, or to use nonparametric calculations to transform what may be biased effect sizes, then the researchers could do their own adjusting of results before effect sizes are entered into the meta-analysis. Third, the article highlights three indicators (i.e., CI, PS, NNT) that have been developed to assess effect sizes, and researchers may want to consider the advantages of reporting these indicators as complements in their discussions about how to interpret research findings.
References
American Psychological Association. (2010). Publication manual of the American Psychological Association (6th ed.). Washington, DC: Author.
Andersen, M. B., McCullagh, P., & Wilson, G. (2007). But what do the numbers really tell us? Arbitrary metrics and effect size reporting in sport psychology research. Journal of Sport & Exercise Psychology, 29, 664–672.
Andersen, M. B., & Stoové, M. A. (1998). The sanctity of p < .05 obfuscates good stuff: a comment on Kerr and Goss. Journal of Applied Sport Psychology, 10, 168–173. http://dx.doi.org/10.1080/10413209808406384.
Bishara, A. J., & Hittner, J. B. (2012). Testing the significance of a correlation with nonnormal data: comparison of Pearson, Spearman, transformation, and resampling approaches. Psychological Methods. Advance online publication. http://dx.doi.org/10.1037/a0028087.
Bonett, D. G. (2007). Transforming odds ratios into correlations for meta-analytic research. American Psychologist, 62, 254–255.
Claudy, J. G. (1978). Multiple regression and validity estimation in one sample. Applied Psychological Measurement, 2, 595–607. http://dx.doi.org/10.1177/014662167800200414.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159. http://dx.doi.org/10.1037/0033-2909.112.1.155.
Dunlap, W. P. (1994). Generalizing the common language effect size indicator to bivariate normal correlations. Psychological Bulletin, 116, 509–511. http://dx.doi.org/10.1037/0033-2909.116.3.509.
Edvardsson, A., Ivarsson, A., & Johnson, U. (2012). Is a cognitive-behavioural biofeedback intervention useful to reduce injury risk in junior football players? Journal of Sports Science and Medicine, 11, 331–338.
Ezekiel, M. (1930). The sampling variability of linear and curvilinear regressions: a first approximation to the reliability of the results secured by the graphic "successive approximation" method. The Annals of Mathematical Statistics, 1, 275–315, 317–333. http://dx.doi.org/10.1214/aoms/1177733062.
Ferguson, C. J. (2009). An effect size primer: a guide for clinicians and researchers. Professional Psychology: Research and Practice, 40, 532–538. http://dx.doi.org/10.1037/a0015808.
Finch, S., & Cumming, G. (2009). Putting research in context: understanding confidence intervals from one or more studies. Journal of Pediatric Psychology, 34, 903–916. http://dx.doi.org/10.1093/jpepsy/jsn118.
Fritz, C. O., Morris, P. E., & Richler, J. J. (2012). Effect size estimates: current use, calculations, and interpretation. Journal of Experimental Psychology: General, 141, 2–18. http://dx.doi.org/10.1037/a0024338.
Fröhlich, M., Emrich, E., Pieter, A., & Stark, R. (2009). Outcome effects and effect sizes in sport sciences. International Journal of Sports Science and Engineering, 3, 175–179.
Grissom, R. J. (1994). Probability of superior outcome of one treatment over another. Journal of Applied Psychology, 79, 314–316. http://dx.doi.org/10.1037/0021-9010.79.2.314.
Grissom, R. J., & Kim, J. J. (2012). Effect sizes for research: Univariate and multivariate applications. New York, NY: Taylor & Francis.
Hedges, L. V., & Olkin, I. (1980). Vote-counting methods in research synthesis. Psychological Bulletin, 88, 359–369. http://dx.doi.org/10.1037/0033-2909.88.2.359.
Henson, R. K. (2006). Effect-size measures and meta-analytic thinking in counseling psychology research. The Counseling Psychologist, 34, 601–629. http://dx.doi.org/10.1177/0011000005283558.
Iaffaldano, M., & Muchinsky, P. M. (1985). Job satisfaction and job performance: a meta-analysis. Psychological Bulletin, 97, 251–273.
Jacobson, N. S., & Truax, P. (1991). Clinical significance: a statistical approach to defining meaningful change in psychotherapy research. Journal of Consulting and Clinical Psychology, 59, 12–19.
Keef, S. P., & Roberts, L. A. (2004). The meta-analysis of partial effect sizes. British Journal of Mathematical and Statistical Psychology, 57, 97–129. http://dx.doi.org/10.1348/000711004849303.
Kirk, R. (1996). Practical significance: a concept whose time has come. Educational and Psychological Measurement, 56, 746–759. http://dx.doi.org/10.1177/0013164496056005002.
Kraemer, H. C., Morgan, G. A., Leech, N. L., Gliner, J. A., Vaske, J. J., & Harmon, R. J. (2003). Measures of clinical significance. Journal of the American Academy of Child and Adolescent Psychiatry, 42, 1524–1529. http://dx.doi.org/10.1097/01.chi.0000091507.46853.d1.
Leach, L. F., & Henson, R. K. (2007). The use and impact of adjusted R2 effects in published regression research. Multiple Linear Regression Viewpoints, 33, 1–11.
McGraw, K. O., & Wong, S. P. (1992). A common language effect size statistic. Psychological Bulletin, 111, 361–365. http://dx.doi.org/10.1037/0033-2909.111.2.361.
Nakagawa, S., & Cuthill, I. C. (2007). Effect size, confidence interval and statistical significance: a practical guide for biologists. Biological Reviews, 82, 591–605. http://dx.doi.org/10.1111/j.1469-185X.2007.00027.x.
Nuovo, J., Melnikow, J., & Chang, D. (2002). Reporting number needed to treat and absolute risk reduction in randomized controlled trials. Journal of the American Medical Association, 287, 2813–2814. http://dx.doi.org/10.1001/jama.287.21.2813.
Pearson, K. (1900). Mathematical contributions to the theory of evolution. VII. On the correlation of characters not quantitatively measurable. Philosophical Transactions of the Royal Society of London, Series A, 195, 1–47.
Roberts, K. J., & Henson, R. K. (2002). Correction for bias in estimating effect sizes. Educational and Psychological Measurement, 62, 241–252.
Rosenthal, R. (1991). Meta-analytic procedures for social research. Newbury Park, CA: Sage.
Rosenthal, R., & Rubin, D. B. (2003). r-equivalent: a simple effect size indicator. Psychological Methods, 8, 492–496. http://dx.doi.org/10.1037/1082-989X.8.4.492.
Thompson, B. (2002b). What future quantitative social science research could look like: confidence intervals for effect sizes. Educational Researcher, 31, 25–32. http://dx.doi.org/10.3102/0013189X031003025.
Thompson, B. (2006). Foundations of behavioral statistics: An insight-based approach. New York, NY: Guilford Press.
Vacha-Haase, T., & Thompson, B. (2004). How to estimate and interpret various effect sizes. Journal of Counseling Psychology, 51, 473–481. http://dx.doi.org/10.1037/0022-0167.51.4.473.
Wang, Z., & Thompson, B. (2007). Is the Pearson r2 biased, and if so, what is the best correction formula? Journal of Experimental Education, 75, 109–125. http://dx.doi.org/10.3200/JEXE.75.2.109-125.
Wilkinson, L., & The Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: guidelines and explanations. American Psychologist, 54, 594–604. http://dx.doi.org/10.1037/0003-066X.54.8.594.