

Cunningham JB, McCrum-Gardner E. (2007) Power, effect and sample size using GPower: practical
issues for researchers and members of research ethics committees. Evidence Based Midwifery 5(4): 132-6

Power, effect and sample size using GPower: practical issues for researchers and members of research ethics committees
Joseph B Cunningham1 MSc, BEd, RNT, RGA, RNLD, RNMH, DipN. Evelyn McCrum-Gardner2 PGCHET, CStat, MSc, BSc.
1 Lecturer in research design and methods, University of Ulster at Jordanstown, Newtownabbey BT37 0QB Northern Ireland. Email: jb.cunningham@ulster.ac.uk
2 Lecturer in health statistics, University of Ulster at Jordanstown, Newtownabbey BT37 0QB Northern Ireland. Email: ee.gardner@ulster.ac.uk

Abstract
Background. The issue of sample size has become a dominant concern for UK research ethics committees since their
reform in 2004. Sample size estimation is now a major, but often misunderstood, concern for researchers, academic supervisors and members of research ethics committees.
Aim. To enable researchers and research ethics committee members with non-statistical backgrounds to use freely
available statistical software to explore and address issues relating to sample size, effect size and power.
Method. Basic concepts are examined before utilising the statistical software package GPower to illustrate the use of
alpha level, beta level and effect size in sample size calculation. Examples involving t-tests, analysis of variance (ANOVA)
and chi-square tests are used.
Results. The examples illustrate the importance of effect and sample size in optimising the probability that a study will detect treatment effects, without requiring these effects to be massive.
Conclusions. Researchers and research ethics committee members need to be familiar with the technicalities of sample
size estimation in order to make informed judgements on sample size, power of tests and associated ethical issues. Alpha
and power levels can be pre-specified, but effect size is more problematic. GPower may be used to replicate the examples
in this paper, which may be generalised to more complex study designs.
Key words: Power, sample size, beta level, effect size, research ethics committees, GPower
Introduction
Since the introduction of a new UK Ethics Committee
Authority (UKECA) in 2004 and the setting up of the
Central Office for Research Ethics Committees (COREC),
research proposals have come under greater scrutiny than
ever before. The era of self-regulation in UK research ethics
has ended (Kerrison and Pollock, 2005). The UKECA
recognise various committees throughout the UK that can
approve proposals for research in NHS facilities (National
Patient Safety Agency, 2007), and the scope of research for
which approval must be sought is defined by the National Research Ethics Service, which has superseded COREC.
Guidance on sample size (Central Office for Research
Ethics Committees, 2007: 23) requires that the number
should be sufficient to achieve worthwhile results, but
should not be so high as to involve unnecessary recruitment
and burdens for participants. It also suggests that formal sample size estimation should be based on the primary outcome, and that if there is more than one outcome, then
the largest sample size should be chosen. Sample size is a function of three factors: the alpha level, the beta level and the magnitude of the difference (effect size) hypothesised. Referring to the expected size of effect, the COREC (2007: 23) guidance states that it is important that the difference is not unrealistically high, as this could lead to an underestimate of the required sample size.
In this paper, issues of alpha, beta and effect size will be considered from a practical perspective. A freely-available statistical software package called GPower (Buchner et al, 1997) will be used to illustrate concepts and provide practical assistance to novice researchers and members of research ethics committees. There is a wide range of freely available statistical software packages, such as PS (Dupont and Plummer, 1997) and STPLAN (Brown et al, 2000). Each has features worth exploring, but GPower was chosen because of its ease of use and the wide range of study designs for which it caters. Using GPower, sample size and power can be estimated or checked by those with relatively little technical knowledge of statistics.
Alpha and beta errors and power
Researchers begin with a research hypothesis: a hunch about the way that the world might be. For example, that treatment A is better than treatment B. There are logical reasons why this can never be demonstrated as absolutely true, but evidence that it may or may not be true can be obtained by endeavouring to show that there is no difference in the outcomes of treatments A and B. The statement that can then be tested is 'there is no difference in the outcomes of A and B', and this is called the null hypothesis. The researcher wants to be able to reject the null hypothesis and to show that differences in the treatments' outcomes are not due to chance.
When a hypothesis is specified, researchers also state the level of significance at which they will reject the null hypothesis. The minimum level for rejection is usually p=0.05, and this is the probability of making a Type 1 error: rejecting the null hypothesis when it is in fact true.


In other words, the probability of a Type 1 error is the significance level of the test, denoted by alpha. If this threshold is decreased to p=0.01, the risk of a Type 1 error is reduced, but this also increases the probability of making a Type 2 error: failure to reject the null hypothesis when it is false. This is the challenge faced by researchers, and the principal factor influencing it is sample size. A large sample size could give a statistically significant result even when the effect size of the difference is negligible; statistical significance can be bought by having a large number of cases. Lenth (2001: 2) stated: 'It is just as important, however, that the study not be too big, where an effect of little scientific importance is nevertheless statistically detectable.'
Beta errors, or Type 2 errors, are less frequently discussed than Type 1 errors. The maximum probability usually accepted for a Type 2 error is 0.20, at which the chance of not rejecting a null hypothesis that is false is two in ten. The power of a test is found by subtracting the probability of a Type 2 error from one: it is the probability of finding a difference of the magnitude specified, if indeed such a difference exists. If the power is set at 0.80, there is an eight in ten chance of detecting a difference of the effect size specified, if such a difference is present. At a given alpha level, the power of the test is increased by achieving a larger sample size.
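This relationship can be sketched in code. As an illustration only (the paper itself uses GPower), the snippet below assumes Python with the statsmodels library and shows power climbing with sample size for a fixed moderate effect size and alpha level:

```python
# Illustrative sketch (assumes Python + statsmodels; the paper uses GPower).
# Power of a two-sided independent-samples t-test as sample size grows,
# holding the effect size (d=0.5) and alpha level (0.05) fixed.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for n_per_group in (20, 40, 64, 100):
    power = analysis.power(effect_size=0.5, nobs1=n_per_group,
                           alpha=0.05, ratio=1.0, alternative='two-sided')
    print(f"n per group = {n_per_group:3d} -> power = {power:.2f}")
```

At d=0.5, power passes the conventional 0.80 threshold at roughly 64 cases per group, consistent with the 128-case total shown later in Table 2.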
Ethics committees would be concerned about both underpowered studies (too few cases) and overpowered studies (too many cases), but Bacchetti et al (2005: 108) suggest that: 'The continuing conduct of underpowered studies is not the dreadful moral lapse lamented by some writers.' Overpowered studies are considered by Bacchetti et al (2005) to be a more legitimate area of ethical concern.
Why does alpha have a criterion level of p=0.05? There is nothing technically significant about this level: it is based on judgement, and as Field (2005: 25) points out: 'It forms the basis of modern statistics and yet there is little justification for it other than Fisher said so.' Similarly, a beta level of 0.20 is an arbitrary figure based on judgement. Since a Type 2 error is considered less serious than a Type 1 error, a two in ten (0.20) chance of such an error appears to be more acceptable.
As alpha and beta levels of probability have accepted minimum criterion levels, they present relatively little difficulty when sample size is estimated. However, effect size, which is also a factor in sample size estimation, is more problematic.

Effect size
Effect size is a way of quantifying the difference between two or more groups, or a measure of the difference in the outcomes of the experimental and control groups. For example, if one group receives a new treatment and the other does not (the control group), then the effect size is a measure of the effectiveness of the treatment.
Just because a result is statistically significant does not mean it is substantive in effect. For example, two treatments could be shown to be significantly different, but their clinical effects may be so small as to be unimportant. The effect size can be adjusted by the researcher in terms of whether a small, medium or large effect size is expected. But therein lies the problem: the COREC guidelines recommend that the effect size must not be set too high, as this reduces the sample size estimate and so increases the probability of a Type 2 error. Often, effect size is a matter of clinical or research judgement. If a small effect size is anticipated, then the sample size needed to detect it will be larger than that for a moderate or large effect size.
When researchers are asked the question 'What effect size do you expect?', the response is often one of bewilderment. Lenth (2001: 5) suggests the opening question: 'What results do you expect (or hope to see)?' He also points out that statisticians cannot answer this question alone: the answer requires both the technical skills of the statistician and the scientific knowledge of the researcher. A useful discussion of effect size and the related issues of sample size, alpha and beta levels of probability is provided by Devane et al (2004).
This poses a dilemma for researchers, statisticians and ethics committees. Unless an effect size can be specified and justified, subsequent decisions on the ethical issues of sample size may be difficult to resolve. Sample size considerations cannot be evaluated by ethics committees unless the judgements made to calculate effect size are first elaborated.
Cohen (1988) has set out standardised measures of effect size; this was necessary because different metrics may be used in different tests. He proposed a simple categorisation of small, moderate and large effect sizes, which Altman (1981: 1337) called 'a new simple method'. Common standardised effect sizes have been calculated here for three commonly used tests, based on an alpha probability of 0.05 and power of 0.80 (see Table 1).
Table 1. Standardised effect sizes for common statistical tests, alpha=0.05, power=0.8

Statistical test (effect size symbol)   Small effect   Medium effect   Large effect
Student's t-test (d)                    0.20           0.50            0.80
ANOVA fixed effects, one way (f)        0.10           0.25            0.40
Chi-square goodness of fit (w)          0.10           0.30            0.50

Using standardised effect size simplifies sample size estimation, but it should not replace the need for sound
judgements on why a specific effect size is chosen.
At the risk of reductionism, sample size can be estimated from specified alpha and beta levels, and from the
specification of whether a small, moderate or large effect
size is anticipated.


Sample size estimates have been calculated here (see Table 2) for an independent sample t-test, ANOVA (four group means) and chi-square test, with the probability of alpha errors at 0.05, and powers of 0.80 and 0.90. All tests are two-tailed and assume equal numbers in each group, and the estimates of sample size were calculated using GPower.
Table 2. Total sample size estimates for small, moderate and large effect sizes, alpha=0.05

Test (alpha=0.05)                Power (1-beta)   Small   Moderate   Large
Independent sample t-test        0.80              788     128        52
                                 0.90             1054     172        68
ANOVA (four group means)         0.80             1096     180        76
                                 0.90             1424     232        96
Chi-square (4 df, two-by-five)   0.80             1194     133        48
                                 0.90             1541     172        62

Using GPower to estimate sample size


GPower version three (Buchner et al, 1997) is an excellent freeware program that allows high-precision power and sample size analyses. It computes power values for given sample sizes, effect sizes and alpha levels (post hoc power analyses); sample sizes for given effect sizes, alpha levels and power values (a priori power analyses); and alpha and beta values for given sample sizes, effect sizes and beta/alpha ratios (compromise power analyses).
The sample sizes in this paper's examples can be checked using the GPower program, and simple examples are given so that the process can be readily understood. For more complex design models in GPower, advice is provided by Buchner et al (1997) and Camacho-Sandoval (2007). The COREC (2007: 23) guidelines ask that sufficient information be given to allow the research ethics committee to reproduce the calculation, and the use of GPower can assist this process.
Examples
All of the tests that follow are based on two-tailed
hypotheses with equal numbers in each group. To illustrate sample size analysis using GPower, specific problems
have been selected.
t-tests
A researcher wishes to estimate the sample size required to test the null hypothesis of no difference between the mean scores of two independent groups (two-tailed t-test). They set the alpha level probability as p=0.05 and the beta level probability as 0.20. As the power of the test is one minus the beta level, the power is 0.80. The researcher is not experienced enough to ascertain the effect size, so, using GPower, they estimate the sample sizes required for small, moderate and large effect sizes. For a small effect size (d=0.20), they will need a total sample size of 788, with an equal number (394) in each group. For a moderate effect size (d=0.50), a total sample size of 128 would be required. A large effect size (d=0.80) reduces the total sample size required to 52. This problem highlights the need for sound judgements on effect size: a researcher submitting to an ethics committee could choose a sample size of 52, 128 or 788 subjects. How does an ethics committee assess the correctness of the sample size?
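One way for a committee to check such figures is to recompute them. A minimal sketch, assuming Python with the statsmodels library as a stand-in for GPower, reproduces the three totals above:

```python
# Sketch: a priori sample sizes for a two-sided independent-samples t-test,
# alpha=0.05, power=0.80, for Cohen's small, moderate and large d.
from math import ceil
from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()
for d in (0.20, 0.50, 0.80):
    n_per_group = ceil(solver.solve_power(effect_size=d, alpha=0.05,
                                          power=0.80, ratio=1.0,
                                          alternative='two-sided'))
    print(f"d = {d:.2f}: {n_per_group} per group, {2 * n_per_group} in total")
```

This should print totals of 788, 128 and 52, matching the GPower figures quoted above.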
A more specific example will demonstrate how the effect size is calculated. A research group interested in the healing effects of honey applied to wounds (experimental group) wants to know the difference it will make compared to a conventional treatment (control group). They know from previous studies that the conventional treatment results in a mean wound improvement of 10mm over the period of study. In their judgement as clinicians, they consider an improvement in healing of 1mm greater than conventional treatment (mean=11mm) to be the minimum expected improvement. The pooled standard deviation* of the two groups is 3mm.
In this example, the effect size is $d = (\bar{y} - \bar{x})/s = (11 - 10)/3 = 0.33$, where $\bar{y}$ is the mean improvement score for the experimental group, $\bar{x}$ is the mean improvement for the conventional treatment and $s$ is the pooled standard deviation of the two groups. Using GPower, it is estimated that 146 cases would be needed in each of the control and experimental groups if the alpha level is p=0.05 and the power is 0.80.
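The same sketch, again assuming statsmodels rather than GPower, can check the honey example; because d is rounded to two decimal places here, the result should fall within a case or so of the 146 quoted:

```python
# Sketch: per-group sample size for the honey-wound example.
from math import ceil
from statsmodels.stats.power import TTestIndPower

d = round((11 - 10) / 3, 2)  # minimum worthwhile difference / pooled SD = 0.33
n_per_group = TTestIndPower().solve_power(effect_size=d, alpha=0.05, power=0.80)
print(f"d = {d}: about {ceil(n_per_group)} cases per group")
```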
Using GPower, sample size may be plotted against
power for selected effect sizes and alpha levels (see
Figure 1). Calculating sample sizes for three effect sizes
and a power ranging from 0.60 to 0.95, it can be seen that
for a small effect size (0.20), a total of approximately 800
cases would be required, assuming an alpha level of
p=0.05 and a power of 0.80.
In a similar example, Altman (1981: 1338) compares milk-feeding effects on the growth of five-year-olds.
*To illustrate pooled variance: to evaluate a new treatment (group 1) against a conventional treatment (group 2), researchers want to estimate the required sample size and find the following normative data from two previous studies:

n1 = 30, variance1 = 9.2, standard deviation1 = 3.03
n2 = 30, variance2 = 8.8, standard deviation2 = 2.96

The pooled variance is

$$s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 - 1) + (n_2 - 1)}$$

where $n_1$ is the number in group 1, $n_2$ the number in group 2, $s_1^2$ the variance in group 1, and $s_2^2$ the variance in group 2. The square root of $s^2$ is the combined standard deviation estimate. Using the above values, the pooled variance is calculated to be 9.0 and the estimated standard deviation 3.0. An approximate estimate of pooled variance is the mean of the two variances; with equal group sizes, as here, this is exact.
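For readers who prefer to verify the footnote numerically, here is a small sketch of the same calculation (Python assumed, as in the other examples):

```python
# Sketch: pooled variance of two groups from the normative data above.
def pooled_variance(n1, var1, n2, var2):
    # ((n1 - 1)s1^2 + (n2 - 1)s2^2) / ((n1 - 1) + (n2 - 1))
    return ((n1 - 1) * var1 + (n2 - 1) * var2) / ((n1 - 1) + (n2 - 1))

s2 = pooled_variance(30, 9.2, 30, 8.8)
print(f"pooled variance = {s2:.2f}, pooled SD = {s2 ** 0.5:.2f}")
# With equal group sizes the pooled variance is exactly the mean of the
# two variances: (9.2 + 8.8) / 2 = 9.0, giving a pooled SD of 3.0.
```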


[Figure 1. Effect sizes (d), power and total sample size for an independent sample t-test, alpha=0.05, equal sample sizes (graph drawn by GPower). The chart plots total sample size (roughly 200 to 1200) against power (0.6 to 0.95) for effect sizes d=0.2, d=0.25 and d=0.3: two-tailed test, allocation ratio N2/N1=1, alpha error probability 0.05.]

He also points out that it is 'surely ethically indefensible to carry out a study with only a small chance of detecting a treatment effect unless it is a massive one, and with a consequently high probability of failure to detect an important therapeutic effect'.
Analysis of variance, F-tests
In a study of four independent groups, the significance of the differences between four sample means is being evaluated using ANOVA. This type of design is known as a fixed-effects single-factor design. The alpha level is p=0.05, the probability of a beta error is 0.20 and the power is 0.80. Using GPower's fixed-effects single-factor calculation, the total sample sizes for small, moderate and large effect sizes are 1096, 180 and 76 subjects respectively. Thus, 274 subjects would be needed in each of the four groups for a small effect size, 45 for a moderate effect size and 19 for a large effect size.
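These totals can be checked with the same assumed statsmodels stand-in; small discrepancies from GPower can arise because GPower rounds group sizes up to whole subjects:

```python
# Sketch: total sample size for a one-way fixed-effects ANOVA with four
# groups, alpha=0.05, power=0.80, for Cohen's small, moderate and large f.
from math import ceil
from statsmodels.stats.power import FTestAnovaPower

solver = FTestAnovaPower()
for f in (0.10, 0.25, 0.40):
    n_total = solver.solve_power(effect_size=f, alpha=0.05,
                                 power=0.80, k_groups=4)
    print(f"f = {f:.2f}: about {ceil(n_total)} subjects in total")
```

The printed totals should land close to the 1096, 180 and 76 quoted above.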
To take a practical example, a researcher investigating three treatments suspects that the population means will be 250, 300 and 350 units, with an overall mean of 300, and wishes to estimate the effect size of these differences. Cohen's effect size (f) can be calculated using the formula:

$$f = \sqrt{\frac{\sum_{j=1}^{k}(\mu_j - \mu)^2}{k\,\sigma_{\text{error}}^2}}$$

where $\mu_j$ is the population mean for an individual group, $\mu$ is the overall mean, $k$ is the number of groups and the error variance ($\sigma_{\text{error}}^2$) is the within-group variance. In this instance, the within-group standard deviation was 75 units. Applying the formula:

$$f = \sqrt{\frac{(250-300)^2 + (300-300)^2 + (350-300)^2}{3(75)^2}} = \sqrt{\frac{2500 + 0 + 2500}{3(75)^2}} = 0.54$$

This represents a large effect size, and when the sample size is estimated using GPower, a total of 39 cases is found to be required: 13 in each group. GPower can also calculate the effect size directly from the means.
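The effect size calculation itself is easy to script. The sketch below (statsmodels assumed again) derives f from the hypothesised means and then solves for the total sample size:

```python
# Sketch: Cohen's f from hypothesised group means, then the total N.
from math import ceil, sqrt
from statsmodels.stats.power import FTestAnovaPower

means = [250, 300, 350]          # hypothesised population means
grand = sum(means) / len(means)  # overall mean = 300
sigma_error = 75                 # within-group standard deviation

# f = sqrt( sum_j (mu_j - mu)^2 / k ) / sigma_error
f = sqrt(sum((m - grand) ** 2 for m in means) / len(means)) / sigma_error
n_total = FTestAnovaPower().solve_power(effect_size=f, alpha=0.05,
                                        power=0.80, k_groups=3)
print(f"f = {f:.2f}, total sample size about {ceil(n_total)}")
```

The result should be close to the 39 cases (13 per group) quoted; exact totals can differ slightly between programs.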
This again shows the impact of effect size on sample size estimation, illustrating that a principal challenge for researchers and ethics committees alike is the issue of expected effect size.
Chi-square test (goodness of fit)
In this example, a researcher plans to cross-tabulate responses from a five-point Likert item by gender. This would produce a two-by-five table with four degrees of freedom. They wish to reject the null hypothesis of independence at p=0.05, with a power of 0.80. Using GPower, they find that the totals needed for small, moderate and large effect sizes are 1194, 133 and 48 cases respectively.
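A sketch of the same a priori calculation, with statsmodels assumed in place of GPower; its goodness-of-fit power class takes the number of cells (n_bins), where the degrees of freedom are n_bins minus one, so the four degrees of freedom of a two-by-five table correspond to n_bins=5:

```python
# Sketch: total sample size for a chi-square test with four degrees of
# freedom (two-by-five table), alpha=0.05, power=0.80.
from math import ceil
from statsmodels.stats.power import GofChisquarePower

solver = GofChisquarePower()
for w in (0.10, 0.30, 0.50):  # Cohen's small, moderate, large w
    n_total = solver.solve_power(effect_size=w, alpha=0.05,
                                 power=0.80, n_bins=5)
    print(f"w = {w:.2f}: about {ceil(n_total)} cases in total")
```

This should reproduce the 1194, 133 and 48 quoted above.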

A more concrete example will demonstrate how the effect size is calculated. An experiment investigates the efficacy of two treatments in reducing anxiety. Asked what proportion of the conventional therapy (control) group will have a reduced anxiety level, the researcher knows from past studies that this will be 0.35 (0.175/0.50). They expect that the proportion in the new treatment (experimental) group with reduced anxiety levels will be 0.65 (0.325/0.50). The denominator 0.50 is used because the experimental group represents half the sample. The proportions expected under the research hypothesis (the minimum expected benefits of the experimental treatment) may then be tabulated (see Table 3). The researcher needs to know the standardised effect size of this difference, as well as the sample size needed to test this hypothesis.

Table 3. Judgement on a priori effect size for two treatments

Outcome                 Control   Experiment   Outcome marginal total
Positively beneficial   0.175     0.325        0.5
No benefit              0.325     0.175        0.5
Group marginal total    0.50      0.50         1.0

Effect size (w) can be calculated using the formula:

$$w = \sqrt{\sum_{i=1}^{k}\frac{(P_{1i} - P_{0i})^2}{P_{0i}}}$$

where $k$ is the number of cells (here, $k=4$), $P_{0i}$ is the population proportion in cell $i$ under the null hypothesis and $P_{1i}$ is the population proportion in cell $i$ under the alternative hypothesis. $P_{0i}$ for each cell is calculated by multiplying the row marginal for the cell by the column marginal for the respective cell, and dividing by the sum of probabilities for all cells, which is always one. Thus, for cell 1,1, $P_{0i} = 0.50 \times 0.50/1 = 0.25$; indeed, $P_{0i}$ is 0.25 for all cells. Applying the formula:

$$w = \sqrt{\frac{(0.175-0.25)^2}{0.25} + \frac{(0.325-0.25)^2}{0.25} + \frac{(0.325-0.25)^2}{0.25} + \frac{(0.175-0.25)^2}{0.25}} = 0.30$$


The effect size is 0.30 (a moderate effect size). Using GPower, the total sample size is estimated to be 88 (44 in each group), with an alpha level of p=0.05 and a power of 0.80, using the chi-square test with one degree of freedom.
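The w calculation and the final sample size can both be scripted. The sketch below (statsmodels assumed) computes w from the cells of Table 3 and then solves for the total, using n_bins=2 to obtain one degree of freedom:

```python
# Sketch: Cohen's w from the proportions in Table 3, then the total N
# for a chi-square test with one degree of freedom (n_bins = df + 1 = 2).
from math import ceil, sqrt
from statsmodels.stats.power import GofChisquarePower

p1 = [0.175, 0.325, 0.325, 0.175]   # cell proportions, research hypothesis
p0 = [0.25, 0.25, 0.25, 0.25]       # cell proportions, null hypothesis

w = sqrt(sum((a - b) ** 2 / b for a, b in zip(p1, p0)))  # w = 0.30
n_total = GofChisquarePower().solve_power(effect_size=w, alpha=0.05,
                                          power=0.80, n_bins=2)
print(f"w = {w:.2f}, total sample size about {ceil(n_total)}")
```

This should agree with the 88 cases quoted above.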
Discussion
Estimating sample size at the proposal stage of the research process can be idealised. Assumptions can be built into estimates, but the actual reality may be different: 'the best laid schemes o' mice an' men / Gang aft agley [often go astray]' (Burns, 1785). In the examples shown, a priori assumptions have been acted on: levels have been specified in advance of data collection, and numbers of cases have been based on equal sizes per group. Sound judgements should also have been made about the expected effect size.
However, the reality is often different from that expected during the planning stages. Major difficulties could include a low response rate, dropouts or mortality, any of which may unbalance the equal-sized groups on which the study is based. Since equal numbers per group are a basis of the a priori assumptions, this assumption may be violated.
For example, if an independent samples t-test is being used to detect a moderate effect size at an alpha level of p=0.05 and a power of 0.90, a total sample size of 172 would be required (86 per group). If, after the data have been collected, 86 cases are found in one group but only 40 in the other, the power of the test will have been reduced. But what is the power for such numbers of cases? GPower can estimate this using a post hoc analysis: when the data are entered, the power of the test is found to have been reduced to 0.74. There is therefore an approximately seven-in-ten chance of detecting moderate effect size differences, if indeed such differences exist.
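A post hoc check of this kind is a one-liner in the assumed statsmodels stand-in, where ratio is the size of the second group relative to the first:

```python
# Sketch: post hoc power for a moderate effect (d=0.5) when the groups
# ended up as 86 and 40 cases instead of the planned 86 and 86.
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower().power(effect_size=0.5, nobs1=86, alpha=0.05,
                              ratio=40 / 86, alternative='two-sided')
print(f"achieved power = {power:.2f}")  # approximately 0.74, as quoted
```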

Of course, after ethical approval has been obtained, these difficulties in meeting study objectives are often no longer considered. The response rate needs to be estimated in advance and a larger sample size selected accordingly.
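The inflation itself is simple arithmetic. A sketch follows, with the 70% response rate chosen purely as an assumed figure for illustration:

```python
# Sketch: inflating a required sample size for an anticipated response rate.
# The 0.70 response rate is an assumed illustrative figure, not from the paper.
from math import ceil

n_required = 172       # usable cases needed (moderate d, power 0.90, as above)
response_rate = 0.70   # assumed proportion who respond and complete the study

print(f"recruit about {ceil(n_required / response_rate)} "
      f"to end with {n_required} usable cases")
```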
Making judgements on the ethics of sample size requires a basic knowledge of the fundamental concepts of hypothesis testing and power. The one factor that needs most elaboration is effect size. Manipulation of sample size could result in ethical approval for studies that wrongly specify a moderate or large effect size, when the effect size suggested in clinical or biological terms may actually be small. With the increasing use of programs such as GPower, it should become easier for researchers and ethics committee members to evaluate claims of sample size adequacy. The major issue they will face in doing so will be the quality of the judgements upon which effect size is based.
Conclusions
Researchers and research ethics committee members must be familiar with the technicalities of sample size estimation if they are to make informed judgements on sample size, power and related ethical issues. Alpha and power levels can be pre-specified, but effect size is more problematic. Readers are invited to use GPower to check the sample sizes that illustrate this paper, and to further explore the features of this or similar programs. In doing so, they may gain useful practical insights into the balancing act of sample size and power analysis. Researchers can use programs such as GPower to estimate sample size for ethics committees, but as with all mathematically-based perspectives on the world, the model is often a poor fit for reality. Clinical judgements are the most important element in this process, even though such judgements cannot be demonstrated objectively.

References
Altman DG. (1981) Statistics and ethics in medical research: how large a sample? Br Med J 281(6251): 1336-8.
Bacchetti P, Wolf LE, Segal MR, McCulloch CE. (2005) Ethics and sample size. Am J Epidemiol 161(2): 105-10.
Brown BW, Brauner C, Chan A, Gutierrez D, Herson J, Lovato J, Polsley J, Russell K, Venier J. (2000) BAM software download site: STPLAN software. The University of Texas MD Anderson Center: Houston, Texas. See: http://biostatistics.mdanderson.org/SoftwareDownload/SingleSoftware.aspx?Software_Id=41 (accessed 18 September 2007).
Buchner A, Erdfelder E, Faul F. (1997) How to use GPower. Heinrich-Heine-Universität: Düsseldorf. See: www.psycho.uni-duesseldorf.de/aap/projects/gpower/how_to_use_gpower.html (accessed 20 November 2007).
Burns R. (1785) To a mouse: In: Noble A, Hogg P. (Eds.). (2003) The Canongate Burns: the complete poems and songs of Robert Burns. Canongate: Edinburgh.
Camacho-Sandoval J. (2007) GPower tutorial. Heinrich-Heine-Universität: Düsseldorf. See: www.psycho.uni-duesseldorf.de/aap/projects/gpower/gpower-tutorial.pdf (accessed 18 September 2007).
Central Office for Research Ethics Committees. (2007) Question-specific guidance on NHS research ethics committee application. See: http://83.138.142.202/corecformDocs/Question-Specific_Guidance.pdf (accessed 11 June 2007).
Cohen J. (1988) Statistical power analysis for the behavioural sciences (second edition). Lawrence Erlbaum Associates: Hillsdale, New Jersey.
Devane D, Begley C, Clarke M. (2004) How many do I need? Basic principles of sample size estimation. JAN 47(3): 297-302.
Dupont WD, Plummer WD. (1997) PS power and sample size program available for free on the internet. Controlled Clin Trials 18(3): 274.
Field A. (2005) Discovering statistics using SPSS. Sage: London.
Kerrison S, Pollock AM. (2005) The reform of UK research ethics committees: throwing the baby out with the bath water? J Medical Ethics 31(8): 487-9.
Lenth RV. (2001) Some practical guidelines for effective sample-size determination. University of Iowa: Iowa City, Iowa. See: www.stat.uiowa.edu/techrep/tr303.pdf (accessed 18 September 2007).
National Patient Safety Agency. (2007) National Research Ethics Service: UK Ethics Committee Authority-recognised research ethics committees. See: www.nres.npsa.nhs.uk/applicants/help/contacts/recsrecognised.htm (accessed 18 September 2007).