
Chapter 19

Statistics in Experimental Stroke Research: From Sample Size Calculation to Data Description and Significance Testing
Ulrich Dirnagl

Abstract
Experimental stroke researchers take samples from populations (e.g., certain mouse strains) and make
inferences about unknown parameters (e.g., infarct sizes, outcomes). They use statistics to describe their
data, and they seek formal ways to decide whether their hypotheses are true (e.g., "Compound X is a neuroprotectant"). Unfortunately, experimental stroke research at present lacks statistical rigor in the design and analysis of its experiments, and this may have negative consequences for its predictiveness. This chapter aims to give a general introduction to the dos and don'ts of statistical analysis in experimental stroke research.
In particular, we will discuss how to design an experimental series and calculate necessary sample sizes, how
to describe data with graphics and numbers, and how to apply and interpret formal tests for statistical
significance. A surprising conclusion may be that there are no formal ways of deciding whether a hypoth-
esis is correct or not and that we should focus instead on biological (or clinical) significance as measured
in the size of an effect and on the implications of this effect for the biological system or organism. Good
evidence that a hypothesized effect is real comes from replication across multiple studies; it cannot be
inferred from the result of a single statistical test!

Key words Box plot, Categorical data, Confidence interval, Descriptive statistics, Null hypothesis,
Power, p-Value, Sample size, Scatter plot, Statistical significance, Type I/II error, Wilcoxon-Mann-
Whitney test

1 Introduction

Experimental stroke researchers produce data: infarct volumes, cell


counts, functional outcomes, and such. Collecting data means sam-
pling from a population. Since the population (e.g., all SV 129 mice) cannot be studied in its entirety, we collect a sample to represent it, and we use our sample (e.g., 20 SV 129 mice) to make
inferences about unknown parameters of the population: infarct
sizes, the effect of treatment with Compound X, etc. Descriptive
statistics help us to do just this. But very often stroke researchers also
want an objective method to decide whether an observation from


a sample justifies accepting a hypothesis about the population. In


other words, scientists want to test hypotheses and make decisions!
In statistical terms, asking whether Compound X reduces infarct
sizes is equivalent to asking whether the two samples (control vs.
Compound X) come from the same population. Unfortunately,
there is a certain likelihood that our decision might be wrong, as our
test statistic could result in a false positive or a false negative. To get
a handle on these errors, we calculate p-values, set significance levels, and consider statistical power. P-values, and the p = 0.05 threshold in particular, have become the
graven image of modern biomedicine. Most stroke researchers
believe that the p-value of a significance test is the probability that
the research results are due to chance (i.e., they confuse it with the
positive predictive value, PPV) and are completely unaware of the
concept of statistical power. It is the overall aim of this chapter to
guide the preclinical stroke researcher in the proper experimental
planning and statistical evaluation of data. Before presenting an
example of a mock study, from hypothesis to significance testing, a
few concepts need to be clarified. The following section should serve
as a primer in statistics for the experimental stroke researcher.

2 Concepts

2.1 Descriptive Statistics

Types of Data in Stroke Research
Data might be numerical, such as infarct volume data, or categorical, such as functional score data.
1. Numerical variables tell us "how much?" and "how many?" and can be either continuous (e.g., infarct volumes) or discrete (e.g., hemorrhagic transformation count). Numerical variables can be treated with mathematical operations. Numerical summaries of quantitative data can be used to describe the location (e.g., mean) and the spread (e.g., variance) of the data, and statistical analysis may use parametric tests (e.g., t-test, ANOVA), provided the data is normally distributed.
2. Categorical variables tell us "what type?" and can be nominal (unordered, e.g., male, female) or ordinal (ordered, e.g., Bederson outcome score: 0, 1, 2, 3). A categorical variable places an individual into one of several categories. Mathematical operations make no sense with them. They can be numerically summarized with a count or with relative frequencies. Categorical variables can be visualized with dot plots or bar charts; for group comparisons, frequency tables and the chi-square test are used if they are unordered, and the nonparametric Wilcoxon-Mann-Whitney test (also known as the Mann-Whitney U test) is used if they are unpaired and ordered.

A very common error in preclinical stroke research is the use of


parametric statistics (mean, standard deviation (SD), t-test, etc.)
on categorical data, for example, the use of the Bederson score [1]
to assess global neurological function or a score to grade histologi-
cal damage in the hippocampus [2]. As a consequence, many arti-
cles on experimental cerebral ischemia contain faulty statistics and
may have reached unfounded conclusions.

Measures of Variance
These describe the variability of observations and serve as a useful basis for describing data in terms of probability: standard deviation (SD), standard error of the mean (SEM), confidence intervals (CI), etc. Use SDs to describe the spread of your data.
An extremely useful but regrettably underused measure of variance in the experimental stroke literature is the confidence interval (CI; e.g., the 95 % CI is approximately the mean ± 2 standard errors of the mean (SEM) if n > 30). Its virtue lies in its straightforward interpretation: if we were to repeat the experiment a great many times, calculating an interval each time, 95 % of those intervals would contain the true population mean. Confidence intervals thus directly reflect the precision of the measurement (Fig. 1b, d).
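A minimal sketch, in Python and with invented infarct volumes, of how these quantities relate to each other; the numbers are purely illustrative:

import numpy as np
from scipy import stats

# Hypothetical infarct volumes in mm3 from one group (invented for illustration)
volumes = np.array([112.0, 98.5, 130.2, 121.7, 104.3, 141.9, 117.6, 109.8])

mean = volumes.mean()
sd = volumes.std(ddof=1)            # sample SD: describes the spread of the data
sem = sd / np.sqrt(len(volumes))    # SEM: describes the precision of the mean estimate

# 95 % confidence interval based on the t distribution (appropriate for small n);
# for n > 30 this approaches the "mean +/- 2 SEM" rule of thumb quoted above
ci_low, ci_high = stats.t.interval(0.95, len(volumes) - 1, loc=mean, scale=sem)

print(f"mean = {mean:.1f} mm3, SD = {sd:.1f}, SEM = {sem:.1f}")
print(f"95 % CI: {ci_low:.1f} to {ci_high:.1f} mm3")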
The SEM is an estimate of the SD that would be obtained
from the means of a large number of samples drawn from that
population. In other words, a SEM is a measure of the precision of
an estimate of a population parameter, but it should not be used to
describe the variability of the data. SEMs have become popular as
a means of data description since they lead to small whiskers
around the mean and are enticingly suggestive of a low variability
of the data. It is for good reason, however, that most journals ban the use of SEMs for data description in their instructions to authors.

Fig. 1 Examples of noninformative (a), acceptable (b), and optimal (c, d) graphical displays of numerical infarct volume data in stroke research. (a) Typical but uninformative display as bar graphs of means + SEM. (b) Same data, shown as means with 95 % confidence intervals. (c) Same data, displayed as a dot plot of the raw data, overlaid by a box-and-whisker plot (median, first and third quartiles, range). (d) Same data, displayed as a dot plot of the raw data, overlaid by the mean ± 95 % confidence interval. Note that the bar graph in (a) obscures the dispersion of the data, while (c) and (d) allow the viewer a complete assessment

Unfortunately, statistical consultants flooded with papers ignoring this rule seem to have given up enforcing it. Nevertheless, don't use SEMs to quantify the variance of your stroke data!

Graphing Stroke Data


At the heart of most experimental stroke papers are graphs which
summarize outcome in various groups (Fig. 1a). Although time
honored, this practice is actually not recommended. For one, SEMs
should not be used for data presentation (see above). Secondly, dis-
playing only the mean is the least informative option available.
Rather, whenever possible a graph should include each data point as
a scatter plot (see Fig. 1c, d). This is the most informative way of
displaying data. The scatter plot should be accompanied by a graph
displaying summary statistics and valid measures of variance. One
very useful graph is the box-and-whisker plot, which is very helpful in interpreting the distribution of the data (see Fig. 1c). A box-and-
whisker plot provides a graphical summary of a set of data based on
the quartiles of that data set: quartiles are used to split the data into
four groups, each containing 25 % of the measurements.
Unfortunately, Microsoft Excel does not have a built-in box-and-
whisker plot, but practically all scientific graphing or statistics pack-
ages do. Various web pages show how to make box-and-whisker
plots in Excel, e.g., [3]. Alternatively, the mean ± 95 % confidence interval can be plotted in or next to the scatter plot (Fig. 1d).
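As an illustration, a minimal matplotlib sketch (with invented data, not taken from the chapter) of the recommended display: the raw data as a dot plot overlaid with a box-and-whisker plot, as in Fig. 1c:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Hypothetical infarct volumes in mm3, invented for illustration
compx = rng.normal(100, 25, size=10)
control = rng.normal(130, 25, size=10)

fig, ax = plt.subplots()

# Box-and-whisker plot: median, quartiles, and range
ax.boxplot([compx, control], positions=[1, 2], widths=0.5, showfliers=False)

# Overlay every raw data point as a slightly jittered dot plot
for pos, data in zip([1, 2], [compx, control]):
    jitter = rng.normal(pos, 0.04, size=len(data))
    ax.plot(jitter, data, "o", alpha=0.6)

ax.set_xticks([1, 2])
ax.set_xticklabels(["CompX", "control"])
ax.set_ylabel("Infarct volume (mm3)")
plt.show()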
Completely unacceptable is the use of bar graphs displaying means ± SDs or SEMs for categorical data (see above) (Fig. 2a). To display categorical data, use scatter plots or stacked bar graphs instead (see Fig. 2). Why is it incorrect to represent categorical variables by a mean ± a measure of variance? Categorical variables may be ordered, but the intervals between values are not equally spaced, which confounds mathematical operations. As a consequence, interim values between categories normally do not have a meaning. For example, in the modified Bederson score [1], category 1 is given for weakness of the contralateral forepaw, and category 2 for circling. What would be the meaning of a score of 1.6 ± 0.4, as often seen in papers in the experimental stroke literature?
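To make this concrete, a small sketch (with invented Bederson scores) of how such categorical outcome data could be summarized as counts and compared with the Wilcoxon-Mann-Whitney test rather than a t-test:

from collections import Counter
from scipy.stats import mannwhitneyu

# Hypothetical Bederson scores (ordinal categories 0-4), invented for illustration
compx = [0, 1, 1, 1, 2, 2, 1, 0, 2, 1]
control = [1, 2, 2, 3, 2, 3, 1, 2, 3, 2]

# Describe categorical data with counts/frequencies, not with mean +/- SD
print("CompX:  ", sorted(Counter(compx).items()))
print("control:", sorted(Counter(control).items()))

# Wilcoxon-Mann-Whitney (Mann-Whitney U) test for two independent ordinal samples
stat, p = mannwhitneyu(compx, control, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.3f}")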

2.2 Test Statistics

Hypothesis-driven experimental stroke research, like most areas of basic biomedicine, tries to formalize the acceptance or rejection of its hypotheses with frequentist statistical tests, so-called null hypothesis statistical testing (NHST). The foundations for statistical testing were laid at the beginning of the twentieth century by William Sealy Gosset, working under the pseudonym "Student" (hence Student's t-test). NHST was further developed and popularized more than 80 years ago by the statistician, geneticist, and eugenicist R. A. Fisher as a conveniently mechanical procedure for statistical data analysis.


Fig. 2 Examples of erroneous (a) and correct (b) analysis and graphical display of categorical functional outcome data in stroke research. (a) It is not correct to analyze categorical data with parametric statistics; shown here are bar graphs of the mean outcome score (+ SEM), for which a t-test results in an erroneously statistically significant difference (* = p = 0.006). (b) Correct display of the same categorical data, presented as frequencies (counts) with stacked bars; the Wilcoxon-Mann-Whitney test results in p = 0.01. Note the substantially different p-values and the implication that the wrong analysis might reveal significant differences whereas the correct one does not (or vice versa)

While the indiscriminate use of NHST has been criticized for many decades [4–6] and while several alternatives have been proposed, e.g., Bayesian approaches [5], its convenience and seemingly easy (but in most cases false) interpretation have made it a de facto standard (some say a cult [7]) for any statistical testing in experimental preclinical medicine, including stroke research.
Statisticians know that there is no objective ritual for finding out
whether a hypothesis is true or false. However, since without NHST
it is very unlikely that an experimental stroke paper will get published,
we need to understand it and use it carefully and correctly.

Null Hypothesis and Errors


At the heart of NHST lies the insight that if we test hypotheses by
sampling from large populations, we may err in concluding whether
our hypotheses are true or false. In this process we can make two
types of errors: one is to accept our hypothesis when in fact it is false;
the other is to reject it when it is actually correct. To get a good
mathematical grip on these errors, we can project our reasoning
onto a null hypothesis. If a stroke researcher holds the hypothesis
that Compound X reduces infarct sizes, his/her hypothesis is sup-
ported by rejecting another hypothesis that Compound X has no
effect! More formally expressed, the null hypothesis (H0) states that
there is no difference between the two groups, while the alternative
hypothesis (H1) states that there is a difference, which is what we
usually state in our biological hypothesis. Now two errors may occur:

We may reject H0, but H0 was true (i.e., a false positive), which we call a type I error. Type I errors are quantified by α (very often set to 0.05; see below).
We might accept H0, but H1 was true (i.e., a false negative), which we call a type II error. Type II errors are quantified by β (very often ignored, sometimes set to 0.2; see below).
The key criticism of NHST is that it is a trivial exercise because
H0, which basically states that two treatments or samples are identi-
cal, is always false, and rejecting it is merely a matter of carrying out
enough experiments (i.e., having enough power, see below) [8].

Alpha and the p-Value


The appeal of this approach is that it seems to offer a formal way of deciding whether our hypothesis is correct or not. Conveniently, we might accept an error level α (e.g., 0.05), which is presently interpreted by most (stroke) researchers to represent an index of the importance or size of a difference or relation. In other words, statistical significance appears to imply theoretical or practical significance. This seemingly straightforward interpretation of α, and the ease with which it is calculated through spreadsheets and statistics packages even without understanding its true meaning, underpins the popularity of NHST and p-values in biomedical research and may have often clouded a more (neuro)biologically driven interpretation of the data.
But what is the true meaning of p < 0.05? If we were to repeat
the analysis many times using new data each time and if the null
hypothesis were really true, then on only 5 % of those occasions
would we (falsely) reject it. To understand the consequences of this
statement, we have to first understand the concept of the type II
error (see below).
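This frequentist reading can be illustrated with a small simulation (a sketch in which H0 is true by construction; all numbers are invented):

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_experiments, n_per_group = 10_000, 10
false_positives = 0

for _ in range(n_experiments):
    # H0 is true by construction: both groups are drawn from the same population
    a = rng.normal(100, 30, n_per_group)
    b = rng.normal(100, 30, n_per_group)
    if ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

# Fraction of (falsely) significant results; expected to be close to alpha = 0.05
print(false_positives / n_experiments)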
Does α need to be 0.05? R. A. Fisher, at a time when far fewer hypotheses were tested and papers published, suggested the magical 0.05 level for α, which seems to be universally accepted by now and is regarded by most students as almost a natural constant. There is no theoretical justification for a particular value of α, a fact that some journals even state in their guidelines for authors. An example from the American Physiological Society illustrates this: "If you plan a study in the hopes of finding an effect that could lead to a promising scientific discovery, then α = 0.10 is appropriate. Why? When you define α to be 0.10, you increase the probability that you find the effect if it exists! In contrast, if you want to be especially confident of a possible scientific discovery, then α = 0.01 is appropriate" [9].

Beta and Statistical Power


As stated above, β quantifies the type II error, i.e., the probability of falsely accepting H0 (a false negative). Power is simply derived as 1 − β. Power is

the conditional probability of accepting H1 when it is true. It is


analogous to the concept of resolving power in evaluating optical
instruments. It tells us how likely it is that we can find an effect if
it equals or exceeds a prespecified size. Power increases with effect
size (e.g., reduction in infarct size): the larger the difference
between the parameters tested, the greater the power to detect it.
Increasing sample size decreases the standard error, thereby
increasing power. Conversely, a large variance (i.e., SD) will
decrease power. There is an inverse relation between α and β: increasing α increases power (i.e., decreases β) but also increases the risk of rejecting H0 when it is actually true.
Many papers in experimental stroke research report normalized effect sizes of about 1 (normalized effect size = effect/SD); for example, a reduction of infarct size by 30 mm3 results in a normalized effect size of 1 if the SD was also 30 mm3. A plot of total sample size (i.e., number of animals) vs. power demonstrates that 54 animals are needed to achieve a power of 0.95 (equivalent to β = 0.05) at an α level of 0.05 (Fig. 3). Targeting the same levels for α and β means that we are as afraid of false positives as of false negatives. The plot reveals that at 20 animals (10 per group, a conservative estimate of group sizes prevalent in preclinical stroke research), power is 0.55 (Fig. 3). The "power" of a coin toss is 0.5! In other words, with respect to statistical power a large fraction of the experimental stroke literature is no more predictive than a game of chance [11].
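The numbers in this paragraph and in Fig. 3 can be reproduced, for example, with the statsmodels power module (a sketch; the chapter itself uses G*Power):

from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

# Per-group sample size for a two-sided t-test with effect size d = 1,
# alpha = 0.05 and power = 0.95 (i.e., beta = 0.05)
n_per_group = power_analysis.solve_power(effect_size=1.0, alpha=0.05,
                                         power=0.95, alternative="two-sided")
print(round(n_per_group))        # ~27 per group, i.e., ~54 animals in total

# Power actually achieved with 10 animals per group (20 in total)
achieved = power_analysis.power(effect_size=1.0, nobs1=10, alpha=0.05,
                                alternative="two-sided")
print(round(achieved, 2))        # ~0.55: little better than a coin toss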


Fig. 3 Graphical display of a sample size calculation for a two-sided independent t-test using the freeware program G*Power [10]. α set at 0.05, effect size (mean difference between groups/SD) set at 1. Total sample size vs. power. Note that to detect an effect size as large as 1 SD at α = 0.05 and a power (1 − β) of 0.95, a total sample size of 54 (i.e., group sizes of 27) is required

Most experimental stroke researchers have a misconception of


statistical power. They may even argue that if they have obtained a
statistically significant p-value (e.g., p < 0.05), there is no need to
waste time worrying about type II errors, as they have not accepted,
but rather rejected H0! For various reasons this is a fallacy with
potentially calamitous impact. When H0 is in fact false (the drug
really works, the knockout mouse has a phenotype, etc.), the over-
all error rate is not the α level but the type II error rate β [12]. It is impossible to falsely conclude that H0 is false when in fact it is false! This apparently trivial insight may have important consequences for the interpretation of experimental data in stroke research and ultimately for its translation to the patient: if β is high (i.e., statistical power is low), the probability of being able to reproduce the data decreases. For example, at a power
of 0.5, which is not uncommon in experimental stroke research
(see above), the probability of being able to replicate the findings
of a study stands at 50 % [13]!
Excellent treatments of the concepts of power, sample size cal-
culation, and positive predictive value can be found in [14–16].

Are Most Published Research Findings False?


This is what was famously claimed in a provocative paper by Ioannidis
[17]. What appears silly and outrageous at first sight contains an
unpleasant truth.
It was stated above that α measures the likelihood of falsely rejecting H0 when it is really true, while β measures the likelihood of falsely accepting H0 when H1 is really true. To be able to interpret α and β as likelihoods for our hypothesis being false or correct, we would need to know whether H0 or H1 is correct in the first place, which of course would render statistics superfluous! What we can learn from this is that the interpretation of α and β depends
on the prior probability of our hypothesis (e.g., Compound X
reduces infarct sizes in a stroke model). Because of the grave impli-
cations of this seemingly innocent proposition, I will illustrate it
with three examples adapted from the literature.

Example 1 (Adapted from [5])


An experiment is conducted to see whether thoughts can be trans-
mitted from one subject to another. Subject A is presented with a
shuffled deck of 50 cards and tries to communicate to subject B
whether each card is red or black by thought alone. In the experi-
ment, subject B correctly gives the color of 33 cards. The null
hypothesis is that no thought transference takes place and subject B
is randomly guessing. The observation of 33 correct is significant,
with a (one-sided) p-value of 0.035. Would you now believe that
subject A can transmit thoughts to subject B? Probably not, as you
correctly recognize that much stronger evidence would be required

to reject a highly plausible H0. This makes clear that the p-value
cannot mean the same in all situations. But why would you not hesi-
tate to accept a neuroprotective effect of Compound X at p = 0.035?
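The quoted p-value can be checked with an exact binomial test; a sketch assuming the standard 52-card deck and a guessing probability of 0.5 under H0:

from scipy.stats import binomtest

# 33 of 52 cards called correctly; under H0 ("pure guessing") each call
# is correct with probability 0.5
result = binomtest(33, n=52, p=0.5, alternative="greater")
print(round(result.pvalue, 3))   # close to the 0.035 quoted above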

Example 2 (Adapted from [5])


You take a diagnostic test for a grave disease. The test is very accurate; it has a 0.1 % false positive rate (very good specificity) and a 0.1 % false negative rate (very good sensitivity). The test result comes out positive! What is the probability that you have the disease? Most students (and faculty) will answer 99.9 %. But the information that was missing to answer the question is the incidence of the disease in the population. If it is, for example, 1:50,000, a test which is false positive at a rate of 1:1000 will result in a probability of having the disease of only 2 % (about 1:50). The hypothesis that you do have the disease started with a very low prior probability! In this instructive example, the false positive rate corresponds to the type I error and the false negative rate to the type II error, while the incidence of the disease can be interpreted as the rate of true hypotheses. It is wrong to interpret a p-value as the probability of H0, because this fails to take account of the prior probability of H0.
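The arithmetic behind this example is a direct application of Bayes' theorem; a minimal sketch with the stated rates:

# Bayes' theorem: P(disease | positive test)
prior = 1 / 50_000            # incidence of the disease in the population
sensitivity = 0.999           # 0.1 % false negative rate
false_positive_rate = 0.001   # 0.1 % false positive rate (specificity 99.9 %)

p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
posterior = sensitivity * prior / p_positive
print(round(posterior, 3))    # ~0.02, i.e., about 2 % (roughly 1 in 50)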

Example 3 (Adapted from [6, 18])


Let us assume that 10 % of all H0 in biomedicine are in fact false,
that is, 90 % of all hypotheses are incorrect (assumption 1).
Although a wild guess, it is backed by evidence from several fields
[17]. Let us also assume an α of 0.05 and a power of 0.5 (assumption 2), which is a reasonable estimate for the current experimental
stroke literature (see above). If we now test 1000 hypotheses,
assumption 1 results in 100 studies in which H0 is truly false.
When power is 0.5 (assumption 2), we reject H0 in 50 % of those
studies. In the 900 studies in which treatment does not work
(because our hypothesis was wrong, i.e., H0 was true), we rejected H0 in 45, since α was 0.05 (assumption 2). Thus, in 45 of the 95 studies with a significant result, H0 was actually true; in other words, these studies were false positives! The likelihood that a significant result is true in this example was therefore only about 53 %. This example is a warning to those who
argue that once they have obtained a significant result (p < 0.05),
they need not worry about power, let alone the prior probability of
the hypothesis they were testing (Table 1).
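The bookkeeping of Example 3 (summarized in Table 1) can be written out in a few lines; the numbers below are simply the assumptions stated in the text:

# Assumptions from Example 3
n_hypotheses = 1000
prior_true = 0.10        # 10 % of tested hypotheses are correct (H0 false)
alpha, power = 0.05, 0.5

true_effects = n_hypotheses * prior_true          # 100 studies in which H0 is false
null_effects = n_hypotheses - true_effects        # 900 studies in which H0 is true

true_positives = power * true_effects             # 50 correctly significant results
false_positives = alpha * null_effects            # 45 falsely significant results

ppv = true_positives / (true_positives + false_positives)
print(ppv)   # ~0.53: only about half of the "significant" results are true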

Statistical Significance Testing: A Sizeless Science


Before moving on to a practical example, let us briefly contemplate
the main criticism of the concept of statistical significance. We
need to be aware of the fact that even if we apply significance test-
ing properly (i.e., recognizing α and β), it contains an intrinsic
weakness. Statistical significance testing substitutes exploring the
existence of a phenomenon or the precision by which we measure

Table 1
Number of times we accept and reject the null hypothesis, under plausible assumptions regarding the conduct of medical research (adapted from [6])

Result of exp.                                        Hypothesis wrong (=H0 true):   Hypothesis correct (=H0 false):   Total
                                                      treatment does not work        treatment works
Nonsignificant (i.e., negative) result (=accept H0)   855                            50                                905
Significant (i.e., positive) result (=reject H0)      45                             50                                95
Total                                                 900                            100                               1000

it, for its magnitude. We should ask "What difference does the effect make?" and "Who cares?", but instead we are concerned with formal proof of existence and precision of measurement. Increasing the sample size is a tedious but nevertheless unconditionally successful exercise to demonstrate a statistically significant difference. As Ziliak and McCloskey [7] have put it: through statistical testing, science has become "sizeless". Gosset was already aware of its shortcomings at the beginning of the twentieth century and warned against its "unintelligent" use. As it stands, exercising radical criticism of statistical significance and shunning significance testing will not help get scientific papers published. But, thus compelled to use it, we need to use it intelligently, cognizant of its limitations. Beyond statistical significance testing, we should:
Focus on biological (or clinical) significance, which is measured in the size of an effect, and on the implications the effect has for the biological system or organism (a simple effect size sketch follows below).
Be aware that good evidence that a hypothesized effect is real comes from replication across multiple studies and cannot be inferred from the result of a single statistical test!
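As a simple illustration of the first point, a sketch (with invented numbers) of how the size of an effect, rather than only a p-value, could be reported, using the normalized effect size (Cohen's d) defined later in this chapter:

import numpy as np

def cohens_d(control, treatment):
    """Normalized effect size: (mean_control - mean_treatment) / pooled SD."""
    control, treatment = np.asarray(control, float), np.asarray(treatment, float)
    n1, n2 = len(control), len(treatment)
    pooled_sd = np.sqrt(((n1 - 1) * control.var(ddof=1) +
                         (n2 - 1) * treatment.var(ddof=1)) / (n1 + n2 - 2))
    return (control.mean() - treatment.mean()) / pooled_sd

# Hypothetical infarct volumes in mm3, invented for illustration
control = [155, 140, 172, 160, 151, 168, 149, 158]
compx = [128, 119, 140, 125, 133, 146, 122, 131]

print(round(np.mean(control) - np.mean(compx), 1))   # absolute reduction in mm3
print(round(cohens_d(control, compx), 2))            # the same effect in units of SD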

Exploratory Versus Confirmatory Preclinical Stroke Studies


Regarding study design and statistical analysis, an important dis-
tinction between two different modes of investigation should be
made [19]. Most preclinical stroke researchers are confronted with
two overarching agendas: selecting interventions or possible mech-
anisms amid a vast field of potential candidates and producing
rigorous evidence of clinical promise for a small number of inter-
ventions developed further in human trials. Hence, two different,
complementary modes of investigation can be discriminated: in the
first (exploratory investigation), stroke researchers should aim at
generating robust pathophysiological theories of disease. In the
second (confirmatory investigation), researchers should aim at

demonstrating strong and reproducible treatment effects in animal


models relevant to human stroke. Each mode entails different
study designs, confronts different validity threats, and supports dif-
ferent kinds of inferences. Many preclinical stroke studies appear to
operate in both modes simultaneously, aiming to report novel
pathophysiological mechanisms while at the same time presenting
evidence for novel treatment. This is highly problematic, as an
exploratory study sets out to find what might work in an almost
limitless space of mechanisms, molecules, dosages, etc., while a confirmatory study needs to weed out false positives so that only robust strategies are developed further in humans. Exploratory studies
may therefore put a premium on sensitivity, that is, be forgiving
with respect to type I errors. On the other hand, confirmatory
studies need high specificity, but may be forgiving with respect to
type II errors. Table 2 lists how relevant design and analysis fea-
tures compare in exploratory and confirmatory preclinical studies.

Preclinical Randomized Controlled Trials (pRCTs)


In this context a novel type of preclinical study has to be men-
tioned, which could be considered the most extreme variant of a
confirmatory animal study: pRCTs, many features of which are
adapted from clinical randomized controlled trials, may help to

Table 2
Comparison of design and analysis features in exploratory and confirmatory studies

                                                            Exploratory (discovery)   Confirmatory
Hypothesis                                                  +                         +++
Establish pathophysiology                                   +++                       +
Sequence and details of experiments established at onset    +                         +++
Primary endpoint                                            –                         ++
Sample size calculation                                     +                         +++
Blinding                                                    +++                       +++
Randomization                                               +++                       +++
External validity (aging, comorbidities, etc.)              –                         ++
In-/exclusion criteria                                      ++                        +++
Test statistics                                             +                         +++
Preregistration                                             –                         ++
High sensitivity (high type I error rate, low type II
  error rate): find what might work                         +++                       +
High specificity (low type I error rate, high type II
  error rate): weed out false positives                     +                         +++

overcome the problem that single laboratories alone cannot conduct experiments of sufficient power and robustness to justify the expensive and potentially hazardous transition from animal to human studies. A pRCT includes several hundred animals, harmonizes proto-
cols, and features a prospective, blinded, and randomized study
design, central monitoring of data quality, as well as centralized
analysis and data deposition [20].

Further Reading
Excellent general treatments of the statistical analysis of preclinical
studies can be found in [21–24]. For further in-depth discussion
on many of the issues treated above, readers are directed to my sci-
ence blog [25], which covers topics such as reproducibility, replica-
tion, statistics, power, bias, and bench to bedside translation,
among others.
Let us now look at a practical example of an experimental
stroke study.

3 A How-To Example

Before we start conducting stroke experiments, we need to:


Formulate the underlying biological hypothesis.
Lay down the experimental design with which we aim to prove
or disprove the hypothesis.
1. Hypothesis: In our example, the simple hypothesis, based
on prior evidence (e.g., in cell culture) and knowledge of
the mechanism(s) of action, is that Compound X is a neu-
roprotectant, decreases infarct sizes, and improves out-
come in experimental stroke.
2. We now design the experiment according to SOPs: We
decided to test Compound X in a mouse model of stroke.
Infarct sizes will be measured volumetrically (Chap. 15)
72 h after 45 min of middle cerebral artery occlusion
(MCAO; Chap. 4). Compound X or saline will be applied
intraperitoneally 2 h after MCAO. Inclusion and exclusion
criteria, details of randomization, blinding, etc., are defined
a priori; in other words, we are following our standard
operating procedures (see Chap. 17).
3. We now need to predefine the statistical tests we are going
to apply. This needs to be done a priori. The infarct volume
data will be numerical, but it is unclear whether it will be
distributed normally. Assuming normality (and testing for
it after data acquisition), we might compare the two inde-
pendent group means with a two-tailed t-test (or for more

than two groups, with an ANOVA and appropriate post


hoc contrast). To be on the safe side with respect to nor-
mal distribution, we opt for the nonparametric Mann-
Whitney U test (or if there are more than two variables,
the equivalent Kruskal-Wallis one-way analysis of variance).
As stated above, we will accept a type I error level of
0.05. As outcome test we have opted for a modification of
the simple Bederson score [1, 26, 27], which produces the
following categories: 0 = no deficit, 1 = weakness of the
contralateral forepaw, 2 = circling, 3 = loss of righting reflex,
and 4 = no motor activity. Since our score is a categorical
variable, we will test for statistical significance between
both groups with a Wilcoxon-Mann-Whitney test [28].
4. The next step is to calculate the necessary group (sample)
sizes. From previous experiments we know that in the
model we are using, SDs are around 25 % of the mean
infarct volume. If you have not had previous experience,
you need either to conduct a pilot series (recommended) or
use data from similar experiments in the literature. If we
want to detect a change in infarct volume greater than 25 %
by Compound X, this corresponds to an effect size of 1
(effect size d = (mean infarct volume of controls − mean infarct volume of treated animals)/pooled SD [29]). If we set power at 0.8 and α at 0.05 (see above), the required sample size (two-tailed), calculated a priori, will be 17 animals per group. All major
statistical packages will allow you to calculate sample sizes
and power. One very convenient and intuitive freeware
program is G*Power [10], which I highly recommend.
To calculate sample sizes for nonparametric tests such as
the ones we perform to analyze behavioral testing after
stroke (see above), we can either use expensive, specialized
software such as nQuery or PASS, or we can use G*Power
(or any other program that can do power calculations). To
do this, we find an approximate size by estimating the sam-
ple size for a t-test and then adjusting this size based on the
asymptotic relative efficiency (ARE) of the nonparametric
test [30]. So, for example, to estimate sample size for a
Wilcoxon-Mann-Whitney test (see above), we compute
the sample size needed for a two-sample t-test and then
adjust the sample size based on the ARE of the Wilcoxon-
Mann-Whitney test relative to the t-test. To avoid compli-
cated assumptions about the underlying distribution, one
could examine a worst case scenario, since the ARE for the
Wilcoxon-Mann-Whitney test is never less than 0.864.
Thus, in this conservative and simple approach, estimate
the sample size for a t-test (see above) and divide by 0.864.
For example, to detect a change in the Bederson score of

the magnitude given in the example of Fig. 2b, at a power of 0.8 and an α of 0.05, a total sample size of approximately 35 (= 30/0.864) would be required (a code sketch of these calculations follows after this list).
5. After the series of experiments has been conducted and the
data evaluated by investigators blinded to treatment alloca-
tion, we first produce graphical displays of the data in the
form of scatter plots (see Fig. 1c, d) and describe the data by
calculating means and SDs (or even better, confidence inter-
vals) for numerical data (infarct volumes) and contingency
tables for the categorical data (outcome score), which is
graphically represented by stacked bar graphs (Fig. 2b).
6. We then perform the prespecified statistical tests. At this
stage no other statistical tests should be used, except for
hypothesis generation for future experiments. It is a com-
mon practice but a misuse of statistics to play with vari-
ous tests or post hoc contrasts until the desired p-value
is produced by the statistics package (p-hacking [31]).
Similarly, we are not allowed to add animals to a series of
predefined group sizes to target a certain p-value. This may
be tempting, particularly when the resulting p-values only
narrowly miss the significance level.
7. After all these calculations for significance and plotting of
asterisks (p < 0.05), we should try to forget them for a
moment and answer the relevant question: what is the bio-
logical significance of the difference observed?
8. Finally, a word of caution and a recap of what was said above: if the results of our example proved to be statistically significant (p slightly below 5 %) at a power of 80 % and with an a priori probability of the hypothesis of 10 %, the false discovery rate (the chance of having obtained a false positive result) would be close to 40 % [32], not 5 %.
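A minimal sketch of the sample size calculations referred to in step 4, using the statsmodels power module in place of G*Power (an assumption; any statistics package that does power calculations will do):

import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Step 4, parametric case: effect size d = 1 (a 25 % reduction in infarct volume
# when the SD is 25 % of the mean), alpha = 0.05, power = 0.8, two-sided test
n_ttest = analysis.solve_power(effect_size=1.0, alpha=0.05, power=0.8,
                               alternative="two-sided")
print(math.ceil(n_ttest))            # 17 animals per group

# Step 4, nonparametric case: divide the t-test sample size by the worst-case
# asymptotic relative efficiency (ARE) of the Wilcoxon-Mann-Whitney test, 0.864
print(math.ceil(n_ttest / 0.864))    # 20 animals per group for the d = 1 example
print(math.ceil(30 / 0.864))         # 35 animals in total for the score example above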

References

1. Bederson JB, Pitts LH, Tsuji M, Nishimura MC, Davis RL, Bartkowski H (1986) Rat middle cerebral artery occlusion: evaluation of the model and development of a neurologic examination. Stroke 17:472–476
2. Buchan A, Pulsinelli WA (1990) Hypothermia but not the N-methyl-D-aspartate antagonist, MK-801, attenuates neuronal damage in gerbils subjected to transient global ischemia. J Neurosci 10:311–316
3. Hunt H. Boxplots in Excel. http://staff.unak.is/not/andy/StatisticsTFV0708/Resources/BoxplotsInExcel.pdf. 29 Dec 2015
4. Harlow LL, Mulaik SA, Steiger JH (eds) (1997) What if there were no significance tests? Lawrence Erlbaum Associates, London
5. O'Hagan A, Luce R (2003) A primer on Bayesian statistics in health economics and outcomes research. https://www.shef.ac.uk/polopoly_fs/1.80635!/file/primer.pdf. 29 Dec 2015
6. Sterne JAC, Smith GD (2001) Sifting the evidence – what's wrong with significance tests? Br Med J 322:226–231
7. Ziliak ST, McCloskey DN (2008) The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. Univ of Michigan Press, Ann Arbor
8. Kirk RE (1996) Practical significance: a concept whose time has come. Edu Psychol Meas 56:746–759
9. Curran-Everett D, Benos DJ (2004) Guidelines for reporting statistics in journals published by the American Physiological Society. Am J Physiol 28(3):85–87
10. G*Power 3. http://www.gpower.hhu.de/. 29 Dec 2015
11. Dirnagl U (2006) Bench to bedside: the quest for quality in experimental stroke research. J Cereb Blood Flow Metab 26:1465–1478
12. Schmidt FL, Hunter JE (1997) Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In: Harlow L, Mulaik SA, Steiger JH (eds) What if there were no significance tests? Lawrence Erlbaum Associates, London, pp 37–64
13. Mulaik SA, Raju NS, Harshman RA (1997) There is a time and a place for significance testing. In: Harlow L, Mulaik SA, Steiger JH (eds) What if there were no significance tests? Lawrence Erlbaum Associates, London, pp 66–115
14. Button KS, Ioannidis JP, Mokrysz C, Nosek BA, Flint J, Robinson ES, Munafò MR (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci 14:365–376
15. Krzywinski M, Altman N (2013) Points of significance: power and sample size. Nat Methods 10:1139–1140
16. Simonsohn U (2015) Small telescopes: detectability and the evaluation of replication results. Psychol Sci 26:559–569
17. Ioannidis JP (2005) Why most published research findings are false. PLoS Med 2(8):e124
18. Oakes M (1986) Statistical inference. Wiley, Chichester
19. Kimmelman J, Mogil JS, Dirnagl U (2014) Distinguishing between exploratory and confirmatory preclinical research will improve translation. PLoS Biol 12:e1001863
20. Llovera G, Hofmann K, Roth S, Salas-Pérdomo A, Ferrer-Ferrer M, Perego C, Zanier ER, Mamrak U, Rex A, Party H, Agin V, Fauchon C, Orset C, Haelewyn B, De Simoni MG, Dirnagl U, Grittner U, Planas AM, Plesnila N, Vivien D, Liesz A (2015) Results of a preclinical randomized controlled multicenter trial (pRCT): Anti-CD49d treatment for acute brain ischemia. Sci Transl Med 7:299ra121
21. Motulsky HJ (2015) Common misconceptions about data analysis and statistics. J Pharmacol Exp Ther 351:200–205
22. Festing MF, Nevalainen T (2014) The design and statistical analysis of animal experiments: introduction to this issue. ILAR J 55:379–382
23. Nuzzo R (2014) Scientific method: statistical errors. Nature 506(7487):150–152
24. Aban IB, George B (2015) Statistical considerations for preclinical studies. Exp Neurol 270:82–87
25. Dirnagl U. To infinity and beyond. http://dirnagl.com. 29 Dec 2015
26. Li X, Blizzard KK, Zeng Z, DeVries AC, Hurn PD, McCullough LD (2004) Chronic behavioral testing after focal ischemia in the mouse: functional recovery and the effects of gender. Exp Neurol 187:94–104
27. Plesnila N, Zinkel S, Le DA, Amin-Hanjani S, Wu Y, Qiu J, Chiarugi A, Thomas SS, Kohane DS, Korsmeyer SJ, Moskowitz MA (2001) BID mediates neuronal cell death after oxygen/glucose deprivation and focal cerebral ischemia. Proc Natl Acad Sci U S A 98:15318–15323
28. Emerson JD, Moses LE (1985) A note on the Wilcoxon-Mann-Whitney test for 2 x k ordered tables. Biometrics 41:303–309
29. Cohen J (1988) Statistical power analysis for the behavioral sciences, 2nd edn. Lawrence Erlbaum Associates, Mahwah, NJ
30. Simon S. Sample size for the Mann-Whitney U test. http://www.pmean.com/00/mann.html. 29 Dec 2015
31. Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD (2015) The extent and consequences of p-hacking in science. PLoS Biol 13:e1002106
32. Colquhoun D (2014) An investigation of the false discovery rate and the misinterpretation of p-values. R Soc Open Sci 1:140216
