Abstract
Experimental stroke researchers take samples from populations (e.g., certain mouse strains) and make
inferences about unknown parameters (e.g., infarct sizes, outcomes). They use statistics to describe their
data, and they seek formal ways to decide whether their hypotheses are true ("Compound X is a neuroprotectant"). Unfortunately, experimental stroke research at present lacks statistical rigor in designing and analyzing its results, and this may have negative consequences for its predictiveness. This chapter aims to give a general introduction to the dos and don'ts of statistical analysis in experimental stroke research.
In particular, we will discuss how to design an experimental series and calculate necessary sample sizes, how
to describe data with graphics and numbers, and how to apply and interpret formal tests for statistical
significance. A surprising conclusion may be that there are no formal ways of deciding whether a hypothesis is correct or not, and that we should focus instead on biological (or clinical) significance, as measured in the size of an effect, and on the implications of this effect for the biological system or organism. Good
evidence that a hypothesized effect is real comes from replication across multiple studies; it cannot be
inferred from the result of a single statistical test!
Key words Box plot, Categorical data, Confidence interval, Descriptive statistics, Null hypothesis,
Power, p-Value, Sample size, Scatter plot, Statistical significance, Type I/II error, Wilcoxon-Mann-Whitney test
1 Introduction
Ulrich Dirnagl
2 Concepts
2.1 Measures of Variance
These describe the variability of observations and serve as a useful
basis for describing data in terms of probability: standard deviation
(SD), standard error of the mean (SEM), confidence intervals
(CI), etc. Use SDs to describe the spread of your data.
An extremely useful but regrettably underused measure of variance in the experimental stroke literature is the confidence interval (CI; e.g., 95 %: mean ± 2 standard errors of the mean (SEM) if n > 30).
Its virtue lies in its straightforward interpretation: if we were to repeat the experiment a great many times, calculating an interval each time, 95 % of those intervals would contain the true population mean. Confidence intervals thus reflect the precision of the measurement (Fig. 1b, d).
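The interval itself is simple arithmetic. A minimal Python sketch (the infarct volumes below are invented for illustration; with n > 30 the factor 2 is a reasonable stand-in for the exact t quantile):

```python
from math import sqrt
from statistics import mean, stdev

def ci95(data):
    """Approximate 95 % confidence interval: mean +/- 2 * SEM (n > 30)."""
    m = mean(data)
    sem = stdev(data) / sqrt(len(data))
    return m - 2 * sem, m + 2 * sem

# hypothetical infarct volumes in mm^3 (invented data, n = 40)
volumes = [92, 110, 85, 130, 74, 101, 96, 88, 119, 105,
           99, 83, 115, 90, 108, 77, 122, 94, 102, 87,
           111, 98, 80, 125, 93, 106, 89, 117, 100, 84,
           109, 95, 79, 128, 91, 104, 86, 113, 97, 103]
low, high = ci95(volumes)
```

Repeating the experiment would yield a different interval each time; 95 % of such intervals would cover the true population mean.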
The SEM is an estimate of the SD that would be obtained from the means of a large number of samples drawn from the same population. In other words, the SEM is a measure of the precision of an estimate of a population parameter; it should not be used to describe the variability of the data. SEMs have become popular as a means of data description because they produce small whiskers around the mean and are enticingly suggestive of a low variability of the data. It is for good reason, however, that most journals in their instructions to authors ban the use of the SEM for data description.
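The difference is easy to demonstrate: as sample size grows, the sample SD settles near the true population value, while the SEM (= SD/√n) shrinks toward zero. A quick simulation sketch (the population mean of 100 and SD of 20 are arbitrary choices for illustration):

```python
import random
from math import sqrt
from statistics import stdev

random.seed(1)  # reproducible illustration

def sd_and_sem(n, mu=100, sigma=20):
    """Draw n values from a normal population; return sample SD and SEM."""
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    sd = stdev(sample)
    return sd, sd / sqrt(n)

for n in (10, 100, 1000):
    sd, sem = sd_and_sem(n)
    print(f"n={n:4d}  SD={sd:5.1f}  SEM={sem:5.2f}")
```

The SD hovers around 20 regardless of n; only the SEM shrinks, which is why small SEM whiskers say nothing about the spread of the data.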
Unfortunately, statistical consultants flooded with papers ignoring
Fig. 1 Examples of noninformative (a), acceptable (b), and optimal (c, d) graphical displays of numerical infarct volume data in stroke research. (a) Typical but uninformative display as bar graph of mean + SEM. (b) Same data, mean with 95 % confidence interval. (c) Same data, displayed as dot plot of the raw data, overlaid by a box-and-whisker plot (median, first and third quartile, range). (d) Same data, displayed as dot plot of the raw data, overlaid by mean ± 95 % confidence interval. Note that the bar graph in (a) obscures the dispersion of the data, while (c) and (d) allow the viewer a complete assessment
2.2 Test Statistics

Hypothesis-driven experimental stroke research, like most areas of basic biomedicine, tries to formalize the acceptance or rejection of its hypotheses with frequentist statistical tests, so-called null hypothesis statistical testing (NHST). The foundations for statistical testing were laid at the beginning of the twentieth century by William Sealy Gosset, working under the pseudonym Student (hence the Student's t-test). NHST was further developed and popularized more than 80 years ago by the statistician, geneticist, and eugenicist R. A. Fisher as a conveniently mechanical procedure for statistical data analysis. The indiscriminate use of NHST has been criticized for many decades [4–6], and several alternatives have been proposed.
Fig. 2 Examples of erroneous (a) and correct (b) analysis and graphical display of categorical functional outcome data in stroke research. (a) It is not correct to analyze categorical data with parametric statistics. Shown here: bar graph of mean outcome score (+ SEM); a t-test results in an erroneous statistically significant difference (*p = 0.006). (b) Correct display of the same categorical data, presented as frequencies (counts) with stacked bars. The Wilcoxon-Mann-Whitney test results in p = 0.01. Note the substantially different p-values and the implication that the wrong analysis might reveal significant differences, whereas the correct one does not (or vice versa)
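For ordinal scores such as those in Fig. 2, the Wilcoxon-Mann-Whitney test compares rank distributions instead of means. A pure-Python sketch using the large-sample normal approximation with tie correction (the scores below are invented; for the small, tie-heavy samples typical of stroke studies, an exact test (cf. [28]) or statistical software is preferable):

```python
from math import sqrt
from statistics import NormalDist

def mann_whitney_u(a, b):
    """Two-sided Wilcoxon-Mann-Whitney test, normal approximation
    with tie correction. Returns (U, p)."""
    n1, n2 = len(a), len(b)
    n = n1 + n2
    pooled = sorted(a + b)
    # assign midranks: tied values share the average of their ranks
    ranks, ties = {}, 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        t = j - i
        ties += t ** 3 - t                  # tie-correction contribution
        i = j
    r1 = sum(ranks[x] for x in a)           # rank sum of group a
    u = r1 - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = sqrt(n1 * n2 / 12 * ((n + 1) - ties / (n * (n - 1))))
    z = (u - mu) / sigma
    p = 2 * NormalDist().cdf(-abs(z))
    return u, p

# invented outcome scores for two groups
treated = [0, 0, 1, 1, 2, 2, 3]
control = [1, 2, 2, 3, 3, 4, 4]
u, p = mann_whitney_u(treated, control)
```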
– We may reject H0, but H0 was true (i.e., a false positive), which we call a type I error. Type I errors are quantified by α (very often set to 0.05; see below).
– We might accept H0, but H1 was true (i.e., a false negative), which we call a type II error. Type II errors are quantified by β (very often ignored, sometimes set to 0.2; see below).
The key criticism of NHST is that it is a trivial exercise because
H0, which basically states that two treatments or samples are identi-
cal, is always false, and rejecting it is merely a matter of carrying out
enough experiments (i.e., having enough power, see below) [8].
Fig. 3 Graphical display of sample size calculation for a two-sided independent t-test using the freeware program G*Power [10]. α set at 0.05, effect size (mean difference between groups/SD) set at 1. Total sample size vs. power. Note that to detect an effect size as large as 1 SD, at α = 0.05 and a power (1 − β) of 0.95, a total sample size of 54 (i.e., group sizes of 27) is required
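The calculation behind Fig. 3 can be approximated by hand with standard normal quantiles: n per group ≈ 2·((z₁₋α/₂ + z₁₋β)/d)². A sketch (the normal approximation slightly underestimates the exact t-based answer; G*Power's exact computation yields 27 per group here):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.95):
    """Approximate per-group n for a two-sided independent-samples t-test
    (normal approximation to the noncentral-t calculation used by G*Power)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)           # 1.645 for power = 0.95
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

print(n_per_group(1.0))  # 26; the exact t-based value is 27 per group
```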
to reject a highly plausible H0. This makes clear that the p-value
cannot mean the same in all situations. But why would you not hesi-
tate to accept a neuroprotective effect of Compound X at p = 0.035?
Table 1
Number of times we accept and reject the null hypothesis, under plausible assumptions regarding the conduct of medical research (adapted from [6])
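The bookkeeping behind such a table is simple to reproduce. A sketch using illustrative assumptions in the spirit of [6] (1000 hypotheses tested, 10 % of them actually true, power 0.5, α = 0.05; these numbers are illustrative, not necessarily those of the original table):

```python
def nhst_outcomes(n_tests=1000, frac_true=0.10, power=0.5, alpha=0.05):
    """Expected counts of the four possible test outcomes."""
    n_true = n_tests * frac_true   # hypotheses where a real effect exists
    n_false = n_tests - n_true     # hypotheses where H0 holds
    tp = n_true * power            # real effects detected
    fn = n_true * (1 - power)      # real effects missed (type II)
    fp = n_false * alpha           # false positives (type I)
    tn = n_false * (1 - alpha)     # correctly non-significant
    return tp, fn, fp, tn

tp, fn, fp, tn = nhst_outcomes()
false_discovery = fp / (tp + fp)  # share of "significant" results that are wrong
```

With these assumptions, 50 of 95 significant results reflect real effects and 45 are false positives, i.e., almost half of all "significant" findings are wrong, which is one reason a p just below 0.05 for an implausible hypothesis deserves hesitation.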
it, for its magnitude. We should ask "What difference does the effect make?" and "Who cares?", but instead we are concerned with formal proof of existence and precision of measurement. Increasing sample size is a tedious but nevertheless unconditionally successful exercise to demonstrate a statistically significant difference. As Ziliak and McCloskey [7] have put it: through statistical testing, science has become "sizeless." Gosset was already aware of its shortcomings at the beginning of the twentieth century and warned against its "unintelligent" use. As it stands, exercising radical criticism of statistical significance and shunning significance testing will not help get scientific papers published. But, thus compelled to use it, we need to use it intelligently, cognizant of its limitations. Beyond statistical significance testing, we should:
– Focus on biological (or clinical) significance, which is measured in the size of an effect, and the implications the effect has for the biological system or organism
– Be aware that good evidence that a hypothesized effect is real comes from replication across multiple studies and cannot be inferred from the result of a single statistical test!
Table 2
Comparison of design and analysis features in exploratory and confirmatory studies
Further Reading
Excellent general treatments of the statistical analysis of preclinical studies can be found in [21–24]. For further in-depth discussion of many of the issues treated above, readers are directed to my science blog [25], which covers topics such as reproducibility, replication, statistics, power, bias, and bench-to-bedside translation, among others.
Let us now look at a practical example of an experimental
stroke study.
3 A How-To Example
References
1. Bederson JB, Pitts LH, Tsuji M, Nishimura MC, Davis RL, Bartkowski H (1986) Rat middle cerebral artery occlusion: evaluation of the model and development of a neurologic examination. Stroke 17:472–476
2. Buchan A, Pulsinelli WA (1990) Hypothermia but not the N-methyl-D-aspartate antagonist, MK-801, attenuates neuronal damage in gerbils subjected to transient global ischemia. J Neurosci 10:311–316
3. Hunt H. Boxplots in Excel. http://staff.unak.is/not/andy/StatisticsTFV0708/Resources/BoxplotsInExcel.pdf. 29 Dec 2015
4. Harlow LL, Mulaik SA, Steiger JH (eds) (1997) What if there were no significance tests? Lawrence Erlbaum Associates, London
5. O'Hagan A, Luce R (2003) A primer on Bayesian statistics in health economics and outcomes research. https://www.shef.ac.uk/polopoly_fs/1.80635!/file/primer.pdf. 29 Dec 2015
6. Sterne JAC, Smith GD (2001) Sifting the evidence - what's wrong with significance tests? Br Med J 322:226–231
7. Ziliak ST, McCloskey DN (2008) The cult of statistical significance: how the standard error costs us jobs, justice, and lives. Univ of Michigan Press, Ann Arbor
8. Kirk RE (1996) Practical significance: a concept whose time has come. Educ Psychol Meas 56:746–759
9. Curran-Everett D, Benos DJ (2004) Guidelines for reporting statistics in journals published by the American Physiological Society. Am J Physiol 28(3):85–87
10. G*Power 3. http://www.gpower.hhu.de/. 29 Dec 2015
11. Dirnagl U (2006) Bench to bedside: the quest for quality in experimental stroke research. J Cereb Blood Flow Metab 26:1465–1478
12. Schmidt FL, Hunter JE (1997) Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In: Harlow L, Mulaik SA, Steiger JH (eds) What if there were no significance tests? Lawrence Erlbaum Associates, London, pp 37–64
13. Mulaik SA, Raju NS, Harshman RA (1997) There is a time and a place for significance testing. In: Harlow L, Mulaik SA, Steiger JH (eds) What if there were no significance tests? Lawrence Erlbaum Associates, London, pp 66–115
14. Button KS, Ioannidis JP, Mokrysz C, Nosek BA, Flint J, Robinson ES, Munafò MR (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci 14:365–376
15. Krzywinski M, Altman N (2013) Points of significance: power and sample size. Nat Methods 10:1139–1140
16. Simonsohn U (2015) Small telescopes: detectability and the evaluation of replication results. Psychol Sci 26:559–569
17. Ioannidis JP (2005) Why most published research findings are false. PLoS Med 2(8):e124
18. Oakes M (1986) Statistical inference. Wiley, Chichester
19. Kimmelman J, Mogil JS, Dirnagl U (2014) Distinguishing between exploratory and confirmatory preclinical research will improve translation. PLoS Biol 12:e1001863
20. Llovera G, Hofmann K, Roth S, Salas-Pérdomo A, Ferrer-Ferrer M, Perego C, Zanier ER, Mamrak U, Rex A, Party H, Agin V, Fauchon C, Orset C, Haelewyn B, De Simoni MG, Dirnagl U, Grittner U, Planas AM, Plesnila N, Vivien D, Liesz A (2015) Results of a preclinical randomized controlled multicenter trial (pRCT): Anti-CD49d treatment for acute brain ischemia. Sci Transl Med 7:299ra121
21. Motulsky HJ (2015) Common misconceptions about data analysis and statistics. J Pharmacol Exp Ther 351:200–205
22. Festing MF, Nevalainen T (2014) The design and statistical analysis of animal experiments: introduction to this issue. ILAR J 55:379–382
23. Nuzzo R (2014) Scientific method: statistical errors. Nature 506(7487):150–152
24. Aban IB, George B (2015) Statistical considerations for preclinical studies. Exp Neurol 270:82–87
25. Dirnagl U. To infinity and beyond. http://dirnagl.com. 29 Dec 2015
26. Li X, Blizzard KK, Zeng Z, DeVries AC, Hurn PD, McCullough LD (2004) Chronic behavioral testing after focal ischemia in the mouse: functional recovery and the effects of gender. Exp Neurol 187:94–104
27. Plesnila N, Zinkel S, Le DA, Amin-Hanjani S, Wu Y, Qiu J, Chiarugi A, Thomas SS, Kohane DS, Korsmeyer SJ, Moskowitz MA (2001) BID mediates neuronal cell death after oxygen/glucose deprivation and focal cerebral ischemia. Proc Natl Acad Sci U S A 98:15318–15323
28. Emerson JD, Moses LE (1985) A note on the Wilcoxon-Mann-Whitney test for 2 x k ordered tables. Biometrics 41:303–309
29. Cohen J (1988) Statistical power analysis for the behavioral sciences, 2nd edn. Lawrence Erlbaum Associates, Mahwah, NJ
30. Simon S. Sample size for the Mann-Whitney U test. http://www.pmean.com/00/mann.html. 29 Dec 2015
31. Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD (2015) The extent and consequences of p-hacking in science. PLoS Biol 13:e1002106
32. Colquhoun D (2014) An investigation of the false discovery rate and the misinterpretation of p-values. R Soc Open Sci 1:140216