
Power Analysis for Traditional and Modern Hypothesis Tests

Kevin R. Murphy Pennsylvania State University

Power Analysis
- Helps you plan better studies
- Helps you make better sense of existing studies
- Is not limited to traditional null hypothesis tests
- Application of power analysis to minimum-effect tests will be discussed

Errors in Null Hypothesis Tests


                          True State of Affairs
Your Decision         No Effect (H0)                 Some Effect
-------------------   ----------------------------   ----------------------------
Reject Null           Type I error: reject the       Power = 1 - β
                      null when it is true (α)
Fail to Reject Null   Correct decision               Type II error: fail to reject
                                                     the null when you should (β)

Power Depends On
- Effect size: How large is the effect in the population?
- Sample size (N): You are using a sample to make inferences about the population. How large is the sample?
- Decision criteria (α): How do you define "significant," and why?

Power Analysis and the F Distribution


- The power of most statistical tests in the social sciences (e.g., ANOVA, regression, t tests, and other linear model statistics) can be evaluated via the familiar F distribution.
- F is a ratio of observed effect to error:

  F = MS_treatments / MS_error

  F = (True Effect + Error) / Error

- The larger the true treatment effect, the larger the F you expect to find.
- If the null hypothesis is correct, E(F) ≈ 1.0. (A quick check follows.)
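A minimal sketch of that last point in Python with SciPy (the tooling and the df values are our choices, not the slides'):

```python
# Under H0, F follows the central F distribution, whose mean
# dfd / (dfd - 2) is close to 1.0 for any reasonable denominator df.
from scipy import stats

dfn, dfd = 7, 200                # illustrative numerator / denominator df
print(stats.f.mean(dfn, dfd))    # 200 / 198 = 1.0101..., i.e. E(F) ~ 1.0
```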

How Does Power Analysis Work?


In the familiar central F distribution with df = (7, 200), 95% of the values fall below 2.00.

[Figure: central F distribution; x-axis: F value, y-axis: frequency. F = 2.0 marks the cutoff for rejecting H0.]
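A one-line check of that 95% figure (SciPy, our choice of tooling):

```python
# alpha = .05 cutoff of the central F with df = (7, 200);
# the slide rounds this to 2.00.
from scipy import stats

print(stats.f.ppf(0.95, 7, 200))   # ~2.06
```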

The Noncentral F Distribution


If the null hypothesis is false, the noncentral F distribution is needed. In the noncentral F distribution below, 75% of the values fall below 2.00; therefore, power = .25.

[Figure: central vs. noncentral F distributions; x-axis: F value, y-axis: frequency, cutoff at F = 2.0.]

A Larger Effect
In the noncentral F distribution below, where the effect is larger, only 30% of the values fall below 2.00; therefore, power = .70. (A sketch covering both cases follows.)

[Figure: central vs. noncentral F distributions with a larger noncentrality; x-axis: F value, y-axis: frequency, cutoff at F = 2.0.]
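Both power figures can be reproduced from the noncentral F CDF. In this sketch the noncentrality values are illustrative guesses chosen to land near the powers quoted above, not numbers taken from the slides:

```python
# Power = P(F > cutoff) under the noncentral F; a larger noncentrality
# (a larger true effect) shifts the distribution rightward and raises power.
from scipy import stats

dfn, dfd, cutoff = 7, 200, 2.0
for nc in (4.0, 12.0):                        # illustrative noncentralities
    power = 1 - stats.ncf.cdf(cutoff, dfn, dfd, nc)
    print(f"nc = {nc}: power = {power:.2f}")  # roughly .25 and .70
```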

Power Functions

[Figure: power function; x-axis: effect size (0 to 1), y-axis: likelihood of rejecting H0 (0 to 1).]

Power Functions
[Figure: power function; x-axis: sample size (25 to 275), y-axis: likelihood of rejecting H0 (0 to 1).]
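One way to generate a curve like this, assuming a one-way design with illustrative values for the number of groups and the effect size (our numbers, not the slides'):

```python
# Power as a function of N for a one-way design with k groups.
# k and f2 (Cohen's f-squared) are illustrative assumptions.
from scipy import stats

k, f2, alpha = 4, 0.05, 0.05
for N in range(25, 301, 25):
    dfn, dfd = k - 1, N - k
    crit = stats.f.ppf(1 - alpha, dfn, dfd)   # nil-test cutoff at this N
    nc = N * f2                               # noncentrality grows with N
    print(N, round(1 - stats.ncf.cdf(crit, dfn, dfd, nc), 2))
```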

How to Increase Power


- Increase N
  - The effects of adding more subjects are not identical to those of adding more observations
- Increase ES
  - Choose a different research question
  - Use stronger treatments or interventions
  - Use better measures
- Use a more lenient alpha
  - p < .05 is driven by force of habit, not necessarily by substantive concerns

Effects of Implementing Power Analysis


- Stronger studies: larger samples, better measures
- Fewer studies: adequate studies are harder to do than most people realize
- Less emphasis, in the long term, on null hypothesis testing

Conducting a Power Analysis


- The classic text in this field is still one of the best sources:
  - Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
- More current (and more accessible) sources include:
  - Lipsey, M. (1990). Design sensitivity. Sage.
  - Murphy, K. & Myors, B. (2004). Statistical power analysis: A simple and general model for traditional and modern hypothesis tests. Erlbaum.

Conducting a Power Analysis


Power analysis software:
- Power and Precision (Biostat): www.PowerAnalysis.com
- One-Stop F Calculator: included in Murphy & Myors (2004)
- PASS (NCSS Software): www.ncss.com/pass.html

Conducting a Power Analysis


In planning studies, you should:
- Assume relatively small effects. If it were reasonable to expect a large effect, you probably wouldn't need to do the study or the test.
- Aim for power of .80 or better. Power of .50 means that significance tests have become a coin flip.

Effect Size Conventions


In the behavioral and social sciences, there are widely followed conventions for describing small, moderate, and large effects:

                                     Small   Moderate   Large
d (standardized mean difference)     .20     .50        .80
Percentage of variance explained     1%      10%        25%

Applications of Power Analysis


Study planning: given ES and α, solve for N.
- Suppose you wanted to compare the effects of four types of training programs, and:
  - You expect small to moderate effects (programs account for 5% of the variation in performance)
  - You use an α level of .05
- Then you need N = 214 to achieve power = .80. (A sketch of this calculation follows.)
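A sketch of that calculation, scanning N until power reaches .80. Texts differ on whether the noncentrality uses PV or f² = PV / (1 - PV), so this lands near, rather than exactly on, 214:

```python
# Solve for N: four groups, programs account for 5% of variance, alpha = .05.
from scipy import stats

k, pv, alpha, target = 4, 0.05, 0.05, 0.80
f2 = pv / (1 - pv)          # Cohen's f-squared from proportion of variance
N = k + 2
while True:
    dfn, dfd = k - 1, N - k
    crit = stats.f.ppf(1 - alpha, dfn, dfd)
    if 1 - stats.ncf.cdf(crit, dfn, dfd, N * f2) >= target:
        break
    N += 1
print(N)   # in the neighborhood of the slide's N = 214
```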

Applications of Power Analysis


Study evaluation: given N and α, solve for ES.
- Suppose you wanted to compare the effects of four safety interventions, and:
  - You have 44 subjects available
  - You use an α level of .05
- Then you will achieve power = .80 only if the effects of the interventions are truly large (accounting for 25% of the variance in outcomes). (A sketch follows.)
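The same machinery run in the other direction, scanning effect size at a fixed N (same caveat about the noncentrality parameterization):

```python
# Solve for ES: N = 44, four interventions, alpha = .05; scan PV upward
# until power reaches .80.
from scipy import stats

N, k, alpha = 44, 4, 0.05
dfn, dfd = k - 1, N - k
crit = stats.f.ppf(1 - alpha, dfn, dfd)
pv = 0.01
while 1 - stats.ncf.cdf(crit, dfn, dfd, N * pv / (1 - pv)) < 0.80:
    pv += 0.01
print(round(pv, 2))   # lands in the "large" range (roughly PV = .20-.25)
```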

Applications of Power Analysis


Making a rational choice regarding α: given N and ES, solve for α.
- Suppose you wanted to compare the effects of two leadership development programs, and:
  - You have 200 subjects available
  - You expect a small difference (d = .20, or 1% of the variance explained by programs)
- Then you will achieve power = .64 using α = .05, but only power = .37 using α = .01. (A sketch follows.)
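A sketch of the α tradeoff. Exact power values depend on the test and tables used, so treat the .64 and .37 above as the authors' figures rather than something this snippet reproduces exactly; what matters is the direction of the change:

```python
# Power at two alpha levels for a small effect (PV ~ .01), N = 200, k = 2.
from scipy import stats

N, k, pv = 200, 2, 0.01
dfn, dfd = k - 1, N - k
nc = N * pv / (1 - pv)
for alpha in (0.05, 0.01):
    crit = stats.f.ppf(1 - alpha, dfn, dfd)
    print(alpha, round(1 - stats.ncf.cdf(crit, dfn, dfd, nc), 2))
# tightening alpha from .05 to .01 costs a substantial amount of power
```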

Moving Beyond Traditional Significance Testing


- Traditional null hypothesis tests are the focus of most power analyses.
- These tests are deeply flawed, and there is relatively little research on the power of alternatives.
- Minimum-effect tests represent one useful alternative.

Nil Hypothesis Testing


- Testing the hypothesis that treatments, interventions, etc. have no effect at all (the nil hypothesis test, NHT) is the most common and least useful thing social and behavioral scientists do.
- Two problems loom largest:
  - Confusion over Type I errors
  - The likelihood of rejecting the null hypothesis eventually reaches 1.0 as N grows, regardless of the research question

Type I Errors are Very Rare


- Type I error: rejecting H0 when it is true.
- If H0 is never true, it is impossible to make a Type I error; if H0 is very unlikely, a Type I error is even less likely.
  - H0: the treatment had NO effect at all
  - H1: SOMETHING happened
- Most things we do to minimize Type I errors lead to more Type II errors.

This Implies
- The large literature on protecting yourself from Type I errors is not really useful.
- NHTs yield one of two outcomes:
  - They confirm the obvious: you reject H0, which you already know is likely to be wrong.
  - They confuse you: you accept H0 even though you know it is likely to be wrong.

In NHT, All You Need Is N

- As N increases, the likelihood of rejecting the nil hypothesis approaches 1.0.
- Power to reject H0 does not depend all that much on the phenomenon:
  - If N is big enough, you will reject H0; if N is small enough, you won't.
- Significance tests are an indirect index of how many subjects showed up. (The sketch below illustrates the point.)
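A sketch of that first claim, using a deliberately tiny true effect (PV = .005 is our assumption):

```python
# Even for a trivially small true effect, nil-test power climbs toward 1.0
# as N grows; rejection eventually indexes N more than the phenomenon.
from scipy import stats

pv, alpha, k = 0.005, 0.05, 2
for N in (100, 1_000, 10_000, 100_000):
    dfn, dfd = k - 1, N - k
    crit = stats.f.ppf(1 - alpha, dfn, dfd)
    print(N, round(1 - stats.ncf.cdf(crit, dfn, dfd, N * pv / (1 - pv)), 3))
```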

There Must be a Better Way


- Stop doing significance tests (e.g., Schmidt, 1992)
- Confidence intervals (e.g., APA Task Force, American Psychologist, August 1999)
- Bayesian methods (e.g., Rouanet, Psychological Bulletin, 1996)

There Must be a Better Way


Minimum-effect tests: test the hypothesis that something nontrivial happened.
- Murphy, K. & Myors, B. (2004). Statistical power analysis: A simple and general model for traditional and modern hypothesis tests (2nd ed.). Mahwah, NJ: Erlbaum.
- Murphy, K. & Myors, B. (1999). Testing the hypothesis that treatments have negligible effects: Minimum-effect tests in the general linear model. Journal of Applied Psychology, 84, 234-248.

Minimum-Effect Tests
- H0: treatments have a negligible effect (e.g., they account for 1% or less of the variance)
- H1: the effect of treatments is big enough to care about
- This approach addresses the two biggest flaws of traditional tests:
  - H0 really is plausible; treatments rarely have zero effect, but they often have negligible effects
  - Increasing N does not automatically increase the likelihood of rejecting H0

Minimum-Effect Tests
With minimum-effect tests (METs):
- Type I errors are once again possible, but can be minimized
- The question asked in a MET is no longer trivial; you can actually learn something by doing the test
- Power analysis works exactly the same way in MET as in NHT

Performing Minimum-Effect Tests


- Put your test statistic in a simple, common form (e.g., F).
- Decide what you mean by a "negligible" effect.
- Find or create an F table based on that definition of a negligible effect; this is where the noncentral F distribution comes in.
- Proceed as you would for any traditional NHT. (A sketch follows.)
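Putting those steps together, a minimal sketch; the design numbers are illustrative, and the key move is that the MET cutoff comes from a noncentral F whose noncentrality encodes the "negligible" effect:

```python
# Minimum-effect test: H0 is "treatments account for <= 1% of variance".
from scipy import stats

N, k, alpha = 150, 3, 0.05                       # illustrative design
negligible_pv = 0.01                             # our negligible-effect definition
dfn, dfd = k - 1, N - k
nc0 = N * negligible_pv / (1 - negligible_pv)
crit = stats.ncf.ppf(1 - alpha, dfn, dfd, nc0)   # MET cutoff (> nil cutoff)
F_obs = 5.2                                      # hypothetical observed F
print(crit, "reject negligible-effect H0:", F_obs > crit)
```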

Working with the Noncentral F


- Calculating or deriving noncentral F distributions was once a daunting task; many simple calculators are now available, e.g.:
  http://calculators.stat.ucla.edu/cdf/ncf/ncfcalc.php
- The noncentrality parameter (λ) is a measure of effect size:

  λ = [df_h * (MS_h - MS_e)] / MS_e
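The formula translates directly to code (the MS values below are made up):

```python
# lambda = [df_h * (MS_h - MS_e)] / MS_e, per the slide.
def noncentrality(df_h: float, ms_h: float, ms_e: float) -> float:
    return df_h * (ms_h - ms_e) / ms_e

print(noncentrality(df_h=3, ms_h=12.0, ms_e=4.0))   # 6.0
```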

What Constitutes a Negligible Effect?

- Standards for negligible effects depend on the research area and on the consequences of decisions.
- Aspirin use accounts for very little variance in heart attacks, but the use of aspirin saves thousands of lives at minimal cost.
- In personnel selection, it is relatively easy to account for a large proportion of the variance in performance with simple cognitive tests, so the increase in effectiveness that is defined as negligible might be larger.

Defining a Negligible Effect


- Effect size conventions are useful, but by themselves may not be sufficient; the consequences of errors must also be considered.

                                     Small   Moderate
d (standardized mean difference)     .20     .50
Percentage of variance explained     1%      10%

Power Analysis for MET: Small Effect (d = .20, PV = .01)

[Figure: likelihood of rejecting H0 (0 to 1) as a function of effect size (0 to 1).]

Power Analysis for MET: Small Effect (d = .20, PV = .01)

[Figure: likelihood of rejecting H0, given population d = .30, as a function of sample size (25 to 275).]

Power Analysis for MET: Small Effect (d = .20, PV = .01)

[Figure: likelihood of rejecting H0 (0 to 1) as a function of effect size (0 to 1).]

Power Analysis for MET: Small Effect (d = .20, PV = .01)

[Figure: likelihood of rejecting H0 (0 to .07) as a function of effect size (0 to .25).]

Errors in MET
The potential downsides of METs are:
- Type I errors could actually occur
- Lower power than the corresponding NHT

However:
- You can reduce Type I errors by using larger samples
- The loss of power is more than balanced by the fact that the hypothesis being tested is not a trivial one

Type I Error Rates of Minimum-Effect Tests

[Figure: Type I error rate (0 to .08) as a function of effect size (0 to .25), shown for a smaller and a larger sample.]

Type I vs Type II Errors


- The tradeoff between Type I and Type II errors is more complicated in METs than in nil tests.
- In a MET, alpha is precise only if the true effect size is exactly the same as your definition of "negligible."
- Type II errors are more of a problem with METs:
  - METs are less powerful than NHTs (it is easier to reject the hypothesis that nothing happened than the hypothesis that nothing important happened), but this is not necessarily a bad thing.
  - METs place an even greater premium on large samples, but small samples cause problems even when there is substantial power.

Examples: Comparing Two Treatments

N needed:

True effect    Nil test    MET (1% = negligible)
PV = .05       149         375
PV = .10       79          117

(A sketch below attempts to reproduce these numbers.)
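A sketch assuming the deck's usual conventions (power = .80, α = .05, df_h = 1 for two treatments). Exact values depend on the noncentrality parameterization, so expect numbers near, not necessarily identical to, those in the table:

```python
# N needed for power = .80: nil test vs. MET with 1% defined as negligible.
from scipy import stats

def n_needed(pv_true, negligible_pv=0.0, alpha=0.05, target=0.80):
    N = 6
    while True:
        dfn, dfd = 1, N - 2
        nc0 = N * negligible_pv / (1 - negligible_pv)
        crit = (stats.f.ppf(1 - alpha, dfn, dfd) if negligible_pv == 0
                else stats.ncf.ppf(1 - alpha, dfn, dfd, nc0))
        nc1 = N * pv_true / (1 - pv_true)
        if 1 - stats.ncf.cdf(crit, dfn, dfd, nc1) >= target:
            return N
        N += 1

for pv in (0.05, 0.10):
    print(pv, n_needed(pv), n_needed(pv, negligible_pv=0.01))
```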
