
Statistical Conclusion Validity

IGS

1
Statistical Conclusion Validity

The validity of inferences about the
correlation (covariation) between
treatment X and outcome Y

2
Glossary

Population parameter: A fixed feature of a population (e.g., the
population mean, the population standard deviation, ...); we
conventionally indicate parameters using Greek letters (μ, σ, ...)
Sample statistic: A feature that varies from one sample to another
(e.g., the sample average, the sample standard deviation, ...)
Estimator: Any function of sample data used to estimate
parameters
Expectation: The mathematical expectation of a variable Y,
indicated as E[Y], is the population average of that variable
When a sample statistic has expectation equal to the corresponding
population parameter, it is said to be an unbiased estimator of that
parameter

3
Formal Statistical Inference

The process of drawing conclusions about a
population based on sample data
Practical questions
How much uncertainty is associated with sample
data?
Do my results constitute strong evidence or just a
lucky draw/chance finding?

4
Formal Statistical Inference (contd)

If we select at random a sample of n units from a population of N
units, every possible sample of size n has the same chance of selection

$$\binom{N}{n} = \frac{N!}{n!\,(N-n)!} \ \text{ possible samples are equally likely}$$

Example: if we select 3 units at random from a population of 8 units,
56 samples are equally likely

$$\binom{8}{3} = \frac{8!}{3!\,5!} = \frac{8\cdot 7\cdot 6\cdot 5\cdot 4\cdot 3\cdot 2\cdot 1}{(3\cdot 2\cdot 1)(5\cdot 4\cdot 3\cdot 2\cdot 1)} = 56$$
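As a quick numerical check, here is a minimal Python sketch (Python 3.8+ is assumed for math.comb):

```python
import math

# Number of equally likely samples of size n = 3 from a population of N = 8
n_samples = math.comb(8, 3)   # N! / (n! (N - n)!)
print(n_samples)              # 56
```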

5
The Mean

For a given population
Only one population mean $\mu = E[Y]$ (parameter)
Many sample averages $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$ that
depend on what units are drawn

6
Unbiasedness of the Sample Mean

If we were to draw infinitely many random samples,
the average of the resulting sample means would be
the population mean

$$E[\bar{Y}] = E[Y] = \mu$$

7
Variability of the Sample Mean
Sampling variance

Population: mean $\mu = E[Y]$; variance $\sigma^2 = E[(Y - \mu)^2]$; std. dev. $\sigma$
Sample: sample mean $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$; sample variance $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2$
Sample mean: sampling variance $V(\bar{Y}) = \sigma^2 / n$; standard error $SE(\bar{Y}) = \sigma / \sqrt{n}$

The SE summarizes the
variability in an estimate due
to random sampling
8
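A small simulation can make the last line concrete; the normal population, its parameters, and the sample size below are illustrative assumptions, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 10, 25
population = rng.normal(loc=50, scale=sigma, size=100_000)

# Draw many random samples and record each sample mean
sample_means = [rng.choice(population, size=n, replace=False).mean()
                for _ in range(5_000)]

print(np.std(sample_means))        # empirical SD of the sample means
print(population.std() / n**0.5)   # theoretical SE = sigma / sqrt(n), about 2
```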
Estimated standard error

The population standard deviation σ is usually
unknown and must be estimated by replacing σ
with the sample standard deviation S:

$$\widehat{SE}(\bar{Y}) = \frac{S}{\sqrt{n}}$$

9
T-statistic for the sample mean

Under the working/null hypothesis E[Y] = μ,

a t-statistic for the sample mean is

$$t(\mu) = \frac{\bar{Y} - \mu}{\widehat{SE}(\bar{Y})} = \frac{\bar{Y} - \mu}{S / \sqrt{n}}$$

10
T-statistic for the sample mean (contd)

If the null hypothesis is

$$H_0:\; E[Y] = \mu = 0$$

a t-statistic for the sample mean is

$$t = \frac{\bar{Y}}{\widehat{SE}(\bar{Y})} = \frac{\bar{Y}}{S / \sqrt{n}}$$
11
Central limit theorem

If E[Y] = μ then, as long as the sample is large
enough, t(μ) has a sampling distribution that is
close to a standard normal distribution (mean of 0 and
standard deviation of 1), irrespective of the
population distribution of Y
In other words, for large samples, the distribution
of a t-statistic is independent of the distribution of
the underlying data
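A simulation sketch of this claim, using a deliberately skewed (exponential) population as an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, mu = 500, 10_000, 1.0    # Exponential(1) has mean 1

t_stats = np.empty(reps)
for r in range(reps):
    y = rng.exponential(scale=1.0, size=n)
    t_stats[r] = (y.mean() - mu) / (y.std(ddof=1) / np.sqrt(n))

print(t_stats.mean(), t_stats.std())   # close to 0 and 1
print(np.mean(np.abs(t_stats) > 2))    # close to 0.046, as for a standard normal
```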

12
Distribution of a t-statistic

13
Hypothesis testing

With standard normal variables, the
frequency of values larger than 2 in absolute
value is about 5%
Any t-statistic larger than 2 in absolute value
is too unlikely to be consistent with the null
hypothesis → we reject the null

14
Confidence interval

If we repeatedly drew infinitely many independent
samples from the same population, the interval

$$\left[\,\bar{Y} - 2\,\widehat{SE}(\bar{Y}),\;\; \bar{Y} + 2\,\widehat{SE}(\bar{Y})\,\right]$$

would include the population mean about 95% of
the time
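A minimal sketch with made-up data, using the rough multiplier of 2 from the slide (1.96 would be the exact normal value):

```python
import numpy as np

y = np.array([4.1, 5.0, 3.8, 4.6, 5.2, 4.9, 4.4, 5.1])  # illustrative data
se = y.std(ddof=1) / np.sqrt(len(y))

ci_low, ci_high = y.mean() - 2 * se, y.mean() + 2 * se
print(ci_low, ci_high)   # approximate 95% confidence interval for the mean
```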

15
Confidence Level

Confidence level: the percentage of all possible independent
samples from the same population that can be expected to
include the true population parameter
If we repeatedly drew infinitely many independent samples
from the same population and we calculated a confidence
interval for each sample, then a certain percentage
(confidence level) of the intervals would include the
unknown population parameter
Confidence intervals are usually calculated so that this
percentage is 95%, but we can produce 90%, 99%, 99.9% (or
whatever) confidence intervals for the unknown parameter

16
Comparison of Two Group Averages

$$\mu_1 = E[Y \mid D = 1]$$

$$\mu_0 = E[Y \mid D = 0]$$

$$H_0:\; \mu_1 - \mu_0 = \delta = 0$$

$$t = \frac{\bar{Y}_1 - \bar{Y}_0}{\widehat{SE}(\bar{Y}_1 - \bar{Y}_0)} = \frac{\bar{Y}_1 - \bar{Y}_0}{S\,\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

where S is the pooled sample standard deviation of the outcome
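A hedged sketch with made-up treatment and control scores; scipy.stats.ttest_ind pools the two sample variances by default, matching the formula above:

```python
import numpy as np
from scipy import stats

treated = np.array([72, 75, 78, 74, 80, 77, 73, 79])  # illustrative scores
control = np.array([70, 68, 74, 71, 69, 73, 72, 70])

t_stat, p_value = stats.ttest_ind(treated, control)   # pooled-variance t-test
print(t_stat, p_value)

# Welch variant, which drops the equal-variance assumption
print(stats.ttest_ind(treated, control, equal_var=False))
```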

17
Significance vs. Effect magnitude

A large t-statistic may be due to

A large effect size

Or a small estimated standard error

18
Null Hypothesis Significance Testing
(NHST)
The null hypothesis (H0) is a claim to be tested,
usually a hypothesis of no difference (e.g., no
difference between test scores in group A and
group B)
The alternative hypothesis (H1) is the one we
would believe if the null hypothesis is rejected
Rejecting H0 does not prove H0 to be false nor H1 to be
true
The only way H0 can be proven false (or true) is to know
the value of the population parameter(s) specified in the
null hypothesis; sample data do not provide that kind of
information
19
p-value

The probability of getting the observed
or more extreme results if the null
hypothesis were true
Following Fisher (1926), we usually say that
results are statistically significant if p < .05
(arbitrary)

20
More on NHST
DECISION

                      Do not reject H0                 Reject H0

       H0 is true     Correct decision                 Type I error (FALSE POSITIVE)
                      Prob = 1 − α                     Prob = α (significance)
TRUTH
       H0 is false    Type II error (FALSE NEGATIVE)   Correct decision
                      Prob = β                         Prob = 1 − β (power)

21
More on NHST (contd)

β = Pr(Type II error) = Pr(do not reject H0 | H0 is false)

Power = 1 − β = 1 − Pr(Type II error)

22
Statistical Conclusion Validity

1. Do X and Y covary?
Type I error (false positive): We may incorrectly
conclude that X and Y covary when they do not
Type II error (false negative): We may incorrectly
conclude that X and Y do not covary when they do
2. How strongly do X and Y covary?
We can over/underestimate
The magnitude of covariation
The degree of confidence that magnitude estimate
warrants

23
Threats to Statistical
Conclusion Validity

24
1. Low Statistical Power

An insufficiently powered experiment may
incorrectly conclude that the relationship between
treatment and outcome is not significant (Shadish et
al. 2002, 55)

Power = 1 − β = 1 − Pr(Type II error)
The ability of a test to detect relationships that exist in
the population
The probability that a statistical test will reject the null
hypothesis when it is false

25
1. Low Statistical Power (contd)

Low power goes together with larger estimated SEs and wider
confidence intervals
Common practice: set β = .20 → power = .80
Important to increase power when missing a real effect
would have negative consequences, e.g. when testing for
harmful effects of a new drug
Low power is a problem when effect sizes are small
Remedy: meta-analysis
Comprehensive list of remedies: Table 2.3 in SKC (Shadish, Cook & Campbell 2002)

26
1. Low Statistical Power (contd)

Factors affecting power
A. Sample size: The larger the sample
size, the higher the power (see Figure 1)
Remedy:
Increasing sample size (sometimes
expensive/difficult)

27
Figure 1: The relationship between sample size (n) and power for
H0: μ = 75, real μ = 80, one-tailed α = 0.05, for σ's of 10 and 15.
Source: Lane 2015
28
1. Low Statistical Power (contd)

Factors affecting power
B. Standard deviation (SD): The smaller
the SD, the higher the power (see Figure 1)
Remedies:
Sampling from a homogeneous
population
Reducing random measurement error

29
Figure 1: The relationship between sample size (n) and power for
H0: μ = 75, real μ = 80, one-tailed α = 0.05, for σ's of 10 and 15.
Source: Lane 2015
30
1. Low Statistical Power (contd)

Factors affecting power (contd)
C. Effect size (i.e. difference between
hypothesized and true parameter):
Easier to detect larger effects (see
Figure 2)

31
Figure 2. The relationship between the true mean μ and power for
H0: μ = 75, one-tailed α = 0.05, for σ's of 10 and 15.
Source: Lane 2015


32
1. Low Statistical Power (contd)

Factors affecting power (contd)
D. Significance level (α): The lower the α
(i.e. the probability of a Type I
error/false positive), the lower the
power (see Figure 3)
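Factors A–D can be seen together in a power calculation; this is a sketch assuming statsmodels' TTestIndPower and an illustrative standardized effect size d = (true − hypothesized) / σ:

```python
from statsmodels.stats.power import TTestIndPower

calc = TTestIndPower()
d = 5 / 10          # e.g., a 5-point effect with sigma = 10

# A. Larger sample size -> higher power
for n in (20, 50, 100):
    print(n, calc.power(effect_size=d, nobs1=n, alpha=0.05))

# B./C. Larger SD (sigma = 15) shrinks d -> lower power
print(calc.power(effect_size=5 / 15, nobs1=50, alpha=0.05))

# D. Lower alpha -> lower power
print(calc.power(effect_size=d, nobs1=50, alpha=0.01))

# Sample size per group needed for power = .80
print(calc.solve_power(effect_size=d, alpha=0.05, power=0.80))
```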

33
Figure 3. The relationship between significance level (α) and
power with a one-tailed test: H0: μ = 75, real μ = 80, and σ = 10.
Source: Lane 2015
34
2. Violated Assumptions of the Test Statistics

Violations of statistical test assumptions can lead to
either overestimating or underestimating the size and
significance of an effect (Shadish et al. 2002, 55)
Example: If we ignore the hierarchical/multilevel
structure of the data (e.g., soccer players nested
within teams, students nested within classes), we
may severely underestimate standard errors and
conclude that effects that might be ascribed to
chance are real (i.e. a higher risk of Type I error);
see the sketch below
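A simulation sketch of this example (illustrative data, not from the slides): a team-level regressor with no real effect, analyzed first naively and then with cluster-robust standard errors via statsmodels:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
teams = np.repeat(np.arange(30), 10)            # 30 teams, 10 players each
x = rng.normal(size=30)[teams]                  # team-level regressor (no true effect)
y = rng.normal(size=30)[teams] + rng.normal(size=teams.size)  # team shock + noise
df = pd.DataFrame({"y": y, "x": x, "team": teams})

naive = smf.ols("y ~ x", data=df).fit()         # treats players as independent
clustered = smf.ols("y ~ x", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["team"]}
)

print(naive.bse["x"], clustered.bse["x"])       # naive SE is typically too small
```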

35
3. Fishing and the Error Rate Problem

Repeated tests for significant relationships, if uncorrected for
the number of tests, can artifactually inflate statistical
significance (Shadish et al. 2002, 55)
If the nominal α = .05, the actual (familywise) α = .923 when the test is
repeated fifty times (Maxwell & Delaney 1990); see the sketch below
Examples:
Fishing until we find a significant effect
Multiple researchers analyzing the same data
Remedy:
Bonferroni correction: divide the target α by the number of tests and use
the Bonferroni-corrected α in all individual tests
Bonferroni and other corrections may be too conservative in low-powered studies (high
risk of Type II error)
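A quick check of both numbers (assuming 50 independent tests):

```python
alpha, k = 0.05, 50

# Familywise probability of at least one false positive across k independent tests
print(1 - (1 - alpha) ** k)   # about 0.923

# Bonferroni-corrected per-test significance level
print(alpha / k)              # 0.001
```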

36
4. Unreliability of Measures

Measurement error weakens the relationship between two variables
and strengthens or weakens the relationship between three or more
variables (Shadish et al. 2002, 55); see the simulation sketch after the remedies below
With three or more variables, unreliability of measures can lead to
either false positives or false negatives → what does that mean?
Particularly problematic in longitudinal studies that assess change
over time
Remedies:
Increasing the number of measurements
More items to measure the same concept
Multiple raters
Improving the quality of measures
Using validated scale items
Training for raters
Techniques like latent variable modelling
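A simulation sketch (illustrative coefficients) of the two-variable case: adding random measurement error to both variables attenuates the observed correlation:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

true_x = rng.normal(size=n)
true_y = 0.6 * true_x + 0.8 * rng.normal(size=n)   # true relationship

# Observed scores = true scores + random measurement error
obs_x = true_x + rng.normal(size=n)
obs_y = true_y + rng.normal(size=n)

print(np.corrcoef(true_x, true_y)[0, 1])  # correlation without measurement error
print(np.corrcoef(obs_x, obs_y)[0, 1])    # attenuated (weaker) correlation
```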

37
4. Unreliability of Measures (contd)

Structural equation modeling (SEM): family of
statistical modeling techniques (e.g., confirmatory
factor analysis, path analysis) to test theoretical
models
Two main components
Measurement model: uses observed variables (e.g.,
survey items) to define latent constructs (e.g., happiness,
self-efficacy, intelligence)
Structural regression model: system of simultaneous
regression equations to estimate paths linking the latent
constructs

38
5. Restriction of Range

Reduced range on a variable usually weakens the
relationship between it and another variable (Shadish et al.
2002, 55)
Small range → lower power (see the simulation sketch below)
This problem can affect either the
Independent variable (IV). Example:
Comparing two similar treatments → Remedy: using different
treatment doses and even full-dose vs. no treatment
Dependent variable (DV). Examples:
Dummies
Floor effects (respondents cluster near the bottom)
Ceiling effects (respondents cluster near the top)
Remedy: Using models that are appropriate for limited
variables (e.g., Tobit, truncated regression, Heckman)
39
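A simulation sketch (illustrative numbers) of how restricting the range of the IV weakens the observed relationship:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=50_000)
y = 0.5 * x + rng.normal(size=50_000)

print(np.corrcoef(x, y)[0, 1])               # full-range correlation

keep = x > np.quantile(x, 0.75)              # keep only the top quarter of x
print(np.corrcoef(x[keep], y[keep])[0, 1])   # noticeably weaker correlation
```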
6. Unreliability of Treatment Implementation

If a treatment that is intended to be implemented in a
standardized manner is implemented only partially for some
respondents, effects may be underestimated compared with
full implementation (Shadish et al. 2002, 55)
Common in field experiments
It usually decreases effect size, but it can also increase the
effect size when implementation is tailored to the recipients
Important to measure all components of the treatment
package

40
7. Extraneous Variance in the Experimental
Setting
Some features of an experimental setting may inflate error,
making detection of an effect more difficult (Shadish et al.
2002, 55)
Example: Fire drill or concert downstairs during lab
experiment
Particularly frequent in field experiments
When sources of extraneous variance cannot be controlled,
we should measure them and include them in the statistical
analysis

41
8. Heterogeneity of Units

Increased variability on the outcome variable within conditions
increases error variance, making detection of a relationship more
difficult (Shadish et al. 2002, 55)
Heterogeneity of respondents on an outcome variable increases
standard deviations on that variable and on any other correlated with
it → a weaker (harder to detect) treatment effect
Remedies
Sample units that are similar on characteristics correlated with outcome
Potential risks:
Lower external validity
Limited range on DV
Measure respondent characteristics that interact with a cause-effect
relationship and use them for blocking or as covariates
Within-participants designs comparing pre- and post-test scores for each
participant

42
9. Inaccurate Effect Size Estimation

Some statistics systematically overestimate or
underestimate the size of an effect (Shadish et al.
2002, 55)
Examples
Outliers (departing from the normal distribution) can
dramatically decrease effect sizes
Analyzing binary outcomes with effect size measures
intended for continuous variables (correlation coefficient
or standardized mean difference statistic) →
underestimation of effect size

43
Internal Validity

44
Internal Validity

The validity of inferences about
whether observed covariation
between X (the presumed treatment)
and Y (the presumed outcome) reflects
a causal relationship from X to Y as
those variables were manipulated or
measured
45
Internal Validity (contd)

Local molar causal validity (Campbell 1986)
Local: Causal conclusions are limited to the context
of the particular treatments, outcomes, times,
settings, and persons studied
Molar: Treatments are a complex package
consisting of many components, all of which are
tested as a whole

46
Threats to Internal Validity

47
Threats to Internal Validity

Typically, we infer from an effect to a cause by
eliminating other possible causes (Mackie 1974, p.
67)
Threats to internal validity are those other possible
causes
Different threats are not necessarily independent

48
1. Ambiguous Temporal Precedence

Lack of clarity about which variable occurred first may yield
confusion about which variable is the cause and which is the
effect (Shadish et al. 2002, 55)
Correlational studies are often unable to answer the
question: Which came first, the chicken or the egg?
Not always: e.g., unlikely that an increase in the sales of
air conditioners increases outside temperature
Particularly tricky because some causation is bidirectional
(reciprocal)
High performance → self-efficacy → higher performance

49
2. Selection

Systematic differences over conditions in respondents'
characteristics that could also cause the observed effect
(Shadish et al. 2002, 55)
Example
A new drug is given only to patients who volunteer to take the new
treatment
The volunteering patients might differ from nonvolunteers in ways
(e.g., sicker, older, etc.) that might affect the outcome
Random assignment eliminates selection bias because
randomly formed groups differ only by chance

50
3. History

[External] Events occurring concurrently with
treatment could cause the observed effect (Shadish
et al. 2002, 55)
Example: A study of psychotherapy with depressed
patients at the time a new antidepressant went on
the market

51
4. Maturation

Naturally occurring changes over time could be
confused with a treatment effect (Shadish et al.
2002, 55)
Whereas maturation is internal (a natural course of
events having to do with some quality of the
participants in the study), history has to do with an
external event of some kind
Example: We may think that an ineffective medicine
works because patients get better by themselves

52
5. Regression Artifacts

When units are selected for their extreme scores,
they will often have less extreme scores on other
variables, an occurrence that can be confused with a
treatment effect (Shadish et al. 2002, 55)
Test theory → every measure has
A true score component reflecting a true ability
Plus a random error component that is normally and
randomly distributed around the mean of the
measure

53
5. Regression Artifacts (contd)

High scores will tend to have more positive random error pushing
them up, low scores will tend to have more negative random error
pulling them down
On the same measure at a later time, or on other measures at the
same time, the random error is less likely to be so extreme
Examples
A compensatory tutoring program for kids in the lowest 10 percent on a
pretest will seem more effective than it actually is because those kids will
tend to improve anyway in the post-test (see the simulation sketch below)
People tend to go to psychotherapy after a shock and organizations tend to
hire consultants after a downturn; clients' measured progress is partly a
movement back toward their stable mean as the temporary shock grows less
acute
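A simulation sketch of the tutoring example (illustrative numbers): with a true score plus random error and no treatment at all, the lowest pretest scorers still "improve" at posttest:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

true_score = rng.normal(50, 10, size=n)
pretest = true_score + rng.normal(0, 5, size=n)    # true score + random error
posttest = true_score + rng.normal(0, 5, size=n)   # fresh random error, no treatment

low = pretest < np.quantile(pretest, 0.10)         # lowest 10% on the pretest

print(pretest[low].mean(), posttest[low].mean())   # posttest mean regresses upward
```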

54
6. Attrition/Mortality

Loss of respondents to treatment or to measurement can produce
artifactual effects if that loss is systematically correlated with
conditions (Shadish et al. 2002, 55)
A special subset of selection bias occurring after the treatment is in
place
Unlike selection bias, attrition is not controlled by random
assignment
Example
If those dropping out of a compensatory tutoring course are the low pretest
test scorers, by the end of the course the participants who remain will be the
ones with higher academic skills
By comparing the average pretest to posttest scores we would overestimate
the effect of the course

55
7. Testing

Exposure to a test can affect scores on subsequent
exposures to that test, an occurrence that can be
confused with a treatment effect (Shadish et al.
2002, 55)
Only in pretest-posttest designs
Example: People commonly improve on
standardized tests such as intelligence tests, SATs, or
GREs, due to practice, familiarity or other forms of
reactivity

56
8. Instrumentation

The nature of a measure may change over time or conditions
in a way that could be confused with a treatment effect
(Shadish et al. 2002, 55)
Only in pretest-posttest designs
Whereas testing involves a change in the participant,
instrumentation involves a change in the instrument
Examples
The spring on a bar press might become weaker and easier to push
over time
Schools often use two different types of tests before and after a
compensatory tutoring course (to reduce the testing threat); if the
level of difficulty is not the same between the two tests, part or all of
any pre-post difference is due to the change in instrument, not to the
course
57
9. Additive and Interactive Effects of Threats to
Internal Validity

The impact of a threat can be added to
that of another threat or may depend on
the level of another threat (Shadish et al.
2002, 55).

58
