
sample: A subset of the population.
population: The entire group of interest.
parameter: A concept that relates to a population value, e.g. μ, σ.
statistic: The numerical representation of the parameter concept, computed
from the sample, e.g. $\bar{X}$, $s$.
outlier: Data that is way out of line compared to the rest of the values.
estimator: A formula based upon sample data.
estimate: The numerical value obtained by plugging sample data into the
estimator.
2009 Q1: It gives a property of a sample.
repeatability: The same experimental setup gets the same results.
reproducibility: Similar setups at different locations get the same
experimental results.
robust reproducibility: Widely varied experimental setups get supporting
results.

We described samples and populations. We want to make inferences about
populations, and we draw samples from the population to do so. Different
samples give different results.

The population mean is the average of all the outcomes in the population.
The population median is the central point. Variability means how widely
spread out the population is. The variance $s^2$ measures how far the data
are from the sample mean; its square root is the standard deviation $s$.
Outliers affect sample means much more than they do sample medians.

Confidence Interval (CI)
$\bar{X} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$; for $z_{\alpha/2}$: 90% -- 1.645;
95% -- 1.96; 99% -- 2.58
The larger the sample size, the shorter the CI.
The larger the population standard deviation, the larger the sample size
you will need to have a CI of length 2*E.

Sample size: if we know σ and the half-width E
$n = (z_{\alpha/2}\,\sigma/E)^2$; for α = 0.10, $z_{\alpha/2}$ = 1.645;
α = 0.05, $z_{\alpha/2}$ = 1.96; α = 0.01, $z_{\alpha/2}$ = 2.58

Sample size: if we know σ and the Type II error probability β
$n = \sigma^2 (z_{\alpha/2} + z_\beta)^2/\Delta^2$; for β = 0.1, $z_\beta$ = 1.28;
β = 0.2, $z_\beta$ = 0.84

2016 Q1: Can we use a z confidence interval to test whether the sample
mean is equal to μ0 when the variance σ² is known and we have 7
observations? No: hypothesis tests are about population parameters and not
sample statistics. As a scientific tradition, we choose what we want to
disprove as the null and reject it only if the evidence is overwhelming.

Type I error (False Reject) occurs when you say that the null hypothesis is
false when in fact it is true. There is no relationship between sample size
and the probability of a Type I error.
Type II error (False 'Accept') occurs when you say that the null hypothesis
is true when in fact it is false. This often occurs because your study
sample size is too small to detect meaningful departures from H0. BUT, we
never accept the null hypothesis; we just say that at this confidence level
we can't reject H0.
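The confidence-interval and sample-size formulas above can be sketched in Python as follows. This is only an illustration using the standard library; the function names are made up for this example, and the critical values come from `statistics.NormalDist` rather than from the rounded table values (1.645, 1.96, 2.58), so results can differ slightly before rounding up.

```python
import math
from statistics import NormalDist


def z_critical(confidence):
    """Two-sided critical value z_{alpha/2} for a confidence level such as 0.95."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def z_interval(xbar, sigma, n, confidence=0.95):
    """CI for the mean with known sigma: xbar +/- z_{alpha/2} * sigma / sqrt(n)."""
    half_width = z_critical(confidence) * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width


def sample_size_for_half_width(sigma, E, confidence=0.95):
    """Smallest n whose CI half-width is at most E: n = (z * sigma / E)^2, rounded up."""
    return math.ceil((z_critical(confidence) * sigma / E) ** 2)


# Values taken from 2015 Q10 (sigma = 5, E = 1, 90%) and 2009 Q5 (sigma = 3, E = 0.5, 95%).
print(sample_size_for_half_width(5, 1, 0.90))    # 68
print(sample_size_for_half_width(3, 0.5, 0.95))  # 139
```

Quadrupling n halves the half-width, which is the "larger sample, shorter CI" statement in numbers.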

Interquartile Range (IQR): defined as the difference between the 75th and
25th percentiles. Boxplot whiskers go out to the furthest data point within
1.5 IQR of the 75th and 25th percentiles (to the last point of the data).
Points beyond this boundary are outliers. The IQR is a very robust measure
of variability which can be judged graphically.

Qualitative variables (death, forensic match) can be given numbers to make
them random variables.

Statistical power is defined as the probability that you will reject the
null hypothesis when you should reject it. If β is the probability of a
Type II error, power = 1-β. The power depends crucially on the sample size,
which means we can reduce the chance of a Type II error by increasing the
sample size. (We should have a chance of rejecting H0 when n is large
enough.) (The Type I error rate is set in advance; Type II errors can be
avoided.)

Z test (not in the exam)
Need to know σ (nearly impossible). $Z = (\bar{X} - \mu_0)/(\sigma/\sqrt{n})$
Reject the null hypothesis if:
(α = 0.01, confidence = 99%) |Z| > 2.58
(α = 0.05, confidence = 95%) |Z| > 1.96
(α = 0.10, confidence = 90%) |Z| > 1.645
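As a rough illustration of the Z test rejection rule above, the sketch below computes Z, the two-sided p-value, and the reject decision. The function name and the example numbers are hypothetical; only the formula and the rule |Z| > z_{α/2} come from the notes.

```python
import math
from statistics import NormalDist


def z_test(xbar, mu0, sigma, n, alpha=0.05):
    """Two-sided Z test of H0: mu = mu0 with known sigma."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # e.g. about 1.96 for alpha = 0.05
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))   # two-sided tail probability
    return z, p_value, abs(z) > z_crit


# Hypothetical data: sample mean 52 from n = 25 observations, sigma = 5, H0: mu = 50.
z, p, reject = z_test(52, 50, 5, 25, alpha=0.05)
print(z, round(p, 4), reject)
```

The same data give |Z| = 2, which rejects at α = 0.05 but not at α = 0.01, matching the table of cutoffs.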

Normal curve/distribution, also called the Gaussian distribution or the
bell-shaped curve, relies on 2 parameters, μ and σ. If the q-q plot is
nearly straight, then the Gaussian assumption is reasonable. (Explain this
in the test if needed.)
Pr(X < c) → $z = (c - \mu)/\sigma$ → Table 1 (z-value) to compute the
probability.

Not all data are normally distributed; some are skewed. The standard data
transformations are: square root and logarithm.

Student's t-distribution: In real life, the population standard deviation σ
is never known. Estimate it with s. The CI is longer.
$\bar{X} \pm t_{\alpha/2}(n-1)\,s/\sqrt{n}$, n-1 = degrees of freedom (df);
look up t in Table 2.

t test
A large Z value is needed to reject the null hypothesis H0 (compare with
the critical z value).
A small p-value is needed to reject the null hypothesis H0 (compare with α,
the probability of a Type I error).
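The step "Pr(X < c) → z → Table 1" for a normal variable can be sketched as below. `statistics.NormalDist` stands in for the printed z table, and the function name and the numbers in the example are assumptions made for illustration.

```python
from statistics import NormalDist


def prob_below(c, mu, sigma):
    """Return (z, Pr(X < c)) for X ~ Normal(mu, sigma)."""
    z = (c - mu) / sigma          # standardize
    return z, NormalDist().cdf(z)  # table lookup replaced by the normal CDF


# Hypothetical example: mu = 100, sigma = 15, c = 130 gives z = 2.
z, p = prob_below(130, 100, 15)
print(z, round(p, 4))
```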

Sample Standard Error: s.e. = $s/\sqrt{n}$
The sample mean is a random variable.

Central limit theorem (CLT): The bigger the sample, the closer the sample
mean is to being normally distributed. $\bar{X}$ has a sampling distribution
that converges to a normal distribution with increasing sample size, more
slowly for some distributions than others.

A considerable part of basic statistics is to make inferences about the
population mean μ. It is impossible to know the value of μ exactly, because
almost every sample will give you a unique sample mean, and that sample
mean will not equal the population mean.

If p < 0.05, this means that you have rejected the null hypothesis at the
95% confidence level, i.e. with a Type I error rate of 0.05.
If your hypothesis is that μ = 0, then you reject the hypothesis if:
$|t| = |\bar{X}/(s/\sqrt{n})| > t_{\alpha/2}(n-1)$ or p-value < α

One-population comparison
paired t-test (Matched pair t-test, a one-sample t-test on the
differences): The main difference between the paired t-test and the sign
test is that the paired t-test requires the sample differences to follow a
normal distribution. If the conditions of the paired t-test are met, its
power is higher.
In this kind of test, we only have 1 population. If we have 2 variables,
first pair them, compute the difference of the variables, and make
inference on the difference. Now we have only 1 variable. (Subtract within
pairs.)
The differences between two variables can have many outliers. Outliers
affect the sample mean and especially the sample standard deviation,
making the s.d. larger. A large s.d. means larger confidence intervals and
hence less power.
Wilcoxon signed rank test (better than sign test): One sample (compare to
population) or paired sample data. Nonparametric tests such as the Wilcoxon
signed rank test are not so affected by outliers. Use it for one-sample or
paired data that have outliers or do not follow a normal distribution.

Sample Question:
2015 Q9: A lab is deciding whether a field gauge for measuring concrete
compression is equivalent to the laboratory instrument. Each instrument
measures each of 4 specimens. From the output below, using α=.05 are the
field and laboratory instruments equivalent? Explain what is the correct
t-test to use and why you reached your conclusion.
The measurements from each instrument are paired. One pair of measurements
was taken for each specimen. Therefore the matched paired t-test is
appropriate. Since the p-value .0018 is less than .05 we reject the null
hypothesis that the two instruments are equivalent.
Two-population comparison: generally, people quote the "Variances assumed
equal" p-values and CI. If you can reasonably believe that the population
sd's are nearly equal, it is customary to pick the equal variance
assumption and estimate the common standard deviation by:
$S_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$
Standard error: $S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
2 sample t-test (equal variance), also called the pooled t-test or
two-sample t-test
Generally, you should make your sample sizes nearly equal, or at least not
wildly unequal.
$|t| = \left|\frac{\bar{X}_1 - \bar{X}_2}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}\right| > t_{\alpha/2}(n_1 + n_2 - 2)$
or p-value < α

Comparing two population variances using Levene's test: Levene's
p-value < α means unequal variances.

2015 Q10: Suppose that we wish to construct a 90% confidence interval for
the population mean μ that has a half-width of 1. Assume that measurements
are normally distributed with a known σ = 5. What is the minimum sample
size that is needed?
$n = (z_{\alpha/2}\,\sigma/E)^2 = (1.645 \times 5/1)^2 \approx 68$
v2: Suppose that we wish to test a research hypothesis that μ1=20. Assume
that measurements are normally distributed with σ = 5. Determine the
sample size needed to obtain a test having α = .05 and power ≥ .9 when the
difference to detect is Δ ≥ 5.
$n = \sigma^2 (z_{\alpha/2} + z_\beta)^2/\Delta^2 = 25\,(1.96 + 1.28)^2/25 \approx 11$
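The pooled standard deviation, its standard error, and the pooled t statistic above can be computed as in this sketch. The function name and the two small samples are invented for illustration; the critical value $t_{\alpha/2}(n_1 + n_2 - 2)$ still has to be looked up in a t table, since the Python standard library has no t distribution.

```python
import math
from statistics import mean, variance


def pooled_t(sample1, sample2):
    """Return (S_p, standard error, t, df) for the equal-variance two-sample t-test."""
    n1, n2 = len(sample1), len(sample2)
    s1_sq, s2_sq = variance(sample1), variance(sample2)  # sample variances (n - 1 divisor)
    sp = math.sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2))
    se = sp * math.sqrt(1.0 / n1 + 1.0 / n2)
    t = (mean(sample1) - mean(sample2)) / se
    return sp, se, t, n1 + n2 - 2


# Hypothetical samples; compare |t| with the table value for df = 8.
sp, se, t, df = pooled_t([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
print(round(sp, 4), round(se, 4), round(t, 4), df)
```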

Welch's Test (Welch's unequal variances t-test): Like the t-test, but used
when the two samples have unequal variances (and possibly unequal sample
sizes). Choose it when both samples are normally distributed but their
variances differ.

Wilcoxon rank sum test (Rank test): Two-sample nonparametric test.
Because sample means and standard deviations are sensitive to outliers, so
too are comparisons of populations based on them. Rank tests form a robust
alternative that can be used to check the results of t-statistic
inferences. (Q6.18: because of the small sample size and possible lack of
normality.) (If you have data (raw or transformed) that pass q-q plot
tests, then the Wilcoxon and t-test should have much the same p-values.)

2015 Q11: A potato chip manufacturer wants to compare a new salt measuring
instrument against the standard device. The manufacturer intends to
compare them using a paired t-test using several batches of potato chips.
The budget will only allow 30 batches with one measurement per batch. Does
it make sense to do the comparison? Why or why not? Were it to make sense
to implement this comparison, how many batches should be used? Assume that
the standard deviation of the difference is 4%, and that they want 95%
power.
The experiment only makes sense if there is enough power to detect a
meaningful difference. The formula to use is
$n = \sigma^2 (z_{\alpha/2} + z_\beta)^2/\Delta^2$ with any delta of your
choosing. All other parameter values are given in the question.



v2 -- A hospital wants to test a new cancer treatment against the
established treatment. The hospital intends to compare them using a
pooled t-test as they are reasonably certain the survival time, the
outcome variable, has the same variability under both treatments.
The budget will only allow one hundred consenting cancer patients to
be used. Does it make sense to do the test? Why or why not? Were it
to make sense to implement this study, how many patients should be
allocated to each treatment group? Why?
Whether it makes sense to do the test depends upon the desired
power to detect meaningful differences in the population. The
treatments should be allocated as evenly as possible to make the
standard error and confidence interval as small as possible. This type
of sample allocation example was worked in class.



2016 Q7: For crack cocaine, how many measurements must be done
if the true amount of cocaine in a confiscated package is 270 grams,
and the DEA (drug enforcement agency) wants to be 90% sure that
they do not declare it over the sentencing threshold. Use alpha = .05,
and assume that the measurement standard deviation is 10 grams. (X
= 280 grams cocaine)
Sample size n for a statistical test on μ and σ when sampling from a
normal distribution: $n > \sigma^2 (z_{\alpha/2} + z_\beta)^2/\Delta^2$
where σ = 10, $z_{\alpha/2}$ = 1.96, $z_\beta$ = 1.28,
Δ = 280 − 270 = 10; plug into the formula.
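Plugging the 2016 Q7 numbers into the power-based sample-size formula can be sketched as follows. The function name is made up for this example, and it uses exact normal quantiles from `statistics.NormalDist` instead of the rounded 1.96 and 1.28, which does not change the rounded-up answer here.

```python
import math
from statistics import NormalDist


def sample_size_for_power(sigma, delta, alpha=0.05, power=0.90):
    """n = sigma^2 * (z_{alpha/2} + z_beta)^2 / delta^2, rounded up to a whole unit."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)               # about 1.28 for power = 0.90
    return math.ceil(sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2)


# 2016 Q7: sigma = 10, delta = 280 - 270 = 10, alpha = .05, power = .90
print(sample_size_for_power(10, 10))  # 11
# 2015 Q10 v2: sigma = 5, delta = 5 gives the same ratio sigma/delta
print(sample_size_for_power(5, 5))    # 11
```

Because only the ratio σ/Δ enters the formula, halving the difference to detect quadruples the required sample size.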

2016 Q5: Using the Espy file described in class the ages at execution
are compared by race. We are interested in testing whether the mean
ages at execution are equal for all races.
a. State the hypotheses being tested.
H0: μwhite = μblack = μHispanic ; Ha: not H0
b. What is an appropriate test to use from the output below? What is
the appropriate p-value? What is your conclusion regarding the
hypotheses stated in a.
ANOVA (standard deviations are about equal), or Welch test as
variances test differently, or Wilcoxon due to outliers. Any of these
with the correct reason is acceptable.
c. What tests in the output below control the overall error rate
(experiment-wise error rate)?
ANOVA (if we stop there), Welch test (same) or Wilcoxon because its
error rate is adjusted.
d. From your answer(s) to part c, what method has the most power
and why? Using your choice for the most powerful method, if the
means are not all equal, how are they different?
Any of the above answers with the correct reasons are correct. Using
the Wilcoxon: The mean age of Blacks at execution is different from that
of whites and Hispanics.

2009 Q2: Which of the following is true?
• The mean is more affected by outliers than the median.
• Outliers make it harder to reject the null hypothesis of equal means.
• The estimated interquartile range is more affected by outliers than
the estimated standard deviation. (Not true)

2009 Q3: What is the effect of decreasing sample size in a test?
It makes it less likely that the null hypothesis will be rejected.

2009 Q4: What is the typical effect of outliers on p-values?
• It makes them bigger. • It causes increased likelihood of failure to
reject the null hypothesis.

2009 Q5: A transportation agency wants to monitor speeds on a
major highway. The speed measurement standard deviation is 3 MPH,
and the agency wants to detect changes from the speed limit (over or
under) within 0.5 MPH with 95% confidence. How many cars does
the transportation agency have to measure?
$n = (z_{\alpha/2}\,\sigma/E)^2 = (1.96 \times 3/.5)^2 \approx 139$

2009 Q10
• Explain why it is important to account for bias in confidence
intervals. Use your own examples to make your point.

In statistics, bias has just one meaning: the bias of an estimator is
the difference between its expected value and the quantity being
estimated.

The sample mean $\bar{X}$ is an unbiased estimator of the population
mean μ. This means that $\bar{X}$ is accurate, on average; but of course,
for any particular data set X1, . . ., Xn, the sample mean may be
higher or lower than the true value μ. The purpose of a confidence
interval is to supplement the point estimate $\bar{X}$ with information
about the uncertainty in this estimate.

What we can do is to construct an interval of possible values for the
population mean μ. The interval is determined by how much
"confidence" we want in saying that the population mean μ is in
the interval.

• Using your own examples explain the difference between
populations and samples.

Population: The entire collection of individuals of interest.
In NHANES, the population is all women in the U.S. aged 30-50 (but
not those on visits, and due to aging the population changes over
time). Since there are millions of such women, it is impractical to
figure out the health and nutrition for all of them: it would cost
billions of dollars to do so.

Sample: A subset of the population that is measured in lieu of
measuring everyone in the population.
Since we want the sample to represent the population, the goal is
to make sure we sample a representative subset of the population.
In NHANES, women were sampled at random from the population,
the randomness meant to ensure that the sample is representative.

• Why do I say that I never accept a null hypothesis? Use your own
examples to make your point.

Suppose we use a 95% confidence interval and it includes zero: [-3,6].
In this case, the conclusion is 'with 95% confidence, we cannot reject
that the population mean is zero.' A 95% confidence interval is built so
that 95% of such intervals cover the true population mean, so every
value between -3 and 6 remains plausible: the true population mean
could be 5, and is not necessarily 0.

HW 1.2: During 2012, Texas had listed on FracFocus, an industry
fracking disclosure site, nearly 6,000 oil and gas wells in which the
fracking methodology was used to extract natural gas. Fontenot et al.
(2013) reports on a study of 100 private water wells in or near the
Barnett Shale in Texas. There were 91 private wells located within 5
km of an active gas well using fracking, 4 private wells with no gas
wells located within a 14-km radius, and 5 wells outside of the Barnett
Shale with no gas well located within a 60-km radius. They found that
there were elevated levels of potential contaminants such as arsenic
and selenium in the 91 wells closest to natural gas extraction sites
compared to the 9 wells that were at least 14 km away from an active
gas well using the fracking technique to extract natural gas.
• If the sample measurements are used to make inferences about the
population characteristics, why is a measure of reliability of the
inferences important?

We want to relate the level of contaminants of the 100 points in the
sample to the level in the whole suspect area. Thus we need to know
how accurate a portrayal of the population is provided by the
100 points in the sample.

HW 4.5 The gaming commission in its annual examination of the
casinos in the state reported that all roulette wheels were fair. Explain
the meaning of the term fair with respect to the roulette wheel.

"Fair" means that each slot on the wheel has an equal chance of
coming up and that the outcome of one spin does not impact the
outcome on another.

HW 4.108 Random samples of size 5, 20, and 80 are drawn from a
population with a mean of μ = 100 and a standard deviation of σ = 15.
Give the mean of the sampling distribution of $\bar{y}$ for each of the
three sample sizes. Give the standard deviation of the sampling
distribution of $\bar{y}$ for each of the three sample sizes. Based on the
results obtained in parts (a) and (b), what do you conclude about the
accuracy of using the sample mean $\bar{y}$ as an estimate of population
mean μ?


The mean of the sampling distribution of $\bar{y}$ is 100 for each of the
sample sizes 5, 20, and 80. The standard deviation of the sampling
distribution of $\bar{y}$ is 6.708 for sample size 5, 3.354 for sample size
20, and 1.677 for sample size 80. As the sample size increases, the
sampling distribution of $\bar{y}$ concentrates about the true value of μ.
For n = 5 and 20, the value of $\bar{y}$ could be a considerable distance
from 100.
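The three standard deviations quoted in the HW 4.108 answer follow from σ/√n. A minimal check, with a helper name chosen only for this example:

```python
import math


def standard_error(sigma, n):
    """Standard deviation of the sampling distribution of the sample mean."""
    return sigma / math.sqrt(n)


# HW 4.108: sigma = 15, n = 5, 20, 80
for n in (5, 20, 80):
    print(n, round(standard_error(15, n), 3))
# 5 6.708
# 20 3.354
# 80 1.677
```

Each fourfold increase in n halves the standard error, which is why the sampling distribution tightens around μ = 100.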

HW 6.26 Researchers are studying two existing coatings used to
prevent corrosion in pipes that transport natural gas. The study
involves examining sections of pipe that had been in the ground at
least 5 years. The effectiveness of the coating depends on the pH of
the soil, so the researchers recorded the pH of the soil at all 20 sites
at which the pipe was buried prior to measuring the amount of
corrosion on the pipes. The pH readings are given here. Describe how
the researchers could conduct the study to reduce the effect of the
differences in the pH readings on the evaluation of the difference in
the two coatings’ corrosion protection.

The researchers should pair the data based on research site. This
would reduce or eliminate the effect of soil pH on corrosion, which
would otherwise be confounded with the effectiveness of the
coatings.

HW 6.27 Suppose you are a participant in a project to study the
effectiveness of a new treatment for high cholesterol. The new
treatment will be compared to a current treatment by recording the
change in cholesterol readings over a 10-week treatment period. The
effectiveness of the treatment may depend on each participant’s age,
body fat percentage, diet, and general health. The study will involve
at most 30 participants because of cost considerations. How would
you decide which method, paired or independent samples, would be
more efficient in evaluating the change in cholesterol readings?

If there is a large difference in the participants with respect to age,
body fat percentage, diet, and general health and if the pairing results
in a strong positive correlation in the responses from paired
participants, then the paired procedure would be more effective. If
the participants are quite similar in the desired characteristics prior
to the beginning of the study, then the independent samples
procedure would yield a test statistic having twice as many degrees of
freedom as the paired procedure and hence would be more powerful.



Why use a paired t-test?

For example, let’s assume that “before” and “after” represent test
scores, and there was an intervention in between them. If the before
and after scores in each row of the example worksheet represent the
same subject, it makes sense to calculate the difference between the
scores in this fashion—the paired t-test is appropriate. However, if
the scores in each row are for different subjects, it doesn’t make
sense to calculate the difference. In this case, you’d need to use
another test, such as the 2-sample t-test.

The paired t-test calculates the difference within each before-and-
after pair of measurements, determines the mean of these changes,
and reports whether this mean of the differences is statistically
significant. A paired t-test can be more powerful than a 2-sample t-
test because the latter includes additional variation occurring from
the independence of the observations.
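To make the comparison concrete, the sketch below computes the paired t statistic and, for contrast, the pooled 2-sample t statistic on the same numbers. The before/after scores are invented for illustration and the function names are not from any course software; critical values would still come from a t table.

```python
import math
from statistics import mean, stdev, variance


def paired_t(before, after):
    """Paired t statistic: mean of the differences over its standard error; df = n - 1."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n)), n - 1


def two_sample_t(x, y):
    """Pooled 2-sample t statistic (y minus x); df = n1 + n2 - 2."""
    n1, n2 = len(x), len(y)
    sp = math.sqrt(((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2))
    return (mean(y) - mean(x)) / (sp * math.sqrt(1.0 / n1 + 1.0 / n2)), n1 + n2 - 2


# Hypothetical test scores for 4 subjects, each measured before and after.
before = [10, 20, 30, 40]
after = [12, 21, 33, 42]
print(paired_t(before, after))      # large t: subject-to-subject spread cancels out
print(two_sample_t(before, after))  # small t: that spread stays in the standard error
```

In this made-up data the subjects differ a lot from each other but each improves by a similar amount, so only the paired analysis picks up the change.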
