
FECO Note 1 - Review of Statistics

Xuan Chinh Mai

January 6, 2018

Contents

1 Review of Probability
  1.1 General Formulas
  1.2 Computing Probabilities Involving Normal Random Variables

2 Random Sampling and i.i.d. Random Variables

3 Estimation
  3.1 Estimators and Their Properties
  3.2 Estimation of the Population Mean
  3.3 Estimation of the Population Variance

4 The Sampling Distribution of the Sample Average
  4.1 The Mean and Variance of the Sample Mean
  4.2 Large-Sample Approximations to Sampling Distributions
  4.3 The Standard Error of Ȳ

5 Hypothesis Tests Concerning the Population Mean
  5.1 Null and Alternative Hypotheses
  5.2 The p-Value
  5.3 The t-Statistic
  5.4 Confidence Intervals for the Population Mean
  5.5 The Process of Hypothesis Testing

1 Review of Probability

1.1 General Formulas


1. Marginal Probability Distribution: P(Y = y) = Σ_{i=1}^{l} P(X = xi, Y = y)

2. Conditional Distribution: P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)

3. Conditional Expectation: E(Y | X = x) = Σ_{i=1}^{k} yi P(Y = yi | X = x)

4. The Law of Iterated Expectations: E(Y) = Σ_{i=1}^{l} E(Y | X = xi) P(X = xi) = E[E(Y | X)]

5. Conditional Variance: Var(Y | X = x) = Σ_{i=1}^{k} [yi − E(Y | X = x)]² P(Y = yi | X = x)

6. Independence: P(Y = y | X = x) = P(Y = y)

7. Covariance and correlation:

   Cov(X, Y) = σXY = E[(X − µX)(Y − µY)] = Σ_{i=1}^{k} Σ_{j=1}^{l} (xj − µX)(yi − µY) P(X = xj, Y = yi)

   Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)) = σXY / (σX σY)

8. The Mean and Variance of Sums of Random Variables:

   • E(a + bX + cY) = a + bµX + cµY
   • Var(a + bY) = b²σY²
   • Var(aX + bY) = a²σX² + 2abσXY + b²σY²
   • E(Y²) = σY² + µY²
   • Cov(a + bX + cV, Y) = bσXY + cσVY
   • E(XY) = σXY + µX µY
   • |Corr(X, Y)| ≤ 1 and |σXY| ≤ √(σX² σY²)
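
To make these formulas concrete, the following short Python sketch computes the marginal distributions, covariance, and correlation of two discrete random variables directly from a joint probability table; the supports and table entries are made up purely for illustration.

    import numpy as np

    # Hypothetical joint distribution P(X = x_j, Y = y_i): rows index y, columns index x.
    x = np.array([0.0, 1.0])            # support of X
    y = np.array([1.0, 2.0, 3.0])       # support of Y
    p_xy = np.array([[0.10, 0.15],      # P(Y = 1, X = 0), P(Y = 1, X = 1)
                     [0.20, 0.25],      # P(Y = 2, X = 0), P(Y = 2, X = 1)
                     [0.25, 0.05]])     # P(Y = 3, X = 0), P(Y = 3, X = 1)

    p_x = p_xy.sum(axis=0)              # marginal distribution of X (formula 1)
    p_y = p_xy.sum(axis=1)              # marginal distribution of Y

    mu_x = np.sum(x * p_x)              # E(X)
    mu_y = np.sum(y * p_y)              # E(Y)

    # Cov(X, Y) = sum_i sum_j (x_j - mu_x)(y_i - mu_y) P(X = x_j, Y = y_i)   (formula 7)
    cov_xy = np.sum(np.outer(y - mu_y, x - mu_x) * p_xy)

    var_x = np.sum((x - mu_x) ** 2 * p_x)
    var_y = np.sum((y - mu_y) ** 2 * p_y)
    corr_xy = cov_xy / np.sqrt(var_x * var_y)

    print(mu_x, mu_y, cov_xy, corr_xy)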

1.2 Computing Probabilities Involving Normal Random Variables

Suppose Y is normally distributed with mean µ and variance σ²; in other words, Y ∼ N(µ, σ²). Then Y is standardized by subtracting its mean and dividing by its standard deviation, that is, by computing Z = (Y − µ)/σ. Let c1 and c2 denote two numbers with c1 < c2 and let d1 = (c1 − µ)/σ and d2 = (c2 − µ)/σ. Then

P(Y ≤ c2) = P(Z ≤ d2) = Φ(d2),
P(Y ≥ c1) = P(Z ≥ d1) = 1 − Φ(d1),
P(c1 ≤ Y ≤ c2) = P(d1 ≤ Z ≤ d2) = Φ(d2) − Φ(d1),

where Φ is the cumulative distribution function of the standard normal distribution N(0, 1).
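
These probabilities can be evaluated numerically with the standard normal CDF. Below is a minimal Python sketch; the particular values of µ, σ, c1, and c2 are made up for illustration.

    from scipy.stats import norm

    mu, sigma = 2.0, 3.0          # Y ~ N(mu, sigma^2); illustrative values
    c1, c2 = 1.0, 5.0             # cutoffs with c1 < c2

    d1 = (c1 - mu) / sigma        # standardize the cutoffs
    d2 = (c2 - mu) / sigma

    p_below_c2 = norm.cdf(d2)                # P(Y <= c2) = Phi(d2)
    p_above_c1 = 1 - norm.cdf(d1)            # P(Y >= c1) = 1 - Phi(d1)
    p_between = norm.cdf(d2) - norm.cdf(d1)  # P(c1 <= Y <= c2)

    print(p_below_c2, p_above_c1, p_between)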

2 Random Sampling and i.i.d. Random Variables

In a simple random sample, n objects are drawn at random from a population and each
object is equally likely to be drawn. The value of the random variable Y for the ith randomly
drawn object is denoted Yi .

Because each object is equally likely to be drawn and the distribution of Yi is the same
for all i, the random variables Y1 , ..., Yn are independently and identically distributed
(i.i.d.); that is, the distribution of Yi is the same for all i = 1, ..., n and Y1 is distributed
independently of Y2 , ..., Yn and so forth.

3 Estimation

Estimation entails computing a “best guess” numerical value for an unknown characteristic
of a population distribution, such as its mean, from a sample of data.

3.1 Estimators and Their Properties

An estimator is a function of a sample of data to be drawn randomly from a population.


An estimate is the numerical value of the estimator when it is actually computed using
data from a specific sample. An estimator is a random variable because of randomness in
selecting the sample, while an estimate is a nonrandom number.

In general, it is preferable that an estimator gets as close as possible to the unknown true
value, at least in some average sense; in other words, we would like the sampling distribution
of an estimator to be as tightly centered on the unknown value as possible. This observation
leads to three specific desirable characteristics of an estimator: unbiasedness, consistency,
and efficiency.

Let µ̂Y denote some estimator of µY. The desirable characteristics of a good estimator are:

Unbiasedness: The bias of µ̂Y is E(µ̂Y) − µY, where E(µ̂Y) is the mean of the sampling distribution of µ̂Y. The estimator µ̂Y is unbiased if its bias is zero, that is, if E(µ̂Y) = µY; otherwise, µ̂Y is biased. Unbiasedness means that, on average, the estimator gives the right answer. Hence, µ̂Y is an unbiased estimator of µY if E(µ̂Y) = µY.

Consistency: Another desirable property of an estimator µ̂Y is that, when the sample size is large, the uncertainty about the value of µY arising from random variation in the sample is very small. Stated more precisely, a desirable property of µ̂Y is that the probability that it is within a small interval of the true value µY approaches 1 as the sample size increases, that is, µ̂Y is consistent for µY. Hence, µ̂Y is a consistent estimator of µY if µ̂Y converges in probability to µY, written µ̂Y →ᵖ µY.
Efficiency: If µ̂Y has a smaller variance than another estimator µ̃Y, then µ̂Y is said to be more efficient than µ̃Y. This means that µ̂Y uses the information in the data more efficiently than µ̃Y does. Hence, µ̂Y is said to be more efficient than µ̃Y if Var(µ̂Y) < Var(µ̃Y).
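
To make these properties concrete, the sketch below compares two unbiased estimators of µY by simulation: the sample mean Ȳ and the single observation Y1. Both are centered on µY (unbiasedness), but Ȳ has a far smaller variance (efficiency), and that variance shrinks as n grows (consistency). The population, sample size, and number of replications are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    mu_y, sigma_y = 5.0, 2.0      # illustrative population parameters
    n, n_reps = 100, 10_000       # sample size and number of simulated samples

    samples = rng.normal(mu_y, sigma_y, size=(n_reps, n))

    y_bar = samples.mean(axis=1)  # estimator 1: the sample mean
    y_1 = samples[:, 0]           # estimator 2: the first observation only

    # Both estimators are centered on mu_y (unbiasedness) ...
    print(y_bar.mean(), y_1.mean())
    # ... but the sample mean has a far smaller variance (efficiency),
    # roughly sigma_y^2 / n versus sigma_y^2.
    print(y_bar.var(), y_1.var())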

3.2 Estimation of the Population Mean

A natural way to estimate the mean value of Y (that is, µY ) in a population is to compute
the sample average Y from a sample of n independently and identically distributed (i.i.d.)
observations, Y1 , ..., Yn . This section discusses estimation of µY and the properties of Y as
an estimator of µY .

The sample average or sample mean, Ȳ, of the n observations Y1, ..., Yn is

Ȳ = (1/n)(Y1 + Y2 + ... + Yn) = (1/n) Σ_{i=1}^{n} Yi

The sampling distribution of Y is examined in the next section. The analysis shows that Y
is an unbiased and consistent estimator of µY . Moreover, Y is the most efficient estimator
of all unbiased estimators that are weighted averages of Y1, ..., Yn. Said differently, Y is the
Best Linear Unbiased Estimator (BLUE); that is, it is the most efficient (best) estimator
among all estimators that are unbiased and are linear functions of Y1 , ..., Yn .

The assumption of random sampling is important: nonrandom sampling can make Ȳ a biased estimator, because the assumption that Y1, ..., Yn are i.i.d. draws no longer holds. Therefore, it is important to design sample selection schemes in a way that minimizes bias.

3.3 Estimation of the Population Variance

The sample variance sY² is an estimator of the population variance σY², and the sample standard deviation sY is an estimator of the population standard deviation σY.

The sample variance, sY², is

sY² = (1/(n − 1)) Σ_{i=1}^{n} (Yi − Ȳ)²

Dividing by n − 1 in the equation above instead of n is called a degrees of freedom correction:


Estimating the mean uses up some of the information - that is, uses up 1 “degree of freedom”
- in the data, so that only n − 1 degrees of freedom remain.

The sample standard deviation, sY, is the square root of the sample variance:

sY = √[ (1/(n − 1)) Σ_{i=1}^{n} (Yi − Ȳ)² ]

The sample variance is an unbiased and consistent estimator of the population variance: E(sY²) = σY² and sY² →ᵖ σY². In other words, the sample variance is close to the population variance with high probability when n is large. A similar conclusion also holds for the sample standard deviation.
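
A quick numerical illustration of the degrees-of-freedom correction: the sketch below computes the sample variance with the n − 1 divisor (numpy's ddof=1) and contrasts it with the n divisor, using a simulated population chosen only for the example.

    import numpy as np

    rng = np.random.default_rng(1)
    sigma_y = 2.0                                             # true population std. dev. (illustrative)
    samples = rng.normal(0.0, sigma_y, size=(100_000, 10))    # many small samples of n = 10

    var_n_minus_1 = samples.var(axis=1, ddof=1)   # divides by n - 1 (the sample variance s_Y^2)
    var_n = samples.var(axis=1, ddof=0)           # divides by n (biased downward)

    # Averaging over many samples: the ddof=1 version is centered on sigma_y^2 = 4,
    # while the n-divisor version is centered on (n - 1)/n * sigma_y^2 = 3.6.
    print(var_n_minus_1.mean(), var_n.mean())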

4 The Sampling Distribution of the Sample Average

4.1 The Mean and Variance of the Sample Mean

An essential concept is that the act of drawing a random sample has the effect of making
the sample average Y a random variable. Because the sample was drawn at random, the
value of each Yi is random. Because Y1 , ..., Yn are random, their average is random. Had
a different sample been drawn, then the observations and their sample average would have
been different: The value of Y differs from one randomly drawn sample to the next.

Because Y is random, it has a probability distribution. The distribution of Y is called


the sampling distribution of Y because it is the probability distribution associated with
possible values of Y that could be computed for different possible samples Y1 , ..., Yn . The
sampling distribution of averages and weighted averages plays a central role in statistics and
econometrics. The sampling distribution of Ȳ has the following characteristics:

Mean and variance of Ȳ:

E(Ȳ) = (1/n) Σ_{i=1}^{n} E(Yi) = µY

and

Var(Ȳ) = Var( (1/n) Σ_{i=1}^{n} Yi ) = (1/n²) Σ_{i=1}^{n} Var(Yi) + (1/n²) Σ_{i=1}^{n} Σ_{j≠i} Cov(Yi, Yj) = σY²/n

If Y1, ..., Yn are i.i.d. draws from the N(µY, σY²) distribution, Ȳ is distributed N(µY, σY²/n).

Sampling distributions are important in statistics because they provide a major simplification
on the route to statistical inference. More specifically, they allow analytical considerations
to be based on the sampling distribution of a statistic, rather than on the joint probability
distribution of all the individual sample values.
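
These two formulas can be checked by a small Monte Carlo experiment: repeatedly draw samples of size n, compute Ȳ for each, and compare the simulated mean and variance of Ȳ with µY and σY²/n. The population and sample size below are illustrative.

    import numpy as np

    rng = np.random.default_rng(2)
    mu_y, sigma_y, n = 10.0, 3.0, 25      # illustrative population and sample size
    n_reps = 200_000                      # number of simulated samples

    y_bar = rng.normal(mu_y, sigma_y, size=(n_reps, n)).mean(axis=1)

    print(y_bar.mean(), mu_y)             # E(Y_bar) should be close to mu_y
    print(y_bar.var(), sigma_y**2 / n)    # Var(Y_bar) should be close to sigma_y^2 / n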

4.2 Large-Sample Approximations to Sampling Distributions

Sampling distributions play a central role in the development of statistical and econometric
procedures, so it is important to know, in a mathematical sense, what the sampling distri-
bution of Y is. There are two approaches to characterizing sampling distributions: an exact
approach and an approximate approach.

The “exact” approach entails deriving a formula for the sampling distribution that holds
exactly for any value of n. The sampling distribution that exactly describes the distribution
of Y for any n is called the exact distribution or finite-sample distribution of Y . If
the distribution of Y is not normal, then in general the exact sampling distribution of Y is
very complicated and depends on the distribution of Y .

The “approximate” approach uses approximations to the sampling distribution that rely on
the sample size being large. The large-sample approximation to the sampling distribution
is often called the asymptotic distribution - “asymptotic” because the approximations
become exact in the limit that n → ∞. These approximations can be very accurate even
if the sample size is only n = 30 observations. Because sample sizes used in practice in
econometrics typically number in the hundreds or thousands, these asymptotic distributions
can be counted on to provide very good approximations to the exact sampling distribution.

This section presents the two key tools used to approximate sampling distributions when
the sample size is large: the law of large numbers and the central limit theorem. They are
built on the concept of convergence in probability. The sample average Ȳ converges in probability to µY (or, equivalently, Ȳ is consistent for µY) if the probability that Ȳ is in the range (µY − c) to (µY + c) becomes arbitrarily close to 1 as n increases, for any constant c > 0. The convergence of Ȳ to µY in probability is written Ȳ →ᵖ µY. With this background, we have:

The Law of Large Numbers: If Yi, i = 1, ..., n are independently and identically distributed with E(Yi) = µY and if large outliers are unlikely (technically, if Var(Yi) = σY² < ∞), then Ȳ →ᵖ µY.

The Law of Large Numbers states that, under general conditions, Ȳ will be near µY with very high probability when n is large. When a large number of random variables with the same mean are averaged together, the large values balance the small values and their sample average is close to their common mean. Moreover, under these conditions, Ȳ converges in probability to µY or, equivalently, Ȳ is consistent for µY.

The Central Limit Theorem: Suppose that Y1, ..., Yn are i.i.d. with E(Yi) = µY and Var(Yi) = σY², where 0 < σY² < ∞. As n → ∞, the distribution of (Ȳ − µY)/σȲ (where σȲ² = σY²/n) becomes arbitrarily well approximated by the standard normal distribution.

The Central Limit Theorem says that, under general conditions, the distribution of Ȳ is well approximated by a normal distribution when n is large. Moreover, the distribution of Ȳ is exactly N(µY, σY²/n) when the sample is drawn from a population with the normal distribution N(µY, σY²). This result is approximately true when n is large even if Y1, ..., Yn are not themselves normally distributed.

The quality of the normal approximation depends on the distribution of the underlying Yi
that make up the average. At one extreme, if the Yi are themselves normally distributed, then
Y is exactly normally distributed for all n. In contrast, when the underlying Yi themselves
have a distribution that is far from normal then this approximation can require n = 30 or
even more.
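
The following sketch illustrates the central limit theorem by simulation, using a strongly skewed exponential population chosen purely for illustration: as n grows, the tail probabilities of the standardized sample mean approach those of the standard normal distribution.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(3)
    mu_y, sigma_y = 1.0, 1.0            # mean and std. dev. of an Exponential(1) population
    n_reps = 100_000

    for n in (2, 10, 100):
        samples = rng.exponential(scale=1.0, size=(n_reps, n))
        z = (samples.mean(axis=1) - mu_y) / (sigma_y / np.sqrt(n))   # standardized sample mean
        # Compare the simulated P(Z <= 1.96) with the standard normal value Phi(1.96) ~ 0.975.
        print(n, (z <= 1.96).mean(), norm.cdf(1.96))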

4.3 The Standard Error of Ȳ

The standard error of Ȳ is an estimator of the standard deviation of Ȳ. The standard error of Ȳ is denoted SE(Ȳ) or σ̂Ȳ. When Y1, ..., Yn are i.i.d.,

SE(Ȳ) = σ̂Ȳ = sY / √n

5 Hypothesis Tests Concerning the Population Mean

Many hypotheses about the world around us can be phrased as yes/no questions. The
statistical challenge is to answer these questions based on a sample of evidence.

5.1 Null and Alternative Hypotheses

The starting point of statistical hypothesis testing is specifying the hypothesis to be tested, called the null hypothesis. Hypothesis testing entails using data to compare the null hypothesis to an alternative hypothesis that holds if the null does not. The rule of thumb for specifying hypotheses is that the claim one hopes to establish should be stated as the alternative hypothesis, so that strong sample evidence is required to support it; the null hypothesis is then stated in the opposite direction of what needs to be proven.

For the hypothesis tests concerning the population mean, the null hypothesis is that the
population mean, E(Y ), takes on a specific value, denoted µY,0 . The null hypothesis is
denoted H0 and thus is
H0 : E(Y ) = µY,0

The alternative hypothesis specifies what is true if the null hypothesis is not. The most general alternative hypothesis is that E(Y) ≠ µY,0, which is called a two-sided alternative hypothesis as it allows E(Y) to be either less than or greater than µY,0. The two-sided alternative is written as

H1 : E(Y) ≠ µY,0

In some circumstances, the hypothesis of interest is not that the mean equals a certain value µY,0 but that it is greater (or smaller) than this value. This leads to a one-sided test, with the null hypothesis written as

H0 : E(Y) ≤ µY,0
or
H0 : E(Y) ≥ µY,0

Accordingly, the one-sided alternative hypothesis is constructed in the opposite direction from that of the null:

H1 : E(Y) > µY,0
or
H1 : E(Y) < µY,0

The problem facing the statistician is to use the evidence in a randomly selected sample
of data to decide whether to accept the null hypothesis H0 or to reject it in favor of the
alternative hypothesis H1 . If the null hypothesis is “accepted”, this doesn’t necessarily
mean that the statistician declares it to be true; rather, it is accepted tentatively with the
recognition that it might be rejected later based on additional evidence. For this reason,
statistical hypothesis testing can be posed as either rejecting the null hypothesis or failing
to do so.

5.2 The p-Value

In any given sample, the sample average Y will rarely be exactly equal to the hypothesized
value µY,0 . Differences between Y and µY,0 can arise due to the following reasons:

1. the true mean in fact does not equal µY,0 (the null hypothesis is false)

2. the true mean equals µY,0 (the null hypothesis is true) but Y differs from µY,0 because
of random sampling

It is impossible to distinguish between these two possibilities with certainty. Although a


sample of data cannot provide conclusive evidence about the null hypothesis, it is possible
to do a probabilistic calculation that permits testing the null hypothesis in a way that
accounts for sampling uncertainty. This calculation involves using the data to compute the
p-value of the null hypothesis.

p-value: The p-value, also called the significance probability, is the probability of drawing a statistic at least as adverse to the null hypothesis as the one that is actually computed in the sample, assuming the null hypothesis is correct.

In the case at hand, the p-value is the probability of drawing Ȳ at least as far in the tails of its distribution under the null hypothesis as the sample average actually computed. To state the definition of the p-value mathematically, let Ȳ^act denote the value of the sample average actually computed in the data set at hand and let PH0 denote the probability computed under the null hypothesis (that is, computed assuming that E(Yi) = µY,0). The p-value is

p-value = PH0[ |Ȳ − µY,0| > |Ȳ^act − µY,0| ]

When σȲ is known, the formula for the p-value is

p-value = PH0( |(Ȳ − µY,0)/σȲ| > |(Ȳ^act − µY,0)/σȲ| ) = 2Φ( −|Ȳ^act − µY,0| / σȲ )

This formula for the p-value depends on the variance of the population distribution, σY². In practice, this variance is typically unknown. However, because sY² is a consistent estimator of σY², the p-value can be computed by replacing σȲ in the equation above with the standard error, SE(Ȳ) = σ̂Ȳ. That is, when σY is unknown and Y1, ..., Yn are i.i.d., the p-value is calculated using the formula

p-value = 2Φ( −|Ȳ^act − µY,0| / SE(Ȳ) )

5.3 The t-Statistic


The standardized sample average (Ȳ − µY,0)/SE(Ȳ) plays a central role in testing statistical hypotheses and has a special name, the t-statistic or t-ratio:

t = (Ȳ − µY,0) / SE(Ȳ)

In general, a test statistic is a statistic used to perform a hypothesis test. The t-statistic
is an important example of a test statistic.

When n is large, sY² is close to σY² with high probability. Thus the distribution of the t-statistic is approximately the same as the distribution of (Ȳ − µY,0)/σȲ, which in turn is well approximated by the standard normal distribution when n is large because of the central limit theorem. Hence, under the null hypothesis, t is approximately distributed N(0, 1) for large n. Accordingly, when n is large, the formula for the p-value can be rewritten in terms of the t-statistic:

p-value = 2Φ(−|t^act|),  where  t^act = (Ȳ^act − µY,0) / SE(Ȳ)

For one-sided tests, the general approach to computing p-values is similar to that for the two-sided alternative, with the modification that only large positive (or only large negative) values of the t-statistic reject the null hypothesis, rather than values that are large in absolute value. Specifically, the p-values based on the N(0, 1) approximation to the distribution of the t-statistic are

p-value = PH0(t > t^act) = 1 − Φ(t^act)

for the alternative E(Y) > µY,0, and

p-value = PH0(t < t^act) = Φ(t^act)

for the alternative E(Y) < µY,0.
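
A minimal sketch of these formulas in Python, computing the t-statistic and the two-sided and one-sided p-values from a sample; the data below are simulated only for the example.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(4)
    y = rng.normal(10.5, 3.0, size=200)   # illustrative sample of n = 200 observations

    mu_0 = 10.0                           # hypothesized value mu_{Y,0} under H0
    se = y.std(ddof=1) / np.sqrt(len(y))  # SE(Y_bar) = s_Y / sqrt(n)
    t_act = (y.mean() - mu_0) / se        # t-statistic

    p_two_sided = 2 * norm.cdf(-abs(t_act))   # H1: E(Y) != mu_0
    p_greater = 1 - norm.cdf(t_act)           # H1: E(Y) >  mu_0
    p_smaller = norm.cdf(t_act)               # H1: E(Y) <  mu_0

    print(t_act, p_two_sided, p_greater, p_smaller)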

5.4 Confidence Intervals for the Population Mean

The set of values that contains the true population mean µY with a certain prespecified prob-
ability is called the confidence set, where the prespecified probability that µY is contained
in this set is called the confidence level. The confidence set for µY turns out to be all the
possible values of the mean between a lower and an upper limit, so that the confidence set
is an interval, called a confidence interval.

An a% two-sided confidence interval for µY is an interval constructed so that it contains the true value of µY in a% of all possible random samples. When the sample size n is large, the 90%, 95%, and 99% confidence intervals for µY are
90% confidence interval for µY = {Ȳ ± 1.64 SE(Ȳ)}
95% confidence interval for µY = {Ȳ ± 1.96 SE(Ȳ)}
99% confidence interval for µY = {Ȳ ± 2.58 SE(Ȳ)}
The coverage probability of a confidence interval for the population mean is the prob-
ability, computed over all possible random samples, that it contains the true population
mean.

This section has so far focused on two-sided confidence intervals. One could instead construct a one-sided confidence interval as the set of values of µY that cannot be rejected by a one-sided hypothesis test. Although one-sided confidence intervals have applications in some branches of statistics, they are uncommon in applied econometric analysis.
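
A short sketch of the large-sample confidence intervals for µY; the sample is simulated only for illustration.

    import numpy as np

    rng = np.random.default_rng(5)
    y = rng.normal(10.5, 3.0, size=200)   # illustrative sample

    y_bar = y.mean()
    se = y.std(ddof=1) / np.sqrt(len(y))  # SE(Y_bar)

    ci_90 = (y_bar - 1.64 * se, y_bar + 1.64 * se)
    ci_95 = (y_bar - 1.96 * se, y_bar + 1.96 * se)
    ci_99 = (y_bar - 2.58 * se, y_bar + 2.58 * se)

    print(ci_90, ci_95, ci_99)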

5.5 The Process of Hypothesis Testing

The process of testing the two-sided null hypothesis E(Y) = µY,0 against the two-sided alternative hypothesis E(Y) ≠ µY,0 includes the following steps:

1. Compute the standard error of Ȳ, SE(Ȳ), using the equation

   SE(Ȳ) = σ̂Ȳ = sY / √n

2. Compute the t-statistic using the equation

   t^act = (Ȳ^act − µY,0) / SE(Ȳ)

3. Compute the p-value using the equation

   p-value = 2Φ(−|t^act|)

   Reject the null hypothesis at the 5% significance level if the p-value is less than 0.05 (equivalently, if |t^act| > 1.96).

Similarly, one-sided tests follow the same steps but with different formulas. The formula for calculating the p-value in step 3 is replaced by p-value = 1 − Φ(t^act) or p-value = Φ(t^act), depending on the direction of the test. The critical values for the t-statistic are also different for one-sided tests: at the 5% significance level, the critical values for the one-sided greater and smaller alternatives are 1.64 and −1.64, respectively.

A statistical hypothesis test can make two types of mistakes:

Type I error (false positive): the type of mistake where the null hypothesis is rejected
when in fact it is true

Type II error (false negative): the type of mistake where the null hypothesis is not re-
jected when in fact it is false.

The prespecified rejection probability of a statistical hypothesis test when the null hypothesis
is true - that is, the prespecified probability of a type I error - is the significance level of
the test.

The critical value of the test statistic is the value of the statistic for which the test just
rejects the null hypothesis at the given significance level.

The set of values of the test statistic for which the test rejects the null hypothesis is the
rejection region, and the values of the test statistic for which it does not reject the null
hypothesis is the acceptance region.

The probability that the test actually incorrectly rejects the null hypothesis when it is true
is the size of the test, and the probability that the test correctly rejects the null hypothesis
when the alternative is true is the power of the test.
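
The size and power of the two-sided test can be estimated by simulation, as in the sketch below; the population parameters, sample size, and alternative mean are illustrative. Under the null the test should reject about 5% of the time, and under the alternative it should reject more often.

    import numpy as np

    rng = np.random.default_rng(6)
    mu_0, sigma_y, n, n_reps = 10.0, 3.0, 100, 20_000

    def reject_rate(true_mean):
        """Fraction of simulated samples in which |t| > 1.96 when E(Y) = true_mean."""
        samples = rng.normal(true_mean, sigma_y, size=(n_reps, n))
        se = samples.std(axis=1, ddof=1) / np.sqrt(n)
        t = (samples.mean(axis=1) - mu_0) / se
        return (np.abs(t) > 1.96).mean()

    print(reject_rate(10.0))   # size: should be close to the 5% significance level
    print(reject_rate(10.8))   # power against the alternative E(Y) = 10.8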

The p-value is the smallest significance level at which the null hypothesis could be rejected. Being conservative, in the sense of using a very low significance level, has a cost: The smaller the significance level, the larger the critical value and the more difficult it becomes to reject the null when the null is false. That is, the lower the significance level, the lower the power of the test. Many economic and policy applications call for less conservatism than a legal case does, so a 5% significance level is often considered a reasonable compromise.

