Tsagris Michael
BSc in Statistics
Email: mtsagris@yahoo.gr
Statistics Using Excel Tsagris Michael
Contents
1.1 Introduction
2.1 Data Analysis
2.2 Descriptive Statistics
2.3 Z-test for two samples
2.4 t-test for two samples assuming unequal variances
2.5 t-test for two samples assuming equal variances
2.6 F-test for the equality of variances
2.7 Paired t-test for two samples
2.8 Ranks, Percentiles, Sampling, Random Numbers Generation
2.9 Covariance, Correlation, Linear Regression
2.10 One-way Analysis of Variance
2.11 Two-way Analysis of Variance with replication
2.12 Two-way Analysis of Variance without replication
3.1 Statistical Functions
3.2 Spearman’s (non-parametric) correlation coefficient
3.3 Wilcoxon Signed Rank Test for a Median
3.4 Wilcoxon Signed Rank Test with Paired Data
1.1 Introduction
These notes were written to help students, and others, perform some
statistical analyses without having to use statistical software such as Splus,
SPSS or Minitab. Excel cannot be expected to offer as many analysis options as
dedicated statistical packages, but it is at a good level nonetheless. The areas
covered by these notes are: descriptive statistics, z-test for two samples,
t-test for two samples assuming (un)equal variances, paired t-test for two
samples, F-test for the equality of variances of two samples, ranks and
percentiles, sampling (random and periodic, or systematic), random number
generation, Pearson’s correlation coefficient, covariance, linear regression,
one-way ANOVA, two-way ANOVA with and without replication, and the moving
average. We will also demonstrate the use of non-parametric statistics in Excel
for some of the previously mentioned techniques. Furthermore, the results
provided by Excel will be compared informally with those provided by SPSS and
some other packages, to check for any discrepancies. It is worth mentioning,
before going through these notes, that they do not contain the theory underlying
the techniques used. These notes show how to cope with statistics using Excel.
Picture 1
Picture 2
Picture 3
Picture 4
Column1
Mean 194.0418719
Standard Error 5.221297644
Median 148.5
Mode 97
Standard Deviation 105.2062324
Sample Variance 11068.35133
Kurtosis -0.79094723
Skewness 0.692125308
Range 451
Minimum 4
Maximum 455
Sum 78781
Count 406
Confidence Level(95.0%) 10.26422853
The results are essentially the same as those of SPSS, as they should be. There
are only some very slight differences due to rounding, which are of no
importance. The sample variances differ slightly, but this is not a problem. SPSS
calculates a 95% confidence interval for the true mean, whereas Excel provides only
the quantity used to calculate it (the half-width of the interval). Constructing
the interval is straightforward: subtract this quantity from the mean to get the
lower limit and add it to the mean to get the upper limit of the 95% confidence
interval.
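The construction of the interval can be written out as a small worked example. The sketch below is in Python rather than Excel, purely for illustration; the two numbers are copied from the descriptive statistics output above, and the variable names are our own.

```python
# Values copied from the descriptive statistics output above.
mean = 194.0418719
margin = 10.26422853  # the "Confidence Level(95.0%)" quantity reported by Excel

# Lower and upper limits of the 95% confidence interval for the mean.
lower = mean - margin
upper = mean + margin

print(f"95% confidence interval for the mean: ({lower:.4f}, {upper:.4f})")
```

In a worksheet the same thing is two cell formulas: the mean cell minus the confidence cell, and the mean cell plus the confidence cell.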
Picture 5
Table 2: Z-test
Picture 6
Picture 7
The results are the same as the ones provided by SPSS. It is worth noting
that the degrees of freedom (df) in this case are equal to 178, whereas in the
previous case they were equal to 96. The t-statistic is also slightly different.
The reason is that different formulae are used in the two cases.
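To illustrate why the degrees of freedom and the t-statistic differ between the two tests, here is a minimal Python sketch of the usual textbook formulae: the pooled-variance test (equal variances) and Welch's test (unequal variances). The function name and the data are hypothetical and are not the data used in these notes; the formulae are the standard ones, assumed rather than taken from Excel's documentation.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(x, y):
    """Return ((t, df) pooled, (t, df) Welch) for two independent samples."""
    n1, n2 = len(x), len(y)
    m1, m2 = mean(x), mean(y)
    v1, v2 = variance(x), variance(y)  # sample variances (divide by n - 1)

    # Equal variances: pooled variance, df = n1 + n2 - 2.
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t_pooled = (m1 - m2) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    df_pooled = n1 + n2 - 2

    # Unequal variances: Welch-Satterthwaite approximation for the df.
    se2 = v1 / n1 + v2 / n2
    t_welch = (m1 - m2) / sqrt(se2)
    df_welch = se2 ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )
    return (t_pooled, df_pooled), (t_welch, df_welch)


# Hypothetical samples with clearly different spreads and sizes.
x = [12.1, 14.3, 11.8, 13.5, 12.9, 15.2]
y = [10.4, 18.9, 9.7, 21.3, 8.8, 19.5, 11.1, 20.2, 9.9]
pooled, welch = two_sample_t(x, y)
print("pooled:", pooled)
print("welch: ", welch)
```

The Welch degrees of freedom never exceed n1 + n2 - 2, which is consistent with the smaller value (96 against 178) seen in the unequal-variances output.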
Picture 8
Picture 9
Picture 8
Picture 9
If you are interested in a random sample from a known distribution, then
Random Number Generation is the option to use. Unfortunately, not many
distributions are offered. The window of this option is shown in picture 10. In
Number of Variables you select how many samples are to be drawn from the
specified distribution. The white box below it is used to define the sample size.
The distributions offered are Uniform, Normal, Bernoulli, Binomial, and Poisson.
Two more options are also available. Different distributions require different
parameters to be defined. The random seed is an option used to give the sampling
algorithm a starting value, but it can be left blank as well.
Picture 10
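The roles of the number of variables, the sample size and the random seed can be mimicked outside Excel. The following is a minimal Python sketch, assuming a normal distribution; the function name and its parameters are our own, and Python's generator will not reproduce the numbers Excel draws for the same seed.

```python
import random


def generate_normal(n_variables, n_numbers, mu=0.0, sigma=1.0, seed=None):
    """Draw n_variables samples (columns), each of size n_numbers.

    seed plays the role of Excel's Random Seed box: leave it as None for a
    different sample every time, or fix it to get a reproducible sample.
    """
    rng = random.Random(seed)
    return [
        [rng.gauss(mu, sigma) for _ in range(n_numbers)]
        for _ in range(n_variables)
    ]


# Two columns of five normal values each, reproducible because of the seed.
columns = generate_normal(n_variables=2, n_numbers=5, mu=100, sigma=15, seed=1)
for column in columns:
    print([round(value, 2) for value in column])
```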
Picture 11
            Column 1    Column 2
Column 1    1.113367
Column 2    0.531949    7.972812
Table 7: Covariance
The above table is called the variance-covariance table, since it contains
both of these measures. The first cell (1.113367) is the variance of the first
column and the last cell (7.972812) is the variance of the second column. The
remaining cell (0.531949) is the covariance of the two columns. The upper-right
cell is left blank because its value would simply repeat the covariance: the
table is symmetric, with the variances on the diagonal and the covariance in the
off-diagonal cells.
The window of the linear regression option is presented in picture 12 (different
normal data were used in the regression analysis). We fill the white boxes with
the columns that hold the Y and X values. The X values can span more than one
column (i.e. variable). We select the confidence interval option. We also select
the Line Fit Plots and Normal Probability Plots. Then, by pressing OK, the result
appears in table 8.
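Returning to the variance-covariance table described above, its structure can be sketched in a few lines of Python. The function name and the data are hypothetical. The `ddof` parameter is our own addition: 0 divides by n (population formulas, which we assume is what the Excel tool uses) and 1 divides by n - 1 (sample formulas).

```python
def covariance_matrix(columns, ddof=0):
    """Variance-covariance matrix of equally long columns of numbers.

    Diagonal entries are variances, off-diagonal entries are covariances.
    ddof=0 divides by n (population), ddof=1 divides by n - 1 (sample).
    """
    n = len(columns[0])
    means = [sum(column) / n for column in columns]
    k = len(columns)
    return [
        [
            sum(
                (columns[i][t] - means[i]) * (columns[j][t] - means[j])
                for t in range(n)
            ) / (n - ddof)
            for j in range(k)
        ]
        for i in range(k)
    ]


# Hypothetical data, not the data used in these notes.
x = [1, 2, 3, 4]
y = [2, 4, 6, 9]
for row in covariance_matrix([x, y]):
    print(row)
```

The matrix is symmetric, which is why Excel prints only the lower triangle.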
Picture 12
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.875372
R Square             0.766276
Adjusted R Square    0.76328
Standard Error       23.06123
Observations         80

ANOVA
              df    SS          MS          F           Significance F
Regression    1     136001      136001      255.7274    2.46E-26
Residual      78    41481.97    531.8202
Total         79    177483

                Coefficients   Standard Error   t Stat      P-value     Lower 95%   Upper 95%
Intercept       -10.6715       8.963642         -1.19053    0.237449    -28.5167    7.173767
X Variable 1    0.043651       0.00273          15.99148    2.46E-26    0.038217    0.049085
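The quantities in the regression output are linked by simple arithmetic, which the following Python sketch reproduces from the sums of squares alone. Only numbers printed in the output above are used; the variable names are our own, and small differences in the last digits are due to the rounding of the printed values.

```python
from math import sqrt

# Copied from the ANOVA part of the output above.
ss_regression, ss_residual = 136001, 41481.97
df_regression, df_residual = 1, 78
n = 80  # Observations

ss_total = ss_regression + ss_residual
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

f_stat = ms_regression / ms_residual          # F
r_square = ss_regression / ss_total           # R Square
multiple_r = sqrt(r_square)                   # Multiple R
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / df_residual
standard_error = sqrt(ms_residual)            # Standard Error

print(f"F                 = {f_stat:.4f}")
print(f"R Square          = {r_square:.6f}")
print(f"Multiple R        = {multiple_r:.6f}")
print(f"Adjusted R Square = {adjusted_r_square:.5f}")
print(f"Standard Error    = {standard_error:.5f}")
```

With a single X variable the squared t Stat of the slope equals F, which is why both share the same p-value (2.46E-26).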
[Figure: line fit plot of Y and Predicted Y against X Variable 1]
[Figure: normal probability plot of Y against Sample Percentile]
The first figure is a scatter plot of the data: the X values against the Y
values and the predicted Y values. The linear relation between the two variables
is obvious from the graph. Recall that the correlation coefficient was high. The
Normal Probability Plot is used to check the normality of the residuals
graphically. If the residuals follow the normal distribution, then the graph
should be a straight line. Unfortunately, the eye is often not the best judge of
things. The Kolmogorov-Smirnov test conducted in SPSS provided evidence to
support the normality hypothesis of the residuals. Excel also produced the
residuals and the predicted values in the same sheet. We shall construct a
scatter plot of these two sets of values in order to check (graphically) the
assumption of homoscedasticity (i.e. constant variance of the residuals). If the
assumption of homoscedasticity of the residuals holds true, then we should see
all the values within a band. We see that almost all values fall between -40 and
40, except for two values, which are above 70 and 100 respectively. These values
are the so-called outliers. We can assume that the residuals exhibit constant
variance. If we are not certain about the validity of the assumption, we can
transform the Y values using a log transformation and run the regression using
the transformed Y values.
[Figure: scatter plot of Residuals against Predicted Values]
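The residual check described above can also be done numerically. The sketch below, in Python, fits a simple least-squares line, computes the predicted values and residuals, and lists the observations whose residuals fall outside a chosen band. The function names, the band limit and the data are all hypothetical; this is an illustration of the idea, not a formal test of constant variance.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for one X variable."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predictions_and_residuals(x, y):
    """Predicted Y values and residuals (observed minus predicted)."""
    intercept, slope = fit_line(x, y)
    predicted = [intercept + slope * a for a in x]
    residuals = [b - p for b, p in zip(y, predicted)]
    return predicted, residuals


def outside_band(residuals, limit):
    """Indices of residuals whose absolute value exceeds the band limit."""
    return [i for i, r in enumerate(residuals) if abs(r) > limit]


# Hypothetical data with one unusually large Y value at the end.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.8, 30.0]
predicted, residuals = predictions_and_residuals(x, y)
print("possible outliers at positions:", outside_band(residuals, limit=5))
```

Plotting `residuals` against `predicted` gives the same kind of picture as the Excel scatter plot.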
Picture 13
The results generated by SPSS are very close to the results shown above.
There is some difference in the sums of squares, but it is of small importance.
The mean square (MS) values are very close to one another. Yet by no means can we
assume that the above results hold true, since Excel does not offer options for
checking the assumptions.
In other words, the first combination of the two factors occupies the cells from
B2 to B26. This means that each combination of factors has 24 measurements.
Picture 14
Picture 15
We filled the two blank white boxes with the Input Range and the Rows per
sample. Alpha is at its usual value of 0.05. After pressing OK, the results are
presented in table 10. The results generated by SPSS are the same. At the bottom
of table 10 there are three p-values: two for the two factors and one for the
interaction. The row factor is denoted as Sample in Excel.
Picture 16
ANOVA
Source of Variation    SS          df    MS          F           P-value     F crit
Rows                   136.2222    2     68.11111    0.160534    0.856915    6.944272
Columns                94504.22    2     47252.11    111.3707    0.000311    6.944272
Error                  1697.111    4     424.2778
Total                  96337.56    8
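The way such a table is built can be sketched in Python. The function below uses the standard decomposition for a two-way layout with one observation per cell; its name, the dictionary keys and the data are hypothetical, and p-values and critical values are left out because they need the F distribution.

```python
def two_way_anova_no_replication(table):
    """Sums of squares, df, mean squares and F for a rows x columns table
    with a single observation in each cell."""
    r, c = len(table), len(table[0])
    grand_mean = sum(sum(row) for row in table) / (r * c)
    row_means = [sum(row) / c for row in table]
    col_means = [sum(table[i][j] for i in range(r)) / r for j in range(c)]

    ss_rows = c * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = r * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((v - grand_mean) ** 2 for row in table for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    df_rows, df_cols = r - 1, c - 1
    df_error = df_rows * df_cols
    ms_error = ss_error / df_error
    return {
        "ss_rows": ss_rows, "ss_cols": ss_cols,
        "ss_error": ss_error, "ss_total": ss_total,
        "df_rows": df_rows, "df_cols": df_cols, "df_error": df_error,
        "f_rows": (ss_rows / df_rows) / ms_error,
        "f_cols": (ss_cols / df_cols) / ms_error,
    }


# Hypothetical 3 x 3 table (same layout as the example, different data).
result = two_way_anova_no_replication([[1, 2, 3], [2, 4, 7], [3, 5, 9]])
for name, value in result.items():
    print(name, value)
```

In the table above the same relations hold: each MS is SS divided by df (for example 136.2222 / 2 = 68.11111), and each F is the factor's MS divided by the error MS (68.11111 / 424.2778 = 0.160534).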
• AVEDEV calculates the average of the absolute deviations of the data from
their mean.
• CHITEST returns the result of the test for independence: the p-value from the
chi-squared distribution for the test statistic and the appropriate degrees of
freedom.
• DEVSQ calculates the sum of squares of deviations of data points from their
sample mean. Deriving the variance from it is straightforward: divide by the
sample size, or by the sample size decreased by one to get the unbiased
estimator of the true variance. The standard deviation is the square root of
the result.
• FREQUENCY calculates how often values occur within a range of values and
then returns a vertical array of numbers having one more element than
Bins_array.
• FTEST returns the result of the two-tailed test that the variances of two data
sets are not significantly different.
• INTERCEPT calculates the point at which a line will intersect the y-axis.
• LINEST generates a line that best fits a data set by generating a two
dimensional array of values to describe the line.
• LOGEST generates a curve that best fits a data set by generating a two
dimensional array of values to describe the curve.
• MAX returns the largest value in a data set (ignores logical values and text).
• MAXA returns the largest value in a data set (does not ignore logical values
and text).
• MIN returns the smallest value in a data set (ignores logical values and text).
• MINA returns the smallest value in a data set (does not ignore logical values
and text).
• PEARSON returns a value that reflects the strength of the linear relationship
between two data sets.
• PROB calculates the probability that values in a range are between two limits
or equal to a lower limit.
• RANK calculates the rank of a number in a list of numbers: its size relative to
other values in the list.
• RSQ calculates the square of the Pearson correlation coefficient (also known
as the coefficient of determination in the case of linear regression).
• STDEVA estimates the standard deviation of a data set (which can include
text and true/false values) based on a sample of the data.
• STDEVPA calculates the standard deviation of a data set (which can include
text and true/false values).
• STEYX returns the standard error of the predicted y value for each x value in
the regression.
• VARA estimates the variance of a data set (which can include text and true/
false values) based on a sample of the data.
• VARPA calculates the variance of a data population, which can include text
and true/false values.
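Two of the functions in the list above, AVEDEV and DEVSQ, are simple enough to write out, which also shows how the variance and the standard deviation follow from DEVSQ. This is a Python sketch of the definitions given in the list, with lower-case function names of our own and hypothetical data; it does not reproduce how Excel treats text or logical values.

```python
from math import sqrt


def avedev(data):
    """Average of the absolute deviations of the data from their mean."""
    m = sum(data) / len(data)
    return sum(abs(v - m) for v in data) / len(data)


def devsq(data):
    """Sum of squared deviations of the data from their mean."""
    m = sum(data) / len(data)
    return sum((v - m) ** 2 for v in data)


# Hypothetical data.
data = [4, 5, 6, 7, 5, 4, 3]
n = len(data)

sample_variance = devsq(data) / (n - 1)   # unbiased estimator of the variance
population_variance = devsq(data) / n     # divides by the sample size
sample_sd = sqrt(sample_variance)

print("AVEDEV          :", avedev(data))
print("DEVSQ           :", devsq(data))
print("sample variance :", sample_variance)
print("sample st. dev. :", sample_sd)
```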
Column 1 contains the values, Rank contains the ranks of the values, Percent
contains the cumulative percentage of the values (the size of each value relative
to the others) and the first column (Points) indicates the original row of each
value. In the above table, Excel has sorted the values according to their ranks.
The first column indicates the original position of each value. We have to sort
the data with respect to this first column, so that the values return to their
original order. We then repeat these actions for the second set of data and
calculate the correlation coefficient of the ranks of the values. Attention must
be paid to the sequence of the actions described: the ranks must be calculated
separately for each data set, and the sorting needs to be done before calculating
the correlation coefficient. For the data used in this example, Spearman’s
correlation coefficient was calculated to be 0.020483, whereas the correlation
calculated using SPSS is 0.009. The reason for this difference is the way SPSS
deals with values that have the same rank: it assigns to all of them the average
of their ranks. That is, if three values are equal (so their ranks are tied),
SPSS assigns to each of these three values the average of the ranks they occupy
(Excel does not do this).
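The averaging of tied ranks is the step that is missing in the Excel procedure, so it is worth seeing spelled out. The following Python sketch ranks each data set separately with average ranks for ties and then computes Pearson's correlation of the ranks. The function names and the data are hypothetical. Ranks are assigned in ascending order here; the direction does not change the coefficient as long as both data sets are ranked the same way.

```python
from math import sqrt


def average_ranks(values):
    """Ascending ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # average of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's coefficient: Pearson's correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data containing ties.
x = [3, 1, 4, 1, 5, 9, 2, 6]
y = [2, 7, 1, 8, 2, 8, 1, 8]
print("ranks of x:", average_ranks(x))
print("Spearman  :", spearman(x, y))
```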
1. Step 1: Subtract all the values from the given median (i.e. 320-Xi, i=1,2, …, n,
where n=sample size).
2. Step 2: In a new column calculate the absolute values of these differences.
3. Step 3: Calculate the ranks of the absolute values.
4. Step 4: Using the logical function IF, assign 1 if the differences in the
second column are positive and -1 if they are negative.
5. Step 5: Multiply the 4th and the 5th columns to get the ranks with a sign
(plus/minus).
6. Step 6: Define a last column to hold the squared ranks.
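Table 13 summarizes all of the above. All of the tedious work is complete;
the rest is mere detail. In cases when there are values with the same ranks (i.e.
ties) we use this formula for the test:
$$\text{test statistic} = \frac{\sum_{i=1}^{n} R_i}{\sqrt{\sum_{i=1}^{n} R_i^{2}}}$$
where $R_i$ is the signed rank of the i-th value (the 6th column).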
We calculate the sum of the 6th column and the square root of the sum of the
7th column. Finally, we divide the first sum by the square root of the second sum
to get the test statistic. In this example, the sum of the signed ranks is equal
to 289, the sum of squared ranks is equal to 117363 and its square root is equal
to 342.5828. The test statistic is 289 divided by 342.5828, which is equal to
0.8436. SPSS provides a slightly different test statistic due to its different
handling of tied ranks and its use of a different test statistic. There is also
another way to calculate a test statistic, namely by taking the sum of the
positive ranks. Both Minitab and SPSS calculate another type of test statistic,
which is based on either the positive or the negative ranks. It is worth
mentioning that the second formula is better used when there are no tied ranks.
Irrespective of the test statistic used, the conclusion regarding the rejection
of the null hypothesis will be the same. Using the second formula the result is
1401, whereas Minitab provides a result of 1231.5. As for the result of the test
(reject the null hypothesis or not), one must look at the tables for the
one-sample Wilcoxon signed rank test. The fact that Excel does not offer options
for calculating the probabilities used in non-parametric tests, in conjunction
with the tedious work involved, makes it less popular for this use.
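The six steps and the two test statistics can be put together in a short Python sketch. Two choices here are our own assumptions and are not stated in the notes: tied absolute differences receive the average of their ranks, and differences equal to zero are dropped before ranking. The function names and the data are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Ascending ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def signed_rank_statistics(data, median):
    """Return (sum of signed ranks, standardized statistic, sum of positive ranks)."""
    # Step 1: differences from the hypothesized median (zeros dropped: assumption).
    diffs = [median - x for x in data if median - x != 0]
    # Steps 2 and 3: ranks of the absolute differences (average ranks: assumption).
    ranks = average_ranks([abs(d) for d in diffs])
    # Steps 4 and 5: attach the sign of each difference to its rank.
    signed = [r if d > 0 else -r for r, d in zip(ranks, diffs)]
    # Step 6: squared ranks, then the statistic used when there are ties.
    sum_signed = sum(signed)
    statistic = sum_signed / sqrt(sum(r ** 2 for r in signed))
    # Alternative statistic: the sum of the positive ranks.
    sum_positive = sum(r for r in signed if r > 0)
    return sum_signed, statistic, sum_positive


# Hypothetical data tested against a median of 320.
print(signed_rank_statistics([310, 325, 300, 330, 318, 341], median=320))

# The arithmetic of the example in the text: 289 / sqrt(117363).
print(round(289 / sqrt(117363), 4))
```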
Table 14: Procedure of the Wilcoxon Signed Rank Test with Paired Data