
Statistics

Centennial College SETAS Custom Text

Written by Mandy Lam


Math210
Table of Contents

Unit 1: Graphical Representation of Data
1.1 – Introduction of Data
1.2 – Frequency Polygon
1.3 – Histogram
1.4 – Stem-and-Leaf Plot
Unit 2: Descriptive Statistics
2.1 – Measures of Central Tendency
2.2 – Measures of Variation
2.3 – Measures of Position
Unit 3: Descriptive Methods for Bivariate Data
3.1 – The Least Squares Regression Line
3.2 – The Linear Correlation Coefficient
Unit 4: Discrete Probability Distributions
4.1 – Introduction to Probability and Random Variables
4.2 – Introduction to Discrete Probability Distributions
4.3 – The Mean and Variance of a Probability Distribution
4.4 – Binomial Probability Distribution
4.5 – The Mean and Variance of a Binomial Probability Distribution
Unit 5: Continuous Probability Distributions
5.1 – Introduction to Continuous Probability Distributions
5.2 – Normal and The Standard Normal Distribution
5.3 – Central Limit Theorem
Unit 6: Statistical Inference
6.1 – Estimating Population Mean for Large Samples
6.2 – Estimating Population Mean for Small Samples
6.3 – Estimating Population Proportion for Large and Small Samples
6.4 – Hypothesis Testing of Population Mean of a Large Sample
6.5 – Hypothesis Testing of Population Mean of a Small Sample
6.6 – Hypothesis Testing of Population Proportion of a Large Sample
Lab Activities
Answers

Unit 1: Graphical Representation of Data

1.1 – Introduction of Data

Statistics is a collection of methods for planning experiments, obtaining data, and then
organizing, summarizing, presenting, analyzing, interpreting and drawing conclusions based on
data. It allows us to draw inferences about populations by analyzing samples of those
populations.

Plan experiment → Collect data → Organize, summarize, present, analyze, interpret, and draw conclusions

There are different types of data. First, we can distinguish between Quantitative data and
Qualitative data.
Quantitative data is numerical (represents counts or measurements).
Qualitative data is categorical.
For example, ages of election candidates would be quantitative data. The political party
affiliations of the election candidates would be qualitative data. They could be affiliated with
the Liberal Party, Conservative Party, New Democratic Party, or the Green Party. These are
considered to be qualitative data because they are not numerical or quantifiable. By contrast,
the number of people supporting the Liberal Party, determined using a survey, would be
quantitative data because it is a count.
Quantitative data can be further distinguished as either discrete or continuous data.
Discrete data results from either a finite number of possible values or a countable
number of values.
Continuous data results from infinitely many possible values that can be associated with
points on a continuous scale, in such a way that there are no gaps or interruptions.

For example, the number of students studying statistics is discrete data. The number of
students is counted, so the values will always be whole numbers. There will never be values
such as 2.5 or 3.678 when we discuss the number of students in a class.
Continuous data are usually values that are measured. These include temperature, volume,
time, weight, and height. These data values can be associated with points on a continuous
scale.

Quantitative data that are collected can be also either univariate data or bivariate data.
Univariate data are data that have one variable.
Bivariate data are data that have two variables.
This means that if the height of all the students in the class was collected, the data set would be
univariate since height is the only variable. If the height and weight of all the students in the
class are collected, the data set would be bivariate since there are two variables, height and
weight. Collecting bivariate data usually serves the purpose of determining the relationship
between the two variables. For example, collecting data on number of cigarettes smoked per
day and lifespan of the smoker would allow us to determine the relationship between the
amount of smoking and how long a smoker would live.
It is important to understand types of data because appropriate graphical representations of
data would depend on the type of data.

Types of Graphical Representation

Bar Graph
Bar graphs are appropriate for displaying
frequency in each category.
This bar graph shows the number of students
who had chosen each of the sports as their
favorite sport.
Soccer, softball, basketball, and other sports are
qualitative groups.
The number of students is the frequency in
each qualitative group.
Bar graphs allow a visual comparison among
the different groups.

Pictograms
Pictograms are similar to bar graphs. They
show qualitative categories and the frequency
of each category. Rather than using bars to
show frequency, a picture is usually used to
represent the frequency.
In this example, the pictogram shows that
there were 2000 visitors in January and that
March had the highest number of visitors.

Circle Graph
Circle graphs, also known as Pie Charts, are
appropriate for showing the portion of a
category with respect to a whole.
In this example, each category of the budget is
shown as a section of the circle that represents
the portion it is with respect to the total
amount of the budget.

Scatterplot
Scatterplots are appropriate to display
bivariate data. They are helpful in describing
relationships between the two variables of the
data.
In this case, the two variables are grip strength
and arm strength. Each point represents a
single person the data was collected from. One
could conclude from this visual representation
that in general, the greater the grip strength a
person has, the greater the arm strength they
would have as well.

Line graphs

Line graphs are appropriate for displaying
changes over a period of time. For example,
the rate of the river discharge was collected
once a month through the year. This line
graph shows and tracks the change over the
period of one year.

Practice

1) Identify the type of data collected. State whether it is quantitative or qualitative.
If quantitative, state whether it is discrete or continuous data.

a) The temperatures over 3 days at a Mexico resort: 37°C, 34°C, 38°C.

b) The speed of cars going on the highway

c) The distances measured between the target and the darts shot at the target

d) A survey collected data on a scale of 1 to 5 about customer satisfaction

e) The brands of vehicle with the highest sales in 2016

f) The unemployment rate in Ontario collected over the past 6 months

g) Mercury levels in fish measured in parts per million

h) Body temperatures of patients

i) Movie ratings of movies currently in the box office.

j) Student clubs at Centennial College.

2) Determine the graphical representation that is the most appropriate for each data set.

a) The temperatures over 10 days in Toronto

b) The brands of vehicle and their sales

c) The unemployment rate in Ontario collected over the past 6 months

d) Mercury levels in fish and mercury levels in the water the fish live in

e) Movie ratings of movies currently in the box office.

f) Number of students in each student club at Centennial College with respect to the total
number of students who belong in clubs.

3) Given the following collection of data, complete the following questions:

Year    Number of college students using electronic textbooks
2010    223
2011    323
2012    356
2013    301
2014    521
2015    563

a) What type of data is this?

b) What is the most appropriate graph? Explain.

c) Sketch the graph.

4) Given the following collection of data, complete the following questions:

Ways new college graduates use to find a job    Frequency
Social media                                    556
Job Search Engines                              44
Networking                                      280
Mass mailing                                    21

a) What type of data is this?

b) What is the most appropriate graph? Explain.

c) Sketch the graph.

1.2 – Frequency Polygon

Frequency of data refers to the number of times distinct values occur in a collection of data.
A frequency distribution shows how data are partitioned among several categories or classes
by listing the categories along with the number of data values in each of them.

In general, to create a frequency distribution, the first step is to create a Frequency
Distribution Table. One needs to decide how the data will be organized. Data is sorted into
different classes, which are intervals. The frequency is recorded by counting the number of
data values belonging in each class.

Example 1
Create a frequency distribution table for the following set of data.
12 13 11 9 12 15 16 17 18 20 10

Observing the data set, a reasonable interval is 5. So, begin by creating the following
classes.

Classes Frequency
5-9
10-14
15-19
20-24

Notice that each class is equal in the length of the interval. From 5 to 9, five values are
included. From 10-14, five values are also included. This means that the class width is 5.
From each class, the lower number is known as the lower class limit, while the larger
number is known as the upper class limit. In the class 20-24, 20 is the lower class limit,
24 is the upper class limit. The difference between a lower class limit and the
subsequent lower class limit should always be equal to the class width.
Another important feature is that an upper class limit and the lower class limit of the
next class will never be the same. That is to say, the class after 5-9 will never be 9-14
and it will always begin with 10. This is to avoid data values being sorted into multiple
classes, which will be counted more than once in frequency.

To complete the frequency distribution table, count the number of values from the data
set that belong in each class.

Classes Frequency
5-9 1
10-14 5
15-19 4
20-24 1
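
The counting step can also be checked with a short script. The following is a minimal Python sketch rather than part of the course method; the function name `frequency_table` and its parameters are illustrative assumptions. It assumes whole-number data, so each upper class limit is one less than the next lower class limit.

```python
# Sketch: tally the Example 1 data into classes of width 5.
data = [12, 13, 11, 9, 12, 15, 16, 17, 18, 20, 10]
first_lower_limit = 5   # lower class limit of the first class
class_width = 5


def frequency_table(values, first_lower, width):
    """Return a list of (lower limit, upper limit, frequency) tuples."""
    table = []
    lower = first_lower
    while lower <= max(values):
        upper = lower + width - 1   # upper class limit for whole-number data
        count = sum(lower <= v <= upper for v in values)
        table.append((lower, upper, count))
        lower += width              # next lower limit is one class width away
    return table


table = frequency_table(data, first_lower_limit, class_width)
for lower, upper, count in table:
    print(f"{lower}-{upper}: {count}")
```

The printed frequencies should match the completed table: 1, 5, 4, and 1.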

One type of graph that shows the frequency distribution of a data set is the Frequency Polygon.
A Frequency Polygon is constructed by taking the following steps:
1) Create a Frequency Distribution Table
2) Find the Class Mark, the value halfway between the upper and lower limits of each
class.
3) The class marks will be labelled on the x-axis
4) The y-axis will be used to record frequency
5) Plot points to indicate the frequency of each class
6) Connect the points with line segments

Example 2
Create a frequency polygon for the following set of data.
12 13 11 9 12 15 16 17 18 20 10

1) Create a frequency distribution Table

Classes Frequency
5-9 1
10-14 5
15-19 4
20-24 1

2) Find the Class Marks

The class mark in the first class, 5-9, is the value halfway between 5 and 9. That would be 7,
which can be obtained by finding the average of 5 and 9.
Thus, the class marks are 7, 12, 17, and 22, respectively for each class.
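
As a quick check of step 2, the class marks can be computed as the average of each pair of class limits. This is a small Python sketch; the variable names are illustrative only.

```python
# Sketch: class marks are the averages of the lower and upper class limits.
classes = [(5, 9), (10, 14), (15, 19), (20, 24)]
frequencies = [1, 5, 4, 1]

class_marks = [(lower + upper) / 2 for lower, upper in classes]

# Each point of the frequency polygon pairs a class mark (x) with a frequency (y).
points = list(zip(class_marks, frequencies))
print(class_marks)
print(points)
```

The pairs in `points` are the coordinates that would be plotted and joined with line segments in steps 5 and 6.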

3) Label the x-axis with the class mark and
4) Label the y-axis with the frequency.

5) Plot points to indicate the frequency of each class

6) Connect the points with line segments

Practice

1) Complete the frequency distribution for the following data set


8.40 1.50 1.20 6.10 2.30
4.00 5.90 4.80 2.00 7.70

Classes Frequency
0.00-1.99
2.00-3.99

2) Create a frequency distribution for the following data set using a class width of 100.

350 550 250 170 490


510 350 350 630 780

3) The following set of data was collected from oxygen level readings of a local lake.
Readings are measured in mg/L.
Create a frequency distribution for the following data set by deciding on a
reasonable class width.

3.4 1.8 0.8 0.7 0.8


7.9 1.1 1.4 0.6 1.7

4) Create a frequency polygon to represent the frequency distribution of the above
data set of oxygen level readings.

5) The following set of data was collected from soil samples to understand diversity of
microorganisms. The collected microorganism DNA was measured in μg/g of soil.
Create a frequency polygon to represent the frequency distribution of the data.
7.34 2.83 3.62 3.24 2.87
2.93 10.59 1.56 3.51 1.72
11.17 8.19 25.00 19.12 9.55

1.3 – Histogram

A Histogram is another graphical representation of the frequency distribution of a collection of
data. Unlike the Frequency Polygon, bars are used to represent the frequency of data in each
class. Rather than using Class Marks to label the x-axis, Class Boundaries are used to label the
x-axis when creating a histogram.
A Class Boundary is a value that is halfway between the upper class limit of a class and the
lower class limit of its subsequent class. In other words, it is the value that separates two
classes.
A Histogram is constructed by taking the following steps:
1) Create a Frequency Distribution Table
2) Find the class boundaries
3) The class boundaries will be labelled on the x-axis
4) The y-axis will be used to record frequency
5) Draw bars to indicate the frequency of each class

Example
Create a histogram for the following set of data.
12 13 11 9 12 15 16 17 18 20 10

1) Create a frequency distribution Table

Classes Frequency
5-9 1
10-14 5
15-19 4
20-24 1

2) Find the Class boundaries

The class boundary between the first two classes is 9.5, since it is the halfway value
between 9, the upper class limit, and 10, the lower class limit of its subsequent class.
Thus, the class boundaries between the classes are 9.5, 14.5, 19.5. For the graph, we
also need to include the lowest boundary and the highest upper boundary. These are
4.5, which is just below the first class, and 24.5, which is just above the last class.
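
The boundaries can be computed directly from the class limits. The sketch below is one possible Python version; it assumes evenly spaced classes, and the names `classes`, `gap`, and `boundaries` are illustrative.

```python
# Sketch: class boundaries sit halfway between one class's upper limit
# and the next class's lower limit.
classes = [(5, 9), (10, 14), (15, 19), (20, 24)]

# Distance between an upper limit and the next lower limit (1 for whole numbers).
gap = classes[1][0] - classes[0][1]

boundaries = [classes[0][0] - gap / 2]      # lowest boundary, just below the first class
for lower, upper in classes:
    boundaries.append(upper + gap / 2)      # boundary just above each class

print(boundaries)
```

The result lists the five values used to label the x-axis of the histogram, from 4.5 up to 24.5.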

3) Label the class boundaries on the x-axis
These will be 4.5, 9.5, 14.5, 19.5, 24.5

4) Label the frequencies on the y-axis
These will be 1, 2, 3, 4, and 5, since the highest frequency is 5.

5) Draw bars to indicate the frequency of each class.

There are no gaps between the bars because the x-axis is representing intervals of
numbers and not categories. It is important not to confuse the histogram with bar
graphs.
Unlike a frequency polygon, the bars of a
histogram show a more rigid shape of the
frequency distribution of the data. It would
be easier to visualize if a curve is drawn
closely above the bars. Using the curve, we
can describe the shape of the frequency
distribution.

Types of Distribution Shapes

If the shape of the frequency distribution is
symmetrical and has the shape of a bell, this
is known as the normal distribution.

If the distribution has higher frequencies on
the left side, it will result in a shape with a
right tail.
This type of distribution is known as skewed
right.

If the distribution has higher frequencies on
the right side, it will result in a shape with a
left tail.
This type of distribution is known as skewed
left.

If the distribution curve is drawn flat, it
shows that the frequencies of all the classes
are equal.
This type of distribution is known as a
uniform distribution.

If the distribution curve has two peaks,
indicating that there are two classes with the
highest frequencies, the distribution is a
bimodal distribution.

Potential Outliers

Observing the frequency distribution is also helpful for identifying potential outliers.
The histogram below represents the data set to its left.
Since the bar on the right is away from the cluster of the data, we can estimate that the
data value 70 is potentially an outlier.

Practice
1) State the class boundaries of the following given classes

Classes Frequency
5-9 11
10-14 51
15-19 24
20-24 12

2) The following data set shows students’ test scores in percentage.
79 88 37 91 55
67 80 65 75 77

a) Create a histogram to represent the data.
b) Describe the shape of the distribution.
c) Using the histogram, estimate the class average of this test.
d) What data value(s) would be considered as a potential outlier?
e) Does the shape of the frequency distribution change if the potential outlier were
excluded from the data?
f) Observing the frequency distribution, what conclusions can be drawn about the test
results?

3) The following data set shows the departure delay times in hours of 16 flights
2 1 2 2 2 3 5 2
11 7 0 3 8 19 4 1

a) Create a histogram to represent the data.


b) Describe the shape of the distribution
c) Which data value(s) would be considered as a potential outlier?
d) What events might have led to the record of 19 hours of departure delay?
e) Observing the frequency distribution, how might airline companies use this
information?

4) The following data set shows the HDL cholesterol levels of 27 adult males
44 33 48 46 62 45 43 39 41
47 46 56 38 37 61 41 56 59
40 75 55 48 50 57 38 41 44

a) Create a histogram using a class width of 10.
b) Describe the shape of the distribution
c) Create another histogram using a class width of 20.
d) How do the two histograms compare?
e) Which histogram would you choose to represent the data?

1.4 – Stem-and-Leaf Plot

The third graphical representation of the frequency distribution of a data set is the Stem-and-
Leaf plot.
To create a Stem-and-Leaf Plot, data are also sorted into classes. Unlike the frequency polygon
and the histogram, the classes in the Stem-and-Leaf Plot will not be represented by Class Marks
or Class Boundaries. The classes in the Stem-and-Leaf Plot are represented by the leftmost
digit(s), which will make up the stem. The leaves of the plot will always consist of single digits
only, each being the rightmost digit of a data value.

Example 1
The following data set is the test scores in percentage of a class. Create a stem-and-leaf plot
to represent the frequency distribution of the data.
15 76 68 55 40
89 87 73 74 81

1) First, create the stem using the leftmost digit. In this case, the stem will consist of
the tens digit.

2) Each value of the stem represents a class. For instance, the 1 of the stem represents
the class 10-19, and the 2 represents the class 20-29, etc.

3) Next, to create the leaves, list the rightmost digit of the data values in its
appropriate class. The leaves will be listed from least to greatest with the lowest
value closest to the stem.

It is important that the leaves are listed with equal spacing among them in order to
represent the frequency distribution accurately.

If we try to visualize a curve drawn over the leaves, we should recognize a frequency
distribution that is skewed left, since there are more values in the higher classes.
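
The three steps can be sketched in Python as follows. This is an illustrative sketch, not the course's required method; the function name `stem_and_leaf` is an assumption, and the code assumes non-negative whole-number data so that the leaf is always the ones digit.

```python
from collections import defaultdict

# Example 1 test scores
scores = [15, 76, 68, 55, 40, 89, 87, 73, 74, 81]


def stem_and_leaf(values):
    """Map each stem (all digits except the last) to its sorted leaves."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # leaf is the rightmost digit
        plot[stem].append(leaf)
    # Keep empty stems so that gaps in the distribution stay visible.
    return {s: plot.get(s, []) for s in range(min(plot), max(plot) + 1)}


plot = stem_and_leaf(scores)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Printing every leaf with the same width and a single space keeps the spacing equal, so the rows with more leaves appear longer, as in a hand-drawn plot.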

Example 2
The following data set is the test scores of a class. Create a stem-and-leaf plot
to represent the frequency distribution of the data.
125 176 200 123 226
189 142 173 162 181

Because the leaves must consist of only single digits, the values of the stem will
include 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22.
The leaves will be the ones digit of the data values.

The shape of the frequency distribution is bimodal.

Variations of the Stem-and-Leaf Plot
There are several variations of the stem-and-leaf plot to accommodate data sets. The purpose is
to represent the data in a more meaningful way, to allow a better visualization of the frequency
distribution, or to compare two sets of data.
The first variation is known as the splitting stem-and-leaf plot.
Example 1
The following data set is the weights in pounds of large dogs
79 95 77 78 85 74
76 77 78 80 70 72

Create a stem-and-leaf plot


The values of the stem will include 7, 8, and 9.

With 12 data values and only 3 classes on the stem, it is difficult to see the
frequency distribution with most of the data clustered in the first class, 70-79.
To represent the data’s frequency distribution better, a splitting stem-and-leaf plot
can be used. Creating a splitting stem-and-leaf plot essentially allows the number of
classes to increase by “splitting” the classes.

The new plot shows that the stem consists of 7, 7, 8, 8, 9, and 9, where the classes
are 70-74, 75-79, 80-84, 85-89, etc. The class width of each class is now 5 rather
than 10.

Both the shapes of the frequency distribution are skewed right, but the splitting
stem-and-leaf plot also shows that the highest frequency of data occurs between
75-79.
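
A splitting stem-and-leaf plot can be sketched in Python by listing each stem twice, once for the leaves 0 to 4 and once for the leaves 5 to 9. The function name `split_stem_and_leaf` is illustrative, and the sketch assumes whole-number data.

```python
# Weights in pounds of large dogs (Example 1)
weights = [79, 95, 77, 78, 85, 74, 76, 77, 78, 80, 70, 72]


def split_stem_and_leaf(values):
    """Return (stem, leaves) rows; each stem appears twice (leaves 0-4, then 5-9)."""
    rows = []
    for stem in range(min(values) // 10, max(values) // 10 + 1):
        leaves = sorted(v % 10 for v in values if v // 10 == stem)
        rows.append((stem, [leaf for leaf in leaves if leaf <= 4]))
        rows.append((stem, [leaf for leaf in leaves if leaf >= 5]))
    return rows


rows = split_stem_and_leaf(weights)
for stem, leaves in rows:
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

The second row, which represents the class 75-79, holds six leaves, more than any other row.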
The second variation is the side-by-side stem-and-leaf plot. This variation allows the
representation of two sets of data, which is convenient for comparison.
Example 2
The two data sets are gas prices recorded over 10 days in 2010 and 2016.
2010:
110.1 112.3 110.4 109.5 111.9
111.4 111.5 109.5 109.7 110.2

2016:
95.7 95.4 96.7 95.2 94.1
93.5 93.5 93.2 92.1 91.0

Create a stem-and-leaf plot to compare the two data sets.


A side-by-side stem-and-leaf plot will have one stem and two sets of leaves.
Considering both sets of data, the values of the stem will include 91, 92, 93, …, 112.
List the leaves using the tenths digit of the data. Data of 2016 are plotted on the
right of the stem and data of 2010 are plotted on the left of the stem.

Observing the side-by-side stem-and-leaf plot, we can conclude the following:
- The data in 2016 has a bimodal frequency distribution.
- The data in 2010 has a skewed right distribution because higher frequencies are found
in the lower classes
- In comparison, gas prices in 2016 are generally lower than gas prices in 2010.

Practice
1) The following data set is the weights in pounds of different models of compact cars.
2779 2795 2600 2775 2780
2775 2800 2753 2680 2560

a) Determine an appropriate class width and values for the stem


b) Create a stem-and-leaf plot for the data set
c) Describe the frequency distribution of the data set
d) What values could be considered as potential outliers?

2) The following data set is the age of college students in their first semester of the
Engineering program.
18 18 17 19 18
19 25 23 21 18
20 19 56 32 19
22 23 26 51 20

a) Determine an appropriate class width and values for the stem.


b) Create a stem-and-leaf plot for the data set.
c) Describe the frequency distribution of the data set.
d) What values could be considered as potential outliers?
e) If the college’s Engineering department is planning to open a new program
specifically designed for middle-aged students (students over 40), would this be a
good idea? Explain using your observations of the data set.

3) The following data set is the voltage measurements from a home.

123.3 123.4 123.3 123.3 123.6


123.5 123.5 123.7 123.6 123.7
123.9 124.0 124.2 123.9 123.8
123.8 123.8 124.0 124.1

a) Create a splitting stem-and-leaf for the data set.


b) Explain why a splitting stem-and-leaf is more suitable in representing the data than
a stem-and-leaf plot.
c) Describe the shape of the frequency distribution.

4) The following data sets are test scores collected from two Statistics classes. Students
from Class 1 studied from a hardcopy textbook and students from Class 2 studied from
an electronic textbook.
Class 1:
87 90 25 76 72
73 60 85 82 75
80 79 81 61 70

Class 2:
77 62 50 53 52
76 15 99 90 64
69 70 63 71 50

a) Create a side-by-side stem-and-leaf plot for the two sets of data.


b) Describe the shape of each frequency distribution.
c) What conclusions can be made from the comparison of these two sets of data?
d) What assumptions are made when comparing these two sets of data?

Unit 2: Descriptive Statistics

2.1 – Measures of Central Tendency

Measures of Central Tendency are values that are used to describe the center position of a set
of data. The values include the mean, median, and the mode.

The mean, usually known as the “average”, is the measure of center found by adding the data
values and then dividing the sum by the number of data values there are in a set of data.

Different symbols are used to represent the mean depending on whether it is calculated from a
population or a sample set of data.

Population Mean: $\mu = \dfrac{\Sigma x}{N}$

Sample Mean: $\bar{x} = \dfrac{\Sigma x}{n}$

The symbol $\Sigma x$ refers to the sum of all data values. The values $N$ and $n$ refer to the
population size and sample size, respectively. The population or sample size is the
number of data values in the data set.

Where necessary, the mean is rounded to one more decimal place than the original data values.

The median of a set of data is the measure of center that is the middle value when the data
values are arranged in order of increasing or decreasing magnitude.
The following formula computes the location of the median after the data is arranged.

Location of $\tilde{x} = \dfrac{n + 1}{2}$

The mode of a set of data is the value that occurs with the greatest frequency.

Depending on the frequency distribution of the data, in some cases, one measure of central
tendency would be a better representation than another measure.

Example 1

Collected from several households in a city, the following data set shows the weights of paper
recycled in kg per week.

6.96 6.83 11.42 16.08 6.38 13.05 11.36 15.09

Determine all the measures of central tendency for this sample.

The mean:

$\bar{x} = \dfrac{6.96 + 6.83 + 11.42 + 16.08 + 6.38 + 13.05 + 11.36 + 15.09}{8}$

$\bar{x} = 10.89625$

$\bar{x} \doteq 10.896$ kg

The median:

First, sort the data

6.38 6.83 6.96 11.36 11.42 13.05 15.09 16.08

Then, determine the location of the median

Location of $\tilde{x} = \dfrac{8 + 1}{2} = 4.5$

The location of the median is the 4.5th value in the sorted data set.

Because the 4.5th value is between the 4th and the 5th value, the median will be
determined by taking the mean of 11.36 and 11.42.

$\tilde{x} = \dfrac{11.36 + 11.42}{2} = 11.39$

The mode:

In this case, each data value occurs only once. There is no mode.
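
The median calculation can be reproduced with a short Python sketch that follows the location formula $(n + 1)/2$ directly. The helper name `median_by_location` is an illustrative assumption, not a standard library function.

```python
def median_by_location(values):
    """Return (location, median) using the location formula (n + 1) / 2."""
    ordered = sorted(values)
    location = (len(ordered) + 1) / 2     # 1-based position in the sorted data
    below = ordered[int(location) - 1]    # value at the position rounded down
    if location.is_integer():
        return location, below            # odd n: the middle value itself
    above = ordered[int(location)]        # even n: the next value up
    return location, (below + above) / 2


weights = [6.96, 6.83, 11.42, 16.08, 6.38, 13.05, 11.36, 15.09]
location, median = median_by_location(weights)
mean = sum(weights) / len(weights)

print(f"mean = {mean:.3f} kg")
print(f"location of median = {location}")
print(f"median = {median:.2f} kg")
```

Because floating-point arithmetic can introduce tiny rounding differences, the results are formatted to the same number of decimal places used in the example.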

Example 2

The following data set is the age of college students in their first semester of the Engineering
program.
18 18 17 19 18
19 25 23 21 18
20 19 56 32 19
22 23 26 51 20

Determine all the measures of central tendency for this sample.

The mean:

$\bar{x} = \dfrac{484}{20}$

$\bar{x} = 24.2$ years old

The median:

Location of $\tilde{x} = \dfrac{20 + 1}{2} = 10.5$

The location of the median is the 10.5th value in the sorted data set.

Because the 10.5th value is between the 10th and the 11th value, the median will be
determined by taking the mean of 20 and 20, the 10th and 11th values in the sorted data set.

$\tilde{x} = 20$

The mode:

There are two modes. Both 18 and 19 have a frequency of 4.
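
For larger data sets, Python's built-in `statistics` module can compute all three measures. This is a minimal sketch; `statistics.multimode` requires Python 3.8 or later, and the variable names are illustrative.

```python
import statistics

# Ages of first-semester Engineering students (Example 2)
ages = [18, 18, 17, 19, 18,
        19, 25, 23, 21, 18,
        20, 19, 56, 32, 19,
        22, 23, 26, 51, 20]

mean = statistics.mean(ages)
median = statistics.median(ages)     # averages the two middle values when n is even
modes = statistics.multimode(ages)   # every value tied for the highest frequency

print(f"mean = {mean}, median = {median}, modes = {modes}")
```

Note that when every value occurs exactly once, `multimode` returns all of the values, which corresponds to the "no mode" case in Example 1.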

Interpretation of the Measures of Central Tendency

From the above example:

The measures of central tendency of the data set were:

Mean = 24.2
Median = 20
Mode = 18 and 19

Observe the frequency distribution of the data set

We can describe the frequency distribution as skewed right. Data sets that have a
skewed right frequency distribution tend to have a higher mean than median. In this
case, the mean is affected by the extreme data values such as 51 and 56, which
bring the overall average higher. Since the median is location-dependent rather
than value-dependent, the median tends to be a measure of central tendency that
is not affected by extreme values.

Depending on the situation, one measure of central tendency might be a better
representation than another. In this example, because the distribution is skewed,
the median provides a better central value of the college students’ age in the
Engineering program.

In general, the following tends to be true for measures of central tendency:


- In a normal distribution, the mean, median, and mode will be the same.
- In a skewed right distribution, the mean tends to be greater than the median, which is
greater than the mode.
𝑚𝑒𝑎𝑛 > 𝑚𝑒𝑑𝑖𝑎𝑛 > 𝑚𝑜𝑑𝑒
- In a skewed left distribution, the mean tends to be lower than the median, which is
lower than the mode.
𝑚𝑒𝑎𝑛 < 𝑚𝑒𝑑𝑖𝑎𝑛 < 𝑚𝑜𝑑𝑒

Practice

1) For the given data sets, determine all the measures of central tendency.

Weights of Household    Magnitudes of    Levels of Harmful Food Additive (PFC)
Organic Wastes (kg)     Earthquakes      found in Food Packaging (%)
2.73                    0.20             0.0504
9.31                    1.64             0.1800
3.59                    1.32             0.0110
5.36                    2.95             0.6010
1.47                    0.90             1.0900
7.06                    1.76             0.0210
2.52                    1.01
                        1.26
                        0.00
                        0.65

2) For each of the above data sets, observe the measures of central tendency and
predict the shape of the frequency distribution for each.

2.2 – Measures of Variation

Measures of variation describe the dispersion of the data in a data set.
The measures of variation are the range, standard deviation, and variance.

The range of the data values is the difference between the maximum data value and the
minimum data value.

𝑅 = 𝑟𝑎𝑛𝑔𝑒 = 𝑚𝑎𝑥𝑖𝑚𝑢𝑚 𝑑𝑎𝑡𝑎 𝑣𝑎𝑙𝑢𝑒 − 𝑚𝑖𝑛𝑖𝑚𝑢𝑚 𝑑𝑎𝑡𝑎 𝑣𝑎𝑙𝑢𝑒

The standard deviation of a set of values is a measure of how much data values deviate away
from the mean.
A slightly different formula is used to calculate the standard deviation of a population
versus the sample.
There are two formulas that could be used to calculate the sample standard deviation.
The first one involves the mean, whereas the second one does not.

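Population: $\sigma = \sqrt{\dfrac{\Sigma(x - \mu)^2}{N}}$

Sample: $s = \sqrt{\dfrac{\Sigma(x - \bar{x})^2}{n - 1}}$

Sample (without the mean): $s = \sqrt{\dfrac{n(\Sigma x^2) - (\Sigma x)^2}{n(n - 1)}}$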
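
The two sample formulas give the same result. A short derivation, sketched here using only the definition $\bar{x} = \Sigma x / n$, shows why:

```latex
\begin{align*}
\Sigma(x - \bar{x})^2
  &= \Sigma x^2 - 2\bar{x}\,\Sigma x + n\bar{x}^2 \\
  &= \Sigma x^2 - 2\,\frac{(\Sigma x)^2}{n} + \frac{(\Sigma x)^2}{n} \\
  &= \Sigma x^2 - \frac{(\Sigma x)^2}{n}
   = \frac{n(\Sigma x^2) - (\Sigma x)^2}{n} \\[4pt]
s = \sqrt{\frac{\Sigma(x - \bar{x})^2}{n - 1}}
  &= \sqrt{\frac{n(\Sigma x^2) - (\Sigma x)^2}{n(n - 1)}}
\end{align*}
```

The second form is convenient for hand calculation because it needs only the totals $\Sigma x$ and $\Sigma x^2$, not the individual deviations from the mean.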

The variance of a set of values is a measure of variation equal to the square of the standard
deviation.

Population: $\sigma^2 = \dfrac{\Sigma(x - \mu)^2}{N}$

Sample: $s^2 = \dfrac{\Sigma(x - \bar{x})^2}{n - 1}$

Example 1

The following set of data shows the amount of time in minutes students took to complete an
online quiz.
Determine the range, standard deviation, and variance.

12 18 17 10 11
52 12 15 16 10

The range:
The maximum value is 52
The minimum value is 10
𝑅 = 52 − 10
𝑅 = 42

Standard deviation:
First, find the mean of the data set
$\bar{x} = 17.3$

$s = \sqrt{\dfrac{\Sigma(x - \bar{x})^2}{n - 1}}$
Observing the formula, next, find the square of the difference between each
data value and the mean.

$x$     $x - \bar{x}$          $(x - \bar{x})^2$
12      12 − 17.3 = −5.3       28.09
18      0.7                    0.49
17      −0.3                   0.09
10      −7.3                   53.29
11      −6.3                   39.69
52      34.7                   1204.09
12      −5.3                   28.09
15      −2.3                   5.29
16      −1.3                   1.69
10      −7.3                   53.29

The sum of the squares of the difference between each data value and the
mean:
$\Sigma(x - \bar{x})^2 = 1414.10$

Calculating standard deviation

$s = \sqrt{\dfrac{\Sigma(x - \bar{x})^2}{n - 1}}$

$s = \sqrt{\dfrac{1414.10}{10 - 1}}$

$s = 12.5348403349314$
The standard deviation is rounded to one more decimal place
than the given data.

$s \doteq 12.5$

This means that on average, the data values are 12.5 minutes deviating from the
mean of 17.3 minutes.

Variance:

$s^2 = (12.5348403349314)^2$
$s^2 \doteq 157.1$
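
The hand calculation can be verified with a short Python sketch that evaluates both sample standard deviation formulas, the one involving the mean and the one that does not. The variable names are illustrative.

```python
import math
import statistics

# Time in minutes to complete the online quiz (Example 1)
times = [12, 18, 17, 10, 11, 52, 12, 15, 16, 10]
n = len(times)
mean = sum(times) / n

# Formula that involves the mean: sum of squared deviations over n - 1
s_from_mean = math.sqrt(sum((x - mean) ** 2 for x in times) / (n - 1))

# Formula that does not involve the mean: uses only the totals of x and x squared
sum_x = sum(times)
sum_x_squared = sum(x * x for x in times)
s_shortcut = math.sqrt((n * sum_x_squared - sum_x ** 2) / (n * (n - 1)))

# The standard library computes the sample variance directly
variance = statistics.variance(times)

print(f"s (with mean)    = {s_from_mean:.1f}")
print(f"s (without mean) = {s_shortcut:.1f}")
print(f"s^2              = {variance:.1f}")
```

Both formulas should agree to within floating-point rounding, and the variance equals the square of the standard deviation.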

Variance is a measure of variation that is influenced greatly by outliers.


Observe the set of data given in this example:
12 18 17 10 11
52 12 15 16 10

The data value 52 is a potential outlier.


If this value is taken out of the data set and we calculate the measures of variation
again, then the following are the range, standard deviation, and variance:
$R = 8$
$s \doteq 3.1$
$s^2 \doteq 9.53$

Compared to the previous measures of variation, where
$R = 42$
$s \doteq 12.5$
$s^2 \doteq 157.1$
variance is the measure that tends to be most dramatically influenced by potential
outliers.
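
The comparison can be reproduced with the following Python sketch, which computes the three measures with and without the value 52. The helper name `variation` is an illustrative assumption.

```python
import statistics


def variation(values):
    """Return (range, sample standard deviation, sample variance)."""
    return (max(values) - min(values),
            statistics.stdev(values),
            statistics.variance(values))


times = [12, 18, 17, 10, 11, 52, 12, 15, 16, 10]
trimmed = [t for t in times if t != 52]   # drop the potential outlier

with_outlier = variation(times)
without_outlier = variation(trimmed)

for label, (r, s, var) in (("with 52", with_outlier), ("without 52", without_outlier)):
    print(f"{label}: R = {r}, s = {s:.1f}, s^2 = {var:.2f}")
```

Removing the single value 52 shrinks the range by a factor of about 5 and the standard deviation by a factor of about 4, but the variance by a factor of about 16, because the variance squares each deviation.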

Why are measures of variation useful?

The measures of variation are useful for describing the dispersion of data. This is because
strictly relying on measures of central tendency as a description of a set of data might not be
enough and might be even misleading in some cases.

Observe the following data sets:

Data set 1:
16 18 20

Data set 2:
2 18 34

Both data sets have a mean of 18 and a median of 18. However, just by observation, we can
conclude that Data set 2 is more dispersed than Data set 1. The values in Data set 1 are more
consistent with one another. On the other hand, values in Data set 2 are further apart from one
another. Calculating measures of variation can confirm this observation:

Data set 1                                    Data set 2
$R = 20 - 16 = 4$                             $R = 34 - 2 = 32$
$s = 2$                                       $s = 16$
$s^2 = 4$                                     $s^2 = 256$
The range of the data set is 4                The range of the data set is 32
On average, data values are deviating by 2    On average, data values are deviating by 16
from the mean                                 from the mean

As a measure of variation, standard deviation is particularly useful in helping us identify unusual
values. Unusual values in a data set are those that deviate by more than 2 standard deviations
from the mean.

Example 2

The following set of data shows the amount of sodium measured in grams found in 5 samples
of canned food products.

3.4 2.5 5.5 3.2 11.0

Determine the measures of variation.


Determine, if any, the unusual values in the data set.

The range:
𝑅 = 11.0 − 2.5 = 8.5
The range of the data set is 8.5

Standard deviation:
𝑥̅ = 5.12

𝒙       𝒙 − 𝒙̅                    (𝒙 − 𝒙̅)^2
3.4     3.4 − 5.12 = −1.72       2.9584
2.5     −2.62                    6.8644
5.5     0.38                     0.1444
3.2     −1.92                    3.6864
11.0    5.88                     34.5744

$\sum (x - \bar{x})^2 = 48.2280$
$s = \sqrt{\dfrac{48.2280}{5 - 1}}$
$s = 3.47231910975935$
$s \doteq 3.47$

Variance:

$s^2 = (3.47231910975935)^2$
$s^2 \doteq 12.06$

If there are any unusual values, they would be more than 2 standard deviations away from the
mean.
𝑥̅ ± 2𝑠
= 5.12 ± 2(3.47)
= 5.12 ± 6.94

This means that values that are greater than 12.06 or less than −1.82 would be
considered as unusual values. There are no such values in the set of data.
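
The check for unusual values in Example 2 can be sketched in a few lines. This is an illustrative sketch only; the variable names are assumptions made here, and `statistics.stdev` is used because it applies the same sample formula (dividing by n − 1) as the text.

```python
import statistics

sodium = [3.4, 2.5, 5.5, 3.2, 11.0]  # grams, from Example 2

mean = statistics.mean(sodium)
s = statistics.stdev(sodium)  # sample standard deviation

# Values more than 2 standard deviations from the mean are "unusual".
lower = mean - 2 * s
upper = mean + 2 * s
unusual = [x for x in sodium if x < lower or x > upper]

print(f"mean = {mean:.2f}, s = {s:.2f}")
print(f"unusual if below {lower:.2f} or above {upper:.2f}")
print("unusual values:", unusual)
```

Even 11.0, which looks large next to the other values, falls inside the interval because the standard deviation itself is inflated by that value.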

Practice

1) The following data set shows the sales at a local grocery store over 6 days.
Values are in thousands of dollars.

23 44 31 32 25 30

Determine all the measures of variation for this sample.

2) The following data set is the weight of household organic wastes measured in kilograms
2.73 9.31 3.59 5.36 1.47 7.06 2.52

a) determine all the measures of variation


b) determine, if any, the unusual values in the data set

3) The following data set is the magnitude of earthquakes collected from a sample of 9
earthquakes.
0.27 1.64 1.32 3.44 0.91 1.76 1.01 1.26 0.02

a) determine all the measures of variation


b) determine, if any, the unusual values in the data set

2.3 – Measures of Position

Measures of Position are also known as Measures of Relative Standing. These are values that
express the location of a data value relative to all the other values in the same set of data.
In other words, measures of position are used to express the ranking of a data value.
The measures of position that will be discussed in this chapter are Percentiles and Quartiles.

Percentiles are measures of location that divide a data set into 100 groups with about 1% of
the total number of data values contained in each group. The 𝑘th percentile, denoted by 𝑃𝑘, is
the data value at the location that separates 𝑘% of the data below and (100 − 𝑘)% of the
data above.
For example, if a student’s test score is at the 90th percentile, it would be considered a very high
test score compared to the rest of the class because that test score has 90% of the scores below it
and 10% of the scores above it. In other words, that student scored better than 90% of his/her
classmates.

The formula to determine the location of the 𝑘th percentile, 𝑃𝑘, is:

$L = \dfrac{k}{100}(n)$

where 𝑘 is the percentile number and 𝑛 is the number of data values in the data set.

Quartiles are measures of location that divide a data set into four groups with about 25% of
the data values in each group. The symbols 𝑄1, 𝑄2, and 𝑄3 represent the First Quartile, Second Quartile,
and Third Quartile.

The First Quartile is the data value at the location that separates 25% of the data below
and 75% of the data above. So 𝑄1 is equivalent to 𝑃25.
The Second Quartile is the data value at the location that separates 50% of the data below
and 50% of the data above. So 𝑄2 is equivalent to 𝑃50.
The Third Quartile is the data value at the location that separates 75% of the data below
and 25% of the data above. So 𝑄3 is equivalent to 𝑃75.
Thus, the formula to determine the location of a quartile is the same as the percentile formula
above, with 𝑘 = 25, 50, or 75.

The Interquartile Range is the difference between the Third Quartile and the First
Quartile:
$IQR = Q_3 - Q_1$

Example 1

The following set of data is the final marks of students in a Statistics class.

79 88 37 91 55
67 80 65 75 77

a) Determine the mark that is the 90th percentile

The location of 𝑃90:

$L = \dfrac{90}{100}(10)$
$L = 9$

Sort the data from least to greatest:

37 55 65 67 75 77 79 80 88 91

From the above calculations, 𝑃90 has a location of 9. Because the location is a whole number,
the value of 𝑃90 is taken midway between the 9th value and the next value, the 10th value.

Counting from least to greatest, the 9th and 10th values are 88 and 91, respectively.

So the 90th percentile is $\dfrac{88 + 91}{2} = 89.5$.

Notice that 89.5 is the value separating the lower 9 values, which is 90% of the data, from the
upper 1 value, which is 10% of the data.

b) Determine the mark that is the 42nd percentile

The location of 𝑃42:

$L = \dfrac{42}{100}(10)$
$L = 4.2$
If the location is not a whole number, always round up to the next whole
number.
$L = 5$
The value of 𝑃42 will be the 5th value, which is the mark of 75.

c) Determine the mark that is the First Quartile

The location of 𝑄1:

$L = \dfrac{25}{100}(10)$
$L = 2.5$

The location is rounded up to the next whole number, 𝐿 = 3


The value of 𝑄1 is the 3rd value, which is 65.
d) Determine the Interquartile range

𝐼𝑄𝑅 = 𝑄3 − 𝑄1
𝑄1 = 65 from the above calculations.

Finding 𝑄3:

$L = \dfrac{75}{100}(10)$
$L = 7.5$

The location is rounded up to the next whole number, 𝐿 = 8


The value of 𝑄3 is the 8th value, which is 80.

So the 𝐼𝑄𝑅 = 80 − 65
𝐼𝑄𝑅 = 15

From the above example, a reminder that:

1) If 𝐿 is a whole number, then 𝑃𝑘 is midway between the 𝐿𝑡ℎ value and the next value in
the data set that is sorted from least to greatest

2) If 𝐿 is not a whole number, then 𝐿 is rounded up to the next whole number. 𝑃𝑘 will be
located at the rounded value of 𝐿.
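
The two rules above translate directly into code. The sketch below is one possible implementation of this section's location rule, not a general-purpose percentile routine (statistical software often uses different interpolation rules and may give slightly different answers). The function name `percentile` and the restriction to whole-number 𝑘 are assumptions made here.

```python
def percentile(values, k):
    """Return the k-th percentile using the location rule L = (k / 100) * n.

    k is assumed to be a whole number with 0 < k < 100.
    """
    if not values or not 0 < k < 100:
        raise ValueError("need a non-empty data set and 0 < k < 100")
    data = sorted(values)
    # Integer arithmetic avoids floating-point surprises when checking
    # whether the location L is a whole number.
    whole, remainder = divmod(k * len(data), 100)
    if remainder == 0:
        # Rule 1: L is whole, so take the midpoint of the L-th value
        # and the next value.
        return (data[whole - 1] + data[whole]) / 2
    # Rule 2: L is not whole, so round up. The rounded location is
    # whole + 1, which is index `whole` in a zero-based list.
    return data[whole]


marks = [79, 88, 37, 91, 55, 67, 80, 65, 75, 77]
print(percentile(marks, 90))  # 89.5
print(percentile(marks, 42))  # 75
print(percentile(marks, 25), percentile(marks, 75))  # 65 80
```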

Why is the Interquartile Range useful?

Because the 𝐼𝑄𝑅 is also a range, it can be used to help describe the dispersion of data. But
more importantly, the 𝐼𝑄𝑅 can be used to identify outliers.

A data value is considered to be an outlier if it is:


- above 𝑄3 by an amount greater than 1.5 × 𝐼𝑄𝑅
- Or below 𝑄1 by an amount greater than 1.5 × 𝐼𝑄𝑅

From the above example, the 𝐼𝑄𝑅 was calculated to be 15.


1.5 × 15 = 22.5
𝑄3 + 22.5 = 80 + 22.5 = 102.5
𝑄1 − 22.5 = 65 − 22.5 = 42.5

That means any value from the data set that is less than 42.5 or greater than 102.5 would be
an outlier.
The data set was
37 55 65 67 75 77 79 80 88 91

So the data value 37 is an outlier because it is less than 42.5.
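
The outlier check can be sketched as follows, taking 𝑄1 = 65 and 𝑄3 = 80 from Example 1 as given. The function name `iqr_fences` is a hypothetical helper introduced here for illustration.

```python
def iqr_fences(q1, q3):
    """Return the (lower, upper) cut-offs located 1.5 * IQR beyond Q1 and Q3."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


marks = [37, 55, 65, 67, 75, 77, 79, 80, 88, 91]
low, high = iqr_fences(65, 80)  # Q1 and Q3 from Example 1

# A value is an outlier if it lies outside the two cut-offs.
outliers = [m for m in marks if m < low or m > high]
print(low, high, outliers)  # 42.5 102.5 [37]
```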

Practice

1) Given the following data set, determine all the quartiles.


8.40 1.50 1.20 6.10 2.30
4.00 5.90 4.80 2.00 7.70

2) Referring to the data set in Question 1, determine the interquartile range.

3) Referring to the data set in Question 1, determine, if any, the outliers of the data set.

4) The following data set is the price of gasoline (cents/L) collected from various cities in
Canada in February 2016:
St. John's 89.7
Charlottetown 86.7
Halifax 86.2
Saint John 86.9
Québec 90.1
Montréal 98.1
Ottawa 85.1
Toronto 89.8
Thunder Bay 87.1
Winnipeg 74.7
Regina 72.5
Saskatoon 74
Edmonton 63.1
Calgary 73.5
Vancouver 105.9
Victoria 98.1
Whitehorse 90.8

Determine the following.

a) 𝑃60
b) 𝑄3
c) 𝑃50
d) 𝑄2
e) 𝑃85
f) Which cities have gasoline prices that are below the 20th percentile?
g) Which cities have gasoline prices that are considered to be outliers?

5) How do the median, the second quartile, and the 50th percentile compare? Explain.

Unit 3: Descriptive Methods for Bivariate Data

3.1 – The Least Squares Regression Line

Recall from Unit 1 that bivariate data are data that have two variables. For example, if
students were asked the number of hours of studying they did and the final course mark, the
data collected would have two variables: hours of studying and course mark. Often bivariate
data are collected to determine the relationship between the two variables. For example, we
might wonder whether more hours of studying relate to a higher course mark.

Bivariate data are usually displayed using a scatterplot. We consider the two variables to
have a correlation with one another when the values of one variable are associated with
the values of the other. When plotted into a scatterplot, if the plotted points display a pattern
that can be approximated by a straight line, then the two variables are considered to have a
linear correlation.

For example, below are two sets of bivariate data.

Example 1

The following is a set of data of the amount that a TV repair technician charges for his/her service.
There is a $15 flat rate of service charge and a $20 per hour labour charge.

Hours worked    Cost for labour and service charge ($)
0               15
1               35
2               55
3               75

Example 2

The following is a set of data relating the years of experience an employee has and his/her salary in
thousands of dollars.

Years of Salary in
experience Thousands
12 39
16 41
6 33
23 44
27 48
8 34
5 32
19 44
23 46
13 37
16 43
8 37

The difference between Example 1 and Example 2 is that the data values in Example 1 follow a
perfect straight line when graphed. The cost of labour was calculated from the number of hours
worked. In Example 2, there is some variation in the data values. Its scatter plot shows that
the plotted data values cluster around a straight line. Because such a pattern can be observed, we
can say that the salary of an employee has a linear correlation with the employee’s years of
experience.
A straight line can be drawn to approximate the pattern. This is called a trend line or a line of
best fit. It is drawn as close as possible to every point on the scatter plot.

Both of the above scatterplots show a positive linear correlation because as one variable
increases, the other variable also increases.
Example 1 has a stronger linear correlation than Example 2 because its points are closer
to the line of best fit.

Types of Correlation

The following six diagrams show the types of correlation there are. A and B are examples of
negative linear correlation. D and E are examples of positive linear correlation. When
comparing, A and D have a stronger linear correlation than B and E. Finally, we can say that C
and F do not follow a linear correlation as a straight line cannot be used to approximate the
trend of the data.

If the data values cluster around a horizontal line or a vertical line, then there is no
correlation between the variables.

The Line of Best Fit

The line of best fit can be used to provide a visual representation, making it more convenient
for us to decide the type of correlation the data show. However, drawing the line of best fit can
be very subjective. It is an estimation. If the scatter plot is used to predict data, then a
hand-drawn line of best fit will not give very accurate results.

The equation of a line is written in the form of


𝑦 = 𝑚𝑥 + 𝑏
where
𝑥 is known as the independent variable
𝑦 is known as the dependent variable

𝑚 is the slope, which describes the steepness of the line


𝑏 is known as the 𝑦-intercept, the point at which the graph crosses the 𝑦-axis.

Given a set of bivariate data, the line of best fit is modeled by the equation

𝑦̂ = 𝑏1 𝑥 + 𝑏0

where
𝑥 represents the data value of the variable graphed along the 𝑥-axis
𝑦̂ represents the 𝑦 values on the best fit line corresponding to the data values of the
independent variable.
𝑏1 is the slope of the line of best fit
𝑏0 is the 𝑦-intercept of the line of best fit.

If the slope of the line of best fit, 𝑏1 , is positive, then the data set has a positive linear
correlation. If 𝑏1 is negative, then the data set has a negative linear correlation.

In this scatter plot, a line of best fit is drawn. A point on the line of best fit has
coordinates (𝑥, 𝑦̂). Directly above or below that point is the actual data value plotted, which
has coordinates (𝑥, 𝑦). The vertical distance between them, 𝑦̂ − 𝑦, is known as the
error. Since the line of best fit should be drawn as close as possible to all the data points,
the line must be placed where the total of the squared errors is minimized.


Calculations show that if the slope, 𝑏1, is the following:

$b_1 = \dfrac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$

and the 𝑦-intercept, 𝑏0, is the following:

$b_0 = \bar{y} - b_1 \bar{x}$

then the error is minimized.

Thus, the line of best fit has the following equation:

$\hat{y} = b_1 x + b_0$

with

$b_1 = \dfrac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$ and $b_0 = \bar{y} - b_1 \bar{x}$

Reminder that 𝑥̅ is the mean of the 𝑥 values, and 𝑦̅ is the mean of the 𝑦 values.

Example 1

Determine the linear regression line for the following data set.
Create a scatterplot and draw the line of best fit.

𝒙 𝒚
2 16

5 15

3 15
8 21

First, calculate 𝑏1 . The following table would help.

𝒙      𝒚      𝒙𝒚      𝒙^2
2      16     32      4
5      15     75      25
3      15     45      9
8      21     168     64
Sum    ∑𝑥 = 18    ∑𝑦 = 67    ∑𝑥𝑦 = 320    ∑𝑥^2 = 102

$b_1 = \dfrac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$

$b_1 = \dfrac{(4)(320) - (18)(67)}{(4)(102) - (18)^2}$

$b_1 = \dfrac{1280 - 1206}{408 - 324}$

$b_1 = \dfrac{74}{84}$

$b_1 = 0.880952381\ldots$

$b_1 \doteq 0.88$
Next, calculate 𝑏0

$b_0 = \bar{y} - b_1 \bar{x}$

$b_0 = \dfrac{\sum y}{n} - b_1 \left(\dfrac{\sum x}{n}\right)$

$b_0 = \dfrac{67}{4} - (0.880952381)\left(\dfrac{18}{4}\right)$

$b_0 = 16.75 - 3.964285715$

$b_0 = 12.78571429$

$b_0 \doteq 12.79$

So the equation of the line of best fit is

𝑦̂ = 0.88𝑥 + 12.79

Create a scatter plot by plotting the data values.

The equation of the line of best fit shows that the 𝑦-intercept is 12.79. Plot 12.79 as the

first point on the 𝑦-axis. Find the next point by substituting another 𝑥 value to find 𝑦̂.

The two variables from this data set have a positive linear correlation with one another.
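
The slope and intercept calculated in this example can be checked with a short script. This is a minimal sketch that applies the two formulas from this section directly; the function name `least_squares_line` is an assumption made here for illustration.

```python
def least_squares_line(xs, ys):
    """Return (b1, b0) for the line y-hat = b1 * x + b0."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Slope: b1 = [n(sum xy) - (sum x)(sum y)] / [n(sum x^2) - (sum x)^2]
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: b0 = y-bar - b1 * x-bar
    b0 = sum_y / n - b1 * (sum_x / n)
    return b1, b0


xs = [2, 5, 3, 8]
ys = [16, 15, 15, 21]
b1, b0 = least_squares_line(xs, ys)
print(f"y-hat = {b1:.2f}x + {b0:.2f}")  # y-hat = 0.88x + 12.79
```

Keeping the unrounded slope when computing the intercept, as the worked example does, avoids compounding rounding error.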

Appropriateness of linear regression
Sometimes data do not follow a linear tendency and the points on the scatter plot do not
cluster about a straight line with a nonzero slope. When using the method above to determine
the Regression line in these kinds of situations, we will find a line that is not going to be very
helpful in predicting the 𝑦 values for a given 𝑥 value. To distinguish between appropriate and
inappropriate situations for using the linear regression we will later introduce the linear
correlation coefficient.

Two examples are shown below. For the suggested curve relationship, we can still find the line
of best fit, but the predictions for 𝑦 will be inaccurate.

Practice

1) Determine the linear regression line for the following sets of data.

Using the equation, state the type of linear correlation for the set of data.

a)

𝒙 𝒚
12 6
13 7
8 4
14 8
20 10

b)

𝒙 𝒚
11 9
14 10
18 8
7 14
9 11

2) The following set of data shows the amount of recycling, in kilograms, collected from
households.

Paper (kg) Plastic (kg)

2.41 0.27
7.65 1.45
9.55 2.81
8.80 2.56
6.97 1.91

a) Predict the correlation between the amount of paper recycled and the amount of
plastic recycled by households.

b) Determine the linear regression line.

c) Describe the type of linear correlation.

d) Describe the relationship between the amount of paper recycled and the amount of
plastic recycled by households.

3) The following set of data shows the weight of a vehicle (in 1000 lbs) and its fuel
efficiency (in miles/gallon).
Weight of vehicle Fuel efficiency
(1000 lbs) (miles/gallon)
3.56 32
2.70 33
3.92 25
3.25 34
4.07 25
2.46 38
3.55 27

a) Determine the linear regression line.

b) Describe the type of linear correlation.

c) Describe the relationship between the weight of a vehicle and its fuel efficiency.

3.2 – The Linear Correlation Coefficient

In the previous chapter, we discussed the different types of linear correlations. We can describe
a linear correlation by stating whether it is positive or negative. Also, we can describe it by
stating whether it is strong or weak. How strong a linear correlation is depends on how close
the data values are to the linear regression line. But, how close do they have to be in order to
be considered as a strong correlation? Using our observations of a scatter plot and the linear
regression line is very subjective.
Thus, the Linear Correlation Coefficient is used to measure the strength of the linear
correlation between the 𝑥 and 𝑦 values in a set of data.

Linear Correlation Coefficient

$r = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}$

where
𝑟 represents the linear correlation coefficient;
𝑥 and 𝑦 represent the values of the independent and the dependent variables;
𝑥̅ and 𝑦̅ represent the means of the 𝑥 and 𝑦 values, respectively;
𝑛 represents the number of pairs of values in the sample;
𝑠𝑥 and 𝑠𝑦 represent the sample standard deviations of the 𝑥 and 𝑦 values,
respectively.

An alternative formula for 𝑟 is:

$r = \dfrac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2} \times \sqrt{n(\sum y^2) - (\sum y)^2}}$

This formula tends to be more convenient because the standard deviations do not need
to be calculated.

Properties of the linear correlation coefficient

1. The value of 𝑟 is always between −1 and 1, inclusive: −1 ≤ 𝑟 ≤ 1.

2. 𝑟 = 1 means all the points from the scatter diagram are exactly on the line and the line
has a positive slope; 𝑟 = −1 means all the points from the scatter diagram are on the line
and the line has a negative slope.

3. The closer the value of 𝑟 is to 1 or −1, the stronger the linear correlation.

4. If 𝑟 is close to 0, there is little or no linear correlation between the 𝑥 and 𝑦 values.
If 𝑟 = 0, there is no linear correlation whatsoever. In this case it can be shown that the
line of best fit is a horizontal line.

Example
Determine the linear correlation coefficient for the following data set.

Hours of studying    Test score
0                    50
0                    46
1                    73
2                    72
2                    88

Using the following formula:

$r = \dfrac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2} \times \sqrt{n(\sum y^2) - (\sum y)^2}}$

First, create a table to complete the following calculations


𝒙      𝒚      𝒙𝒚      𝒙^2     𝒚^2
0      50     0       0       2500
0      46     0       0       2116
1      73     73      1       5329
2      72     144     4       5184
2      88     176     4       7744
Sum    ∑𝑥 = 5    ∑𝑦 = 329    ∑𝑥𝑦 = 393    ∑𝑥^2 = 9    ∑𝑦^2 = 22873

So

$r = \dfrac{(5)(393) - (5)(329)}{\sqrt{(5)(9) - (5)^2} \times \sqrt{(5)(22873) - (329)^2}}$

$r = \dfrac{320}{\sqrt{20} \times \sqrt{6124}}$

$r = 0.914360359\ldots$
The linear correlation coefficient is approximately 0.91.
This indicates that the relationship between hours of studying and test score is a strong,
positive linear correlation. As the hours of studying increase, the test score also tends to
increase.
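
The calculation in this example can be verified with a short script based on the alternative formula for 𝑟. This is a sketch for checking the arithmetic; the function name `correlation` is an assumption made here for illustration.

```python
import math


def correlation(xs, ys):
    """Linear correlation coefficient r, using the sums-based formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


hours = [0, 0, 1, 2, 2]
scores = [50, 46, 73, 72, 88]
r = correlation(hours, scores)
print(round(r, 2))  # 0.91
```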

Coefficient of Determination
When 𝑟 is calculated, we can say that for 𝑟 = 0.91, it is a strong linear correlation. But if 𝑟 has a
value of 0.70, it is hard to decide if the correlation is strong enough. This question led to the
introduction of a new coefficient called the coefficient of determination.
To understand the coefficient of determination, we need to introduce some new terms:

a) The explained deviation is the difference between the value of 𝑦 from the regression
line and the mean value for 𝑦:
$\text{explained deviation} = \hat{y} - \bar{y}$

b) The unexplained deviation is the difference between the value of 𝑦 and the value of 𝑦
from the regression line:

$\text{unexplained deviation} = y - \hat{y}$

The total deviation, 𝑦 − 𝑦̅, is the sum of the two:
total deviation = explained deviation + unexplained deviation


Now, instead of looking at individual 𝑦 values we will look at all values of y taken together and
we will use the variation measured by the sum of squares of deviation.

$\sum (y - \bar{y})^2 = \sum (\hat{y} - \bar{y})^2 + \sum (y - \hat{y})^2$

The above equation can be expressed in words as:


Total sum of squares = Sum of squares from regression + Sum of squares for error
𝑇𝑆𝑆 = 𝑆𝑆𝑅 + 𝑆𝑆𝐸
The total variation is a measure of variability in the 𝑦 values. The proportion of total variation
explained by the regression line is:

$\dfrac{\text{explained variation}}{\text{total variation}} = \dfrac{SSR}{TSS}$

The closer the points are to the line, the greater the proportion of explained variation.
The proportion of explained variation is the square of 𝑟:

$r^2 = \dfrac{\text{explained variation}}{\text{total variation}} = \dfrac{SSR}{TSS}$

The coefficient of determination is a better measure because it represents the proportion of
variation in the dependent variable that can be explained by the variation in the independent
variable.

From the previous example, $r \doteq 0.91$. So $r^2 \doteq (0.91)^2 = 0.8281$


This means that about 82.81% of the variation in test scores can be explained by the variation
in hours of studying.
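
The identity TSS = SSR + SSE can be checked numerically for the studying data. The sketch below fits the regression line with the formulas from Section 3.1 and then computes each sum of squares; the variable names are assumptions made here. Because it works with unrounded values, it gives a proportion of about 0.836, slightly different from the 0.8281 obtained by squaring the rounded value 0.91.

```python
hours = [0, 0, 1, 2, 2]
scores = [50, 46, 73, 72, 88]
n = len(hours)

# Fit the least squares line y-hat = b1 * x + b0.
sum_x, sum_y = sum(hours), sum(scores)
sum_xy = sum(x * y for x, y in zip(hours, scores))
sum_x2 = sum(x * x for x in hours)
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * (sum_x / n)

y_bar = sum_y / n
predicted = [b1 * x + b0 for x in hours]

# Total, explained (regression), and unexplained (error) sums of squares.
tss = sum((y - y_bar) ** 2 for y in scores)
ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(scores, predicted))

r_squared = ssr / tss
print(f"TSS = {tss:.1f}, SSR = {ssr:.1f}, SSE = {sse:.1f}")
print(f"r^2 = {r_squared:.3f}")
```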

Practice
1) Determine the linear correlation coefficient and coefficient of determination of the
following sets of data.

a)

𝒙 𝒚
11 19
15 27
16 31
20 29

b)

𝒙 𝒚
32 5
26 9
28 12
23 16
33 5

2) If the linear correlation coefficient is -0.22 between two variables, what conclusion can
be drawn regarding their relationship?

3) If the coefficient of determination is 0.67 between two variables, what conclusion can
be drawn regarding their relationship?

4) The following set of data shows the weight of a vehicle (in 1000 lbs) and its fuel
efficiency (in miles/gallon).

Weight of vehicle Fuel efficiency
(1000 lbs) (miles/gallon)
3.56 32
2.70 33
3.92 25
3.25 34
4.07 25
2.46 38
3.55 27

a) Determine linear correlation coefficient


b) Using 𝑟, describe the correlation between weight of a vehicle and its fuel efficiency.
c) Determine the linear regression line
d) Using 𝑟 2 , state the amount of variation in fuel efficiency that can be explained by
the variation in weight of vehicle.

Unit 4: Discrete Probability Distributions

4.1 – Introduction to Probability and Random Variables

Basic Concepts of Probability

Experiment – any activity that yields a result or an outcome

Sample Space – the collection of all possible distinct outcomes that can occur when an
experiment is performed

Event – a subset of the sample space

Probability – the expected proportion of occurrences of an event

Example 1

Tim Hortons, a Canadian coffee franchise, launched the campaign “Roll Up the Rim” for
customers to have a chance to win prizes with the purchase of a beverage. It is stated that “the
odds of winning a prize are approximately 1 in 6. Tim Hortons audits and monitors the odds
daily to ensure the odds remain the same.”
In this scenario, the experiment would be the activity in rolling up the rim of a coffee cup to see
whether or not you have won any prizes. The sample space in general is “winning a prize” or
“please play again”. If we consider the specific outcomes then “winning a prize” would include
“winning a coffee”, “winning a donut”, “winning a car”, etc.

Let’s consider two events in this experiment: “winning a prize” and “not winning”.
Given that the odds of winning are 1 in 6, that is, about one winning cup out of every six,
the probabilities of winning and not winning are as follows:

$P(w) = \dfrac{1}{6}$ and $P(\bar{w}) = \dfrac{5}{6}$

where 𝑤 represents the event of winning, and 𝑤̅, which is also known as a
complementary event, is the event of not winning.

Example 2

1) Determine the sample space in an experiment of rolling one die

Use set notation to denote the set of all possible outcomes:


𝑆 = {1, 2, 3, 4, 5, 6}

2) Determine the sample space in an experiment of rolling two dice where the order
of the outcome does matter

In this case, we will use (𝑎, 𝑏) to denote an outcome, where 𝑎 is the result of the first roll
and 𝑏 is the result of the second roll.

𝑆 = { (1,1) ; (1,2) ; (1,3) ; (1,4) ; (1,5) ; (1,6) ;
      (2,1) ; (2,2) ; (2,3) ; (2,4) ; (2,5) ; (2,6) ;
      (3,1) ; (3,2) ; (3,3) ; (3,4) ; (3,5) ; (3,6) ;
      (4,1) ; (4,2) ; (4,3) ; (4,4) ; (4,5) ; (4,6) ;
      (5,1) ; (5,2) ; (5,3) ; (5,4) ; (5,5) ; (5,6) ;
      (6,1) ; (6,2) ; (6,3) ; (6,4) ; (6,5) ; (6,6) }

3) Determine the probability of rolling an even number when rolling one die
Let 𝐴 represent the event of rolling an even number
$P(A) = \dfrac{3}{6} = \dfrac{1}{2}$

4) Determine the probability of rolling an even number twice in a row

$P(A) \times P(A) = \dfrac{1}{2} \times \dfrac{1}{2} = \dfrac{1}{4}$

5) Determine the probability of rolling repeating digits when rolling two dice
Let 𝐵 represent the event of rolling repeating digits
$P(B) = \dfrac{6}{36} = \dfrac{1}{6}$

6) Determine the probability of rolling no repeating digits when rolling two dice
$P(\bar{B}) = 1 - P(B)$
$= 1 - \dfrac{1}{6} = \dfrac{5}{6}$

7) Determine the probability of rolling a sum of 4 when rolling two dice

Let 𝐶 represent the event of rolling a sum of 4
𝐶 = { (1,3) ; (2,2) ; (3,1) }
$P(C) = \dfrac{3}{36} = \dfrac{1}{12}$
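
The two-dice probabilities in this example can be confirmed by listing the sample space and counting. The sketch below assumes all 36 ordered outcomes are equally likely; the helper name `probability` is an assumption made here for illustration.

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes (a, b) of rolling two dice: 36 in total.
sample_space = list(product(range(1, 7), repeat=2))


def probability(event):
    """P(event) = favourable outcomes / total outcomes (equally likely)."""
    favourable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favourable), len(sample_space))


p_even_twice = probability(lambda o: o[0] % 2 == 0 and o[1] % 2 == 0)
p_repeat = probability(lambda o: o[0] == o[1])
p_no_repeat = 1 - p_repeat  # complementary event
p_sum_4 = probability(lambda o: sum(o) == 4)

print(p_even_twice, p_repeat, p_no_repeat, p_sum_4)  # 1/4 1/6 5/6 1/12
```

Using `Fraction` keeps the answers in the same reduced form as the worked solutions.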

Random Variables

A random variable is a variable with possible values that are numerical outcomes of a random
phenomenon. A random variable is assigned to an experiment to represent all the distinct
outcomes that could occur. A random variable could be either discrete or continuous.

A discrete random variable is one that can only have distinct, countable values.

A continuous random variable is one that can take any value within an interval, so it has uncountably many possible values.

Example

1) A medical equipment company is collecting data on its sales every day between April
and May. Prior to April, a clinic had already committed to purchasing 10 units per day
from the company. Assign a random variable to represent the possible outcomes.

Let 𝑋 represent the number of units sold every day between April and May.
Because 𝑋 could be any amount greater than or equal to 10, 𝑋 ≥ 10
This random variable, 𝑋, is a discrete random variable because the values of 𝑋 are
countable values. For example, 𝑋 can be 10, 11, 12, 13… and so on.

2) A meteorologist is collecting data on the daily highest temperature between May and
June.
Assign a random variable to represent the possible outcomes.

Let 𝑋 represent the daily highest temperature between May and June.
Because 𝑋 could be any value within a range of temperatures, and there are infinitely many
such values, 𝑋 is a continuous random variable. In other words, 𝑋 could be negative or positive,
and it could be 1.1 °C, 1.2 °C, or 1.21 °C, and so on. This shows that there are infinitely many possible values.

Practice

1) In an experiment of tossing three coins, one after another:


a) State the sample space
b) What is the probability of getting 2 heads?
c) What is the probability of getting no heads?
d) What is the probability of getting the same result consecutively?
2) In an experiment of rolling three dice, one after another:
a) How many possible outcomes are there in the sample space?
b) State the outcomes that have the number 3 in the first roll and 6 in the third roll.
c) What is the probability of getting the same result consecutively?
3) In a multiple choice test, there are 10 questions.
a) Assign a random variable to represent the number of questions answered
correctly.
b) What is the random variable’s possible range of values?
c) Is the above random variable discrete or continuous? Explain
d) If for each question, there are 5 choices (a, b, c, d, e), then what is the probability
for a student to guess one question correct?
e) What is the probability for the student to guess all 10 questions correct?

4) In a chemistry experiment, the temperature change of a chemical reaction was
recorded. The chemist stated that the chemical reaction is an exothermic reaction,
meaning that the temperature will only increase. This means that the temperature
change will only be positive.
a) Assign a random variable to represent the possible temperature change that
could be recorded.
b) What is the random variable’s possible range of values?
c) Is the above random variable discrete or continuous? Explain.
d) What is the probability for the temperature change of the reaction to be −2 °C?

4.2 – Introduction to Discrete Probability Distributions

We learned from the previous chapter that a random variable can be used to represent the
possible outcomes of an experiment. If we wish to represent the probability of each of the
possible outcomes, we can use a probability distribution.

A probability distribution is a description that allows the probability of each value of the
random variable to be represented. This representation can be in the format of a table,
formula, or graph.

We will focus on representing probability distributions using a table and a histogram. Recall
that there are two types of random variables, discrete and continuous. Therefore, there are two
types of probability distributions, discrete and continuous.

A discrete probability distribution is a probability distribution that is associated with the
distinct values of a discrete random variable.

A continuous probability distribution is a probability distribution that is associated with the
values of a continuous random variable.

Example 1

Using a table and a histogram, represent the probability distribution of rolling a die.

First, create a table. Let 𝑥 be the random variable representing the outcomes from rolling a die.

𝒙    𝑷(𝒙)
1    1/6
2    1/6
3    1/6
4    1/6
5    1/6
6    1/6

Because the random variable is a discrete one, this is a discrete probability distribution.

The histogram:

[Histogram: 𝑃(𝑥) on the vertical axis, from 0 to 0.18, against 𝑥 = 1, 2, 3, 4, 5, 6 on the horizontal axis]

Observing the histogram, we can conclude that the probability distribution of rolling a
die is a uniform distribution.

Example 2

A survey was given to 100 people to determine how they rate the customer service of a
company. People completing the survey chose a satisfaction rating from 1 to 5.

The survey shows that 23% of the people rated 1, 14% rated 2, 7% rated 3, 30% rated 4, and
26% rated 5.

Using a table and a histogram, represent the probability distribution.

Let 𝑥 represent the rating values, which are {1, 2, 3, 4, 5}

With the given information, we know that the probability of finding a person who rated
1 is 23%. The remaining probabilities will be determined the same way.

𝒙 𝑷(𝒙)
1 0.23
2 0.14
3 0.07
4 0.30
5 0.26

Using the table, the histogram will be created as follows:

[Histogram: 𝑃(𝑥) on the vertical axis, from 0 to 0.35, against 𝑥 = 1, 2, 3, 4, 5 on the horizontal axis]

Observing the probability distribution from the histogram, we can conclude that the
distribution is bimodal.

Properties of the Probability Distribution

1) The sum of the probabilities of each value of the random variable is 1

∑ 𝑃(𝑥) = 1

2) The sum of the area of the bars on the probability distribution histogram is 1

3) Each value of 𝑃(𝑥) is between 0 and 1, inclusively
0 ≤ 𝑃(𝑥) ≤ 1
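
Properties 1 and 3 give a quick test of whether a table describes a probability distribution. The following is a minimal sketch of that test, applied to the survey ratings from Example 2; the function name `is_probability_distribution` and the small tolerance for floating-point rounding are assumptions made here.

```python
import math


def is_probability_distribution(probabilities, tol=1e-9):
    """Check that every P(x) is in [0, 1] and that the values sum to 1."""
    values = list(probabilities)
    in_range = all(0 <= p <= 1 for p in values)
    # Decimal probabilities are stored approximately, so compare with a
    # small tolerance instead of testing for exact equality.
    sums_to_one = math.isclose(sum(values), 1.0, abs_tol=tol)
    return in_range and sums_to_one


ratings = {1: 0.23, 2: 0.14, 3: 0.07, 4: 0.30, 5: 0.26}
print(is_probability_distribution(ratings.values()))  # True
```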

Practice

1) In an experiment, two coins are tossed, one after another. The result of interest is the
number of tails that show up in the outcome.
a) Assign a random variable to represent the number of tails in the coin toss results.
b) Create a probability distribution table for this experiment.
c) Create a histogram of the probability distribution.
d) Describe the shape of the probability distribution.

2) A student is guessing the answer to a multiple-choice question on a quiz. There are 5 choices for the
question.
a) Create a probability distribution table for this experiment.
b) Create a histogram for the probability distribution.
c) Describe the shape of the probability distribution.

3) In a college survey, students were asked what their preference was for their mode of
learning. The following table shows the results.

Mode of learning    Percentage of students responded
Face-to-face        38%
Online              25%
Blended             32%
Don’t Know          2%
No preference       3%

a) Suppose we let 𝑥 = {1, 2, 3, 4, 5} represent each of the modes of learning, create a
probability distribution table for this experiment.
b) Create a histogram for the probability distribution.
c) Describe the shape of the probability distribution.
d) Calculate the sum of 𝑃(𝑥).
4) Given the following table based on results from a survey,

𝒙 𝑷(𝒙)
1 0.60
2 0.14
3 0.35

does the table describe a probability distribution?


Explain why or why not.

4.3 – The Mean and Variance of a Probability Distribution

The mean and variance of a probability distribution describe the “average” value and the
variation of the random variable, respectively.

The mean of a probability distribution is also known as the expected value, and it is a measure
of central tendency of a probability distribution.

For a probability distribution, the mean represents the average value we expect to get if
trials of the experiment are continued indefinitely.

The variance of a probability distribution is a measure of dispersion of a probability
distribution.

Mean/Expected Value of a probability distribution

$E(x) = \mu = \sum x \cdot P(x)$

Variance of a probability distribution

$\sigma^2 = \sum (x - \mu)^2 \cdot P(x)$

An alternative formula for the variance is:

$\sigma^2 = \left[\sum x^2 P(x)\right] - \mu^2$
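
The alternative formula follows from the definition by expanding the square. A short derivation, which relies only on the two facts $\sum P(x) = 1$ and $\sum x \cdot P(x) = \mu$, is sketched here:

```latex
\begin{align*}
\sigma^2 &= \sum (x - \mu)^2 P(x) \\
         &= \sum x^2 P(x) - 2\mu \sum x P(x) + \mu^2 \sum P(x) \\
         &= \sum x^2 P(x) - 2\mu^2 + \mu^2 \\
         &= \left[\sum x^2 P(x)\right] - \mu^2
\end{align*}
```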

Example 1

Determine the mean and variance for the following probability distribution.

Two coins are tossed, one after another. The following probability distribution table shows
the probability of 𝒙, where 𝒙 is a random variable representing the number of tails.

𝒙 𝑷(𝒙)
0 0.25
1 0.50
2 0.25

Calculating the mean:

𝐸(𝑥) = ∑ 𝑥 ∙ 𝑃(𝑥)

𝐸(𝑥) = (0)(0.25) + (1)(0.50) + (2)(0.25)

𝐸(𝑥) = 0 + 0.50 + 0.50

𝐸(𝑥) = 1

The mean of the probability distribution is 𝜇 = 1. This means that the expected value of
the distribution is 1. If two coins are tossed repeatedly, the average outcome in the long run
is 1 tail.

Calculating the variance:

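\[ \sigma^2 = \sum (x - \mu)^2 \cdot P(x) \]

\( x \)    \( x - \mu \)       \( (x - \mu)^2 \)    \( P(x) \)    \( (x - \mu)^2 \cdot P(x) \)
0          \( 0 - 1 = -1 \)    1                    0.25          0.25
1          \( 1 - 1 = 0 \)     0                    0.50          0
2          \( 2 - 1 = 1 \)     1                    0.25          0.25

\[ \sum (x - \mu)^2 \cdot P(x) = 0.5 \]

\[ \sigma^2 = 0.5 \]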

The variance of the probability distribution is 0.5.

Example 2

A box has five tickets, numbered 1, 2, 3, 4, and 5. Two tickets are drawn from the box one
after another. The first ticket is returned to the box before the second ticket is drawn.

a) Create a probability distribution table to show the probability of drawing odd
numbers.
b) Determine the mean and variance of the probability distribution.

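a) Let 𝑥 represent the number of odd numbers drawn.

𝑥 = {0, 1, 2}

𝒙    𝑷(𝒙)
0    \( \frac{2}{5} \times \frac{2}{5} = \frac{4}{25} \)
1    \( \left(\frac{2}{5} \times \frac{3}{5}\right) + \left(\frac{3}{5} \times \frac{2}{5}\right) = \frac{12}{25} \)
2    \( \frac{3}{5} \times \frac{3}{5} = \frac{9}{25} \)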

b) Determine the mean.

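\[ E(x) = \sum x \cdot P(x) \]

\[ E(x) = (0)\left(\frac{4}{25}\right) + (1)\left(\frac{12}{25}\right) + (2)\left(\frac{9}{25}\right) \]

\[ E(x) = 0 + \frac{12}{25} + \frac{18}{25} \]

\[ E(x) = \frac{30}{25} \]

\[ E(x) = 1.2 \]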

The mean of the probability distribution is 𝜇 = 1.2. This means that the expected value
of the distribution is 1.2. If two tickets are drawn one after another repeatedly, the
average number of odd numbers per pair of draws approaches 1.2.

Determine the variance.

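\[ \sigma^2 = \sum (x - \mu)^2 \cdot P(x) \]

\( x \)    \( x - \mu \)           \( (x - \mu)^2 \)    \( P(x) \)             \( (x - \mu)^2 \cdot P(x) \)
0          \( 0 - 1.2 = -1.2 \)    1.44                 \( \frac{4}{25} \)     \( 1.44 \times \frac{4}{25} = 0.2304 \)
1          \( 1 - 1.2 = -0.2 \)    0.04                 \( \frac{12}{25} \)    \( 0.04 \times \frac{12}{25} = 0.0192 \)
2          \( 2 - 1.2 = 0.8 \)     0.64                 \( \frac{9}{25} \)     \( 0.64 \times \frac{9}{25} = 0.2304 \)

\[ \sum (x - \mu)^2 \cdot P(x) = 0.48 \]

\[ \sigma^2 = 0.48 \]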

The variance of the probability distribution is 0.48.
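
As a cross-check of Example 2, the distribution can also be built by listing every equally likely ordered pair of draws. This is only an illustrative Python sketch under the assumption that each ticket is equally likely on each draw; the variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product

tickets = [1, 2, 3, 4, 5]
# Drawing with replacement: all 25 ordered pairs are equally likely.
outcomes = list(product(tickets, repeat=2))

distribution = {}
for pair in outcomes:
    odd_count = sum(1 for t in pair if t % 2 == 1)
    distribution[odd_count] = distribution.get(odd_count, Fraction(0)) + Fraction(1, len(outcomes))

# Mean and variance from the definitions, kept as exact fractions
mean = sum(x * p for x, p in distribution.items())
variance = sum((x - mean) ** 2 * p for x, p in distribution.items())
print(distribution)
print(mean, variance)
```

Using exact fractions avoids rounding, so the results can be compared directly with 30/25 = 1.2 and 0.48.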

Example 3

In a Pick 3 Lottery, you can bet $10 by selecting three digits, each between 0 and 9, inclusive.
The numbers do not repeat themselves in the Lottery. If the same three numbers you picked
are drawn in the same order, you win and collect $300.

Create a probability distribution table to calculate the expected amount of winnings.

Let 𝑥 represent the amount of winnings

𝑥 = {−10, 290}

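𝒙      𝑷(𝒙)
290    \( \frac{1}{10} \times \frac{1}{9} \times \frac{1}{8} = \frac{1}{720} \)
−10    \( 1 - \left(\frac{1}{10} \times \frac{1}{9} \times \frac{1}{8}\right) = \frac{719}{720} \)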

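\[ E(x) = (290)\left(\frac{1}{720}\right) + (-10)\left(\frac{719}{720}\right) \]

\[ E(x) = \frac{290}{720} - \frac{7190}{720} \]

\[ E(x) = -\frac{6900}{720} \]

\[ E(x) = -9.58333\ldots \]

The expected winnings for playing the Pick 3 Lottery are about −$9.58. On average, a player
loses about $9.58 for every $10 bet.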

Practice

1) Determine the mean and variance of the following probability distributions


a)

𝒙 𝑷(𝒙)
1 0.52
2 0.06
3 0.42

b)

𝒙    𝑷(𝒙)
0    \( \frac{1}{2} \)
1    \( \frac{1}{3} \)
2    \( \frac{1}{12} \)
3    \( \frac{1}{12} \)

2) A die is rolled two times, one after another.


a) Create a probability distribution with a random variable that represents the number
of even numbers in the result.
b) Determine the mean of the distribution.
c) Determine the variance of the distribution.

3) A box has five tickets, each labelled, A, B, C, D, or E. Two are to be randomly selected
without replacement. Let 𝑥 represent the number of occurrences of either a ticket
labelled A or a ticket labelled B. Determine the expected value of occurrences of A or B.

4) In a card game at a casino, the player is to select a card from a deck of 52 cards. Suppose
the casino will pay $25 if the player selects a King. If the player does not select a King,
the player loses $2 to the game.
a) What is the expected value of the casino’s winning?
b) What is the expected value of the player’s winning?

4.4 – Binomial Probability Distribution

A unique and important type of probability distribution is the binomial probability distribution.
It allows calculations of probabilities from experiments with only two possible outcomes or two
categories of outcomes.

A binomial probability distribution is a probability distribution created from an experiment,
known as a binomial experiment, which has the following properties:

1. The experiment consists of n repeated trials;


2. The trials are independent; that is, the outcome on one trial does not affect the
outcome on other trials;
3. Each trial can result in only two possible outcomes or two categories of outcomes. We
call one of these outcomes a success and the other, a failure;
4. The probability of success, denoted by 𝑝, is the same on every trial.

The following is an example of an experiment that would result in a binomial probability
distribution. Consider the experiment of flipping a coin 2 times and counting the number of
times the coin lands on heads. This would be a binomial distribution because:

1. The experiment consists of repeated trials. We flip a coin 2 times.


2. The trials are independent; that is, getting heads on one trial does not affect whether
we get heads on other trials.
3. Each trial can result in just two possible outcomes - heads or tails.
4. The probability of success is constant - 0.5 on every trial.

The probabilities in such an experiment can be calculated by using the Binomial Probability
Formula:

Suppose a binomial experiment consists of 𝑛 trials and results in 𝑥 successes. If the
probability of success on an individual trial is 𝑝, then the binomial probability can be
calculated with the following formula:

\[ P(x) = \frac{n!}{x!\,(n - x)!} \, p^x q^{n - x} \]

where

𝑥 - The number of successes that result from the binomial experiment.

𝑛 - The number of trials in the binomial experiment.

𝑝 - The probability of success on an individual trial.

𝑞 - The probability of failure on an individual trial, which is 1 − 𝑝

𝑃(𝑥) − Binomial probability - the probability that an 𝑛-trial binomial experiment results in
exactly 𝑥 successes, when the probability of success on an individual trial is 𝑝.

\( \frac{n!}{(n - x)!\, x!} \) is the number of combinations of 𝑥 items selected from 𝑛 different items.
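
The Binomial Probability Formula translates directly into a few lines of code. The sketch below is illustrative only: the function name `binomial_pmf` is hypothetical, and Python's standard `math.comb(n, x)` is used for the number of combinations. It tabulates 𝑃(𝑥) for four flips of a fair coin.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(x) = n! / (x! (n - x)!) * p^x * q^(n - x), with q = 1 - p."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)

# Four flips of a fair coin, counting heads (illustrative values)
table = {x: binomial_pmf(x, 4, 0.5) for x in range(5)}
print(table)
print(sum(table.values()))  # the probabilities of a distribution add up to 1
```

Summing the table is a quick way to confirm that the probabilities form a valid probability distribution.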

Example 1

If a coin is flipped 10 times, what is the probability of having exactly 4 heads in the outcome?

Recall that flipping a coin is a binomial experiment because it meets the four criteria.
Each flip is independent of one another. There are only two possible outcomes, heads
and tails.

Because the question is asking for the probability of 4 heads, flipping heads will be
considered a “success”.

We want to know the probability of 4 successes, 𝑥 = 4

The number of trials in the experiment is 10 flips, 𝑛 = 10

The probability of success is the probability of having heads as an outcome in an
individual trial, 𝑝 = 0.5

The probability of failure is the probability of having tails as an outcome in an individual
trial, 𝑞 = 0.5

\[ P(x) = \frac{n!}{x!\,(n - x)!} \, p^x q^{n - x} \]

\[ P(4) = \frac{10!}{4!\,(10 - 4)!} \, (0.5)^4 (0.5)^{10 - 4} \]

The calculation of \( \frac{10!}{4!\,(10 - 4)!} \) will allow us to determine how many ways there are of
having 4 heads out of 10 flips. Because the order of having 4 heads does not matter,
there are \( \frac{10!}{4!\,(10 - 4)!} \) ways to arrange 4 heads out of 10 flips.

The calculation of \( (0.5)^4 \) gives the probability of flipping heads on 4 of the flips.

The calculation of \( (0.5)^{10 - 4} \) gives the probability of flipping tails on the
remaining 6 flips.

\[ P(4) = 210(0.0625)(0.015625) \]

\[ P(4) = 0.205078125 \]

\[ P(4) \approx 0.205 \]

The probability of flipping exactly 4 heads out of 10 flips is about 20.5%.

Example 2

In an experiment, a coin is flipped 4 times. Create a probability distribution table for this
experiment by considering the number of heads in the outcome.

Let 𝑥 be the random variable representing the number of heads in the outcome

𝒙 𝑷(𝒙)
0 𝑃(0)
1 𝑃(1)
2 𝑃(2)
3 𝑃(3)
4 𝑃(4)

To complete the table, we need to calculate the probabilities:

\[ P(0) = \frac{4!}{0!\,(4 - 0)!} \, (0.5)^0 (0.5)^{4 - 0} = 0.0625 \]

\[ P(1) = \frac{4!}{1!\,(4 - 1)!} \, (0.5)^1 (0.5)^{4 - 1} = 0.25 \]

\[ P(2) = \frac{4!}{2!\,(4 - 2)!} \, (0.5)^2 (0.5)^{4 - 2} = 0.375 \]

\[ P(3) = \frac{4!}{3!\,(4 - 3)!} \, (0.5)^3 (0.5)^{4 - 3} = 0.25 \]

\[ P(4) = \frac{4!}{4!\,(4 - 4)!} \, (0.5)^4 (0.5)^{4 - 4} = 0.0625 \]

[Figure: Probability Distribution for flipping a coin 4 times; histogram of 𝑃(𝑥) for 𝑥 = 0, 1, 2, 3, 4]

Example 3

In 2007, 75% of the population in the Island of Micronesia was reported to be infected with
the Zika virus.

a) If 15 individuals are selected with replacement from the population, what is the
probability of selecting exactly 3 people with the Zika virus?

This is a binomial experiment as the four criteria are satisfied.

𝑛 = 15
𝑥=3
𝑝 = 0.75
𝑞 = 0.25

\[ P(x) = \frac{n!}{x!\,(n - x)!} \, p^x q^{n - x} \]

\[ P(3) = \frac{15!}{3!\,(15 - 3)!} \, (0.75)^3 (0.25)^{15 - 3} \]

\[ P(3) = 455(0.421875)(0.0000000596) \]

\[ P(3) \approx 0.000011441 \]

The probability of selecting exactly 3 people with the Zika virus is about 0.00114%.

b) If 15 individuals are selected with replacement from the population, what is the
probability of selecting exactly 5 people without the Zika virus?

𝑛 = 15
𝑥=5
𝑝 = 0.25
𝑞 = 0.75

\[ P(x) = \frac{n!}{x!\,(n - x)!} \, p^x q^{n - x} \]

\[ P(5) = \frac{15!}{5!\,(15 - 5)!} \, (0.25)^5 (0.75)^{15 - 5} \]

\[ P(5) = 3003(0.000976562)(0.056313514) \]

\[ P(5) \approx 0.165 \]

The probability of selecting exactly 5 people without the Zika virus is about 16.5%.

c) If 15 individuals are selected with replacement from the population, what is the
probability of selecting 13 or more individuals with the Zika virus?

Selecting 13 or more individuals means that 13 people have the virus, or 14 people
have the virus, or all 15 people have the virus.

𝑃(13) + 𝑃(14) + 𝑃(15)

\[ P(13) = \frac{15!}{13!\,(15 - 13)!} \, (0.75)^{13} (0.25)^{15 - 13} = 0.155907045 \]

\[ P(14) = \frac{15!}{14!\,(15 - 14)!} \, (0.75)^{14} (0.25)^{15 - 14} = 0.066817305 \]

\[ P(15) = \frac{15!}{15!\,(15 - 15)!} \, (0.75)^{15} (0.25)^{15 - 15} = 0.013363461 \]

\[ P(13) + P(14) + P(15) \approx 0.236087811 \]

The probability of selecting 13 or more people with the Zika virus is about 23.6%.

d) If 5 individuals are selected with replacement from the population, what is the
probability that at least one is infected with the Zika virus?

Selecting at least one individual with the Zika virus means one or more individuals
have the virus.
This means that the probability of at least one individual can be calculated by

𝑃(1) + 𝑃(2) + 𝑃(3) + 𝑃(4) + 𝑃(5)

With the binomial formula, this calculation could be very inconvenient.


Observe that the complementary event of having at least one person who has the
virus is that no one has the virus.
Thus

𝑃(1) + 𝑃(2) + 𝑃(3) + 𝑃(4) + 𝑃(5) = 1 − 𝑃(0)

\[ P(x) = \frac{n!}{x!\,(n - x)!} \, p^x q^{n - x} \]

\[ P(0) = \frac{5!}{0!\,(5 - 0)!} \, (0.75)^0 (0.25)^{5 - 0} \]

\[ P(0) = 0.000976562 \]

\[ P(\text{at least one}) = 1 - 0.000976562 \]

\[ = 0.999023438 \]

Thus, the probability of having at least one person with the virus from 5 selected individuals is
99.9%.
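
Parts c) and d) both add up several binomial probabilities. The sketch below shows one way to do this in Python; the helper names `binomial_pmf` and `prob_at_least` are hypothetical. It also confirms that the complement shortcut 1 − 𝑃(0) matches the direct sum 𝑃(1) + 𝑃(2) + … + 𝑃(5).

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def prob_at_least(k, n, p):
    """P(x >= k): add the probabilities of k, k + 1, ..., n successes."""
    return sum(binomial_pmf(x, n, p) for x in range(k, n + 1))

p_13_or_more = prob_at_least(13, 15, 0.75)      # part c)
p_at_least_one = 1 - binomial_pmf(0, 5, 0.75)   # part d), complement of "none"
direct_sum = prob_at_least(1, 5, 0.75)          # same event, summed directly
print(round(p_13_or_more, 4), round(p_at_least_one, 4), round(direct_sum, 4))
```

The complement is preferred for "at least one" questions because it needs a single evaluation of the formula instead of 𝑛 of them.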

Practice

1) In a consumer report, it was indicated that 79.5% of the flights in North America arrive
on time.

a) If an experiment is done to randomly select 3 upcoming flights, create a probability
distribution table to show the probability of 𝑥, with 𝑥 representing the number of
flights arriving on time.

b) Among the upcoming 12 flights, what is the probability that exactly 2 of them will
not arrive on time?

c) Among the upcoming 12 flights, what is the probability that exactly half of them will
arrive on time?

d) Among the upcoming 5 flights, what is the probability that none of them will arrive
on time?

2) According to the Red Cross Blood Organization, the blood type most requested by
hospitals is type O. About 45% of the North American population has blood type O.

a) If 4 blood donors are selected at random, create a probability distribution histogram
to show the probability of 𝑥, where 𝑥 represents the number of donors with blood
type O.

b) If 7 blood donors are selected at random, what is the probability that exactly 3 of
them will have blood type O?

c) If 10 blood donors are selected at random, what is the probability that exactly 6 will
have other blood types?

d) If 15 blood donors are selected at random, what is the probability that at least one
will have blood type O?

3) In a clinical test of a new medication, 65.3% of the subjects treated with 5mg of the
medication experienced nausea as a side effect.

a) If 8 subjects are randomly selected and treated with 5 mg of the medication, what is
the probability that at least 6 subjects will experience nausea?

b) If 10 subjects are randomly selected and treated with 5 mg of the medication, what
is the probability that 4 or 5 subjects will experience nausea?

c) If 6 subjects are randomly selected and treated with 5 mg of the medication, what is
the probability that no one will experience the side effect?

4) In a market research survey, a brand name has 88% recognition rate.

a) If 9 consumers are randomly selected, what is the probability that all consumers
recognize the brand name?

b) If 7 consumers are randomly selected, what is the probability that at least 6 of them
recognize the brand name?

c) If 20 consumers are randomly selected, what is the probability that 50% of them do
not recognize the brand name?

4.5 – The Mean and Variance of a Binomial Probability Distribution

Recall that the central tendency and dispersion of a probability distribution can be described by
calculating the mean and variance of the probability distribution.

The following table shows a comparison between mean and variance for any probability
distribution and the same measures for the binomial probability distribution.

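Measure     Any probability distribution                        Binomial distribution
Mean        \( \mu = \sum [x \cdot P(x)] \)                     \( \mu = n \cdot p \)
Variance    \( \sigma^2 = \sum (x - \mu)^2 \cdot P(x) \) or     \( \sigma^2 = n \cdot p \cdot q \)
            \( \sigma^2 = [\sum x^2 P(x)] - \mu^2 \)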

Example

Determine the mean and variance of the probability distribution of this experiment:

A die is rolled 3 times. A random variable is used to describe the number of fours rolled
in the outcomes.

Let 𝑥 be the random variable.

𝑥 = {0, 1, 2, 3}

𝒙 𝑷(𝒙)
0 𝑃(0)
1 𝑃(1)
2 𝑃(2)
3 𝑃(3)

Calculate the probabilities for this binomial experiment:

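\[ P(0) = \frac{3!}{0!\,(3 - 0)!} \left(\frac{1}{6}\right)^0 \left(\frac{5}{6}\right)^{3 - 0} \approx 0.578703703 \]

\[ P(1) = \frac{3!}{1!\,(3 - 1)!} \left(\frac{1}{6}\right)^1 \left(\frac{5}{6}\right)^{3 - 1} \approx 0.347222222 \]

\[ P(2) = \frac{3!}{2!\,(3 - 2)!} \left(\frac{1}{6}\right)^2 \left(\frac{5}{6}\right)^{3 - 2} \approx 0.069444444 \]

\[ P(3) = \frac{3!}{3!\,(3 - 3)!} \left(\frac{1}{6}\right)^3 \left(\frac{5}{6}\right)^{3 - 3} \approx 0.00462962963 \]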

Using the probability distribution table, the mean can be calculated by:

𝜇 = ∑[𝑥 ∙ 𝑃(𝑥)]

But since this probability distribution is a binomial one, the quicker way to determine
the mean is:

\[ \mu = n \cdot p \]

\[ \mu = (3)\left(\frac{1}{6}\right) \]

\[ \mu = 0.5 \]

Both formulas will give the same result. The mean is 0.5.

Calculating variance:

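\[ \sigma^2 = n \cdot p \cdot q \]

\[ \sigma^2 = (3)\left(\frac{1}{6}\right)\left(\frac{5}{6}\right) \]

\[ \sigma^2 = 0.41\overline{6} \]

The variance is \( 0.41\overline{6} \).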
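
The claim that both sets of formulas give the same result can be verified for this example. The following Python sketch is illustrative only and its variable names are hypothetical. It computes the mean and variance from the full probability table and compares them with the shortcuts 𝑛 ∙ 𝑝 and 𝑛 ∙ 𝑝 ∙ 𝑞.

```python
from math import comb

n, p = 3, 1 / 6   # three rolls; success means rolling a four
q = 1 - p
pmf = {x: comb(n, x) * p ** x * q ** (n - x) for x in range(n + 1)}

# General formulas, using the whole probability distribution table
mean_general = sum(x * prob for x, prob in pmf.items())
var_general = sum((x - mean_general) ** 2 * prob for x, prob in pmf.items())

# Binomial shortcuts
mean_shortcut = n * p
var_shortcut = n * p * q

print(mean_general, mean_shortcut)
print(var_general, var_shortcut)
```

Tiny differences in the last decimal places, if any appear, come from floating-point rounding rather than from the formulas.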

Practice

1) A research study was conducted to determine whether vitamin C is effective in
preventing the common cold. Twenty individuals were prescribed vitamin C in the study
and were observed to see whether they caught a cold. Suppose the probability of
catching a cold is 0.6. What is the mean of the probability distribution?

2) From the above question, what is the variance of the probability distribution?

3) A meteorologist estimates that on each of the next 15 days, the probability of rain during the
day is 46%. What is the mean of the probability distribution?

4) From the above question, what is the variance of the probability distribution?

Unit 5: Continuous Probability Distributions

5.1 – Introduction to Continuous Probability Distributions

Recall that probability distributions can be considered discrete or continuous, depending on
the random variable.

A continuous probability distribution is a probability distribution that is associated with the
values of a continuous random variable.

There are different types of continuous probability distributions:

- Uniform distribution
- Exponential distribution
- Normal distribution

Properties of a Continuous Probability Distribution

1. For a continuous probability distribution, 𝑓(𝑥) ≥ 0 for all values x of the random
variable.
2. The total area under the graph of the probability distribution (under the graph of
𝑓(𝑥) = 𝑦) is always 1.
3. The probability that an observed value of 𝑥 falls between 𝑎 and 𝑏, 𝑃(𝑎 < 𝑥 < 𝑏), is the
area between 𝑎 and 𝑏 under the graph. For the regions of interest (for which we want
to calculate the probability), the probability that 𝑥 falls into the region is equal to the
area directly above the region and under the graph.
4. The probability of an exact value, 𝑃(𝑥 = 𝑎), is zero.

Example 1: Uniform Distribution

Assume that the time of birth of a New Year’s baby at a hospital will be a random time
between midnight and 2 AM. Let 𝒙 be the number of minutes after midnight that the baby
will be born.

a) Graph the probability distribution.

Because the time of birth will be a random time, we can assume that the probability of
birth at any 𝑥 minutes after midnight is the same. Thus, the graph will show a uniform
probability distribution over the 120 minutes between midnight and 2 AM. The second
property states that the area under the graph is always 1. Thus, the height of the graph is
\( f(x) = \frac{1}{120} \approx 0.0083 \).

b) Determine the probability of a baby being born between 12AM and 1:30 AM.

The probability will be the area under the graph between 𝑥 = 0 and 𝑥 = 90.
The probability can be found by determining the shaded area:

Area of the rectangle is \( (90)\left(\frac{1}{120}\right) = 0.75 \)

(Using the height rounded to 0.008 would give 0.72; keeping the fraction avoids this rounding error.)

The probability of a baby being born between 12 AM and 1:30 AM is
𝑃(0 < 𝑥 < 90) = 0.75

c) Determine the probability of a baby being born at midnight.

The probability, 𝑃(0) = 0, because for continuous variables, the probability of an exact
value is zero. This means that the probability of a baby being born at 12AM (not a tenth
of a second before or after midnight), is zero.
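
For a uniform distribution, every probability is the area of a rectangle: width times height. The Python sketch below is an illustration only, and the function name `uniform_probability` is hypothetical. It evaluates the area for the birth-time example, and it also shows that a height rounded to 0.008 gives 0.72, while the unrounded height 1/120 gives 0.75.

```python
def uniform_probability(a, b, low, high):
    """P(a < x < b) for a uniform distribution on [low, high]: width times height."""
    height = 1 / (high - low)
    left, right = max(a, low), min(b, high)   # keep the interval inside [low, high]
    return max(right - left, 0) * height

exact = uniform_probability(0, 90, 0, 120)   # born between 12 AM and 1:30 AM
rounded = 90 * 0.008                         # same width, height rounded to 0.008
single_point = uniform_probability(0, 0, 0, 120)  # an exact value has zero width
print(exact, rounded, single_point)
```

The zero result for a single point matches Property 4: a rectangle with no width has no area.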

Example 2: Exponential Distribution

The waiting time at a health clinic has an exponential probability distribution modelled by the
formula:

\[ f(x) = \frac{1}{5} e^{-\frac{1}{5} x} \]

where 𝒙 is the amount of wait-time in minutes.
The following graph represents the probability distribution, 𝒇(𝒙), where 𝒙 is between 0
minutes and 20 minutes. Determining the area under the exponential curve is beyond the scope of
this course, so the areas are labelled.

a) Determine the probability that you have to wait more than 10 minutes

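The area shown on the graph = 0.0855 + 0.0315

𝑃(10 < 𝑥 < 20) = 0.117

Because the graph stops at 20 minutes, this value leaves out the area beyond 20 minutes, so
𝑃(𝑥 > 10) is slightly larger than 0.117.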


b) Determine the probability that you have to wait between 5 and 15 minutes

The area = 0.233 + 0.0855

𝑃(5 < 𝑥 < 15) = 0.3185
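
For readers curious about where the labelled areas come from, they can be reproduced numerically. The sketch below relies on a calculus result that is outside this course and is stated here without derivation: the area under this curve between 𝑎 and 𝑏 equals \( e^{-a/5} - e^{-b/5} \). The function name `exponential_area` is hypothetical.

```python
from math import exp

def exponential_area(a, b, mean_wait=5.0):
    """Area under f(x) = (1/mean_wait) * e^(-x/mean_wait) between a and b."""
    return exp(-a / mean_wait) - exp(-b / mean_wait)

print(round(exponential_area(10, 15), 4))  # area labelled between 10 and 15 minutes
print(round(exponential_area(15, 20), 4))  # area labelled between 15 and 20 minutes
print(round(exponential_area(5, 15), 4))   # wait between 5 and 15 minutes
```

Small differences from the sums in the example arise because the labelled areas on the graph are rounded before they are added.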

Practice

1) The probability distribution of an earthquake occurrence in a particular region is
modelled using \( y = e^{-\frac{1}{4} x} \), where 𝑥 is the magnitude of the earthquake.
The graph and the areas are shown below.

a) Determine the probability of the occurrence of an earthquake with magnitude
between 2 and 6.

b) Determine the probability of the occurrence of an earthquake with magnitude less
than 4.

c) Determine the probability of the occurrence of an earthquake with magnitude
between 0 and 2, or magnitude between 6 and 7.

2) A uniform probability distribution can be used to show the probability of an error
occurring in the process of 3D printing. For a particular object, the time it takes to
3D-print it is 35 minutes.

a) Sketch the graph that represents the probability distribution.

b) Determine the probability of an error occurring between the 1st and 5th minute
during the 3D-printing process.

c) Determine the probability of an error occurring after 20 minutes during the process.

d) Determine the probability of an error occurring in the first 3.5 minutes of the
process.

e) Determine the probability of an error occurring in the second half of the process.

5.2 – Normal and Standard Normal Distribution

One of the types of a continuous distribution is the normal distribution.

A normal distribution is shaped like a bell curve. It is symmetric about the vertical line
through the center, where the mean, 𝜇, is located. If a continuous random variable has a
normal distribution, then it can be described by the following equation:

\[ y = \frac{e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}}{\sigma\sqrt{2\pi}} \]

The normal distribution is important because many naturally occurring measurements follow
such a distribution when the frequencies of the measurements are recorded. For example, heights,
weights, and test scores tend to follow a normal distribution.

Recall that mean, median, and mode are measures of central tendency. In a normal
distribution, the mean, median, and mode are the same value. Standard deviation is one of the
measures of variation in a data set. It measures, on average, how far the data values
deviate from the mean.

When the mean is 0 and the standard deviation is 1, the normal distribution is known as a
standard normal distribution.

The location of every data value can be determined by finding its position relative to the
mean, using the standard deviation as the unit of measurement. This is known as the z-score
of a data value.

The Z-Score
Consider a normal variable 𝑥 with the mean 𝜇 and standard deviation 𝜎. We define the
standard score (or z-score) by the formula:

\[ z = \frac{x - \mu}{\sigma} \]

The z-score of the mean is always a value of 0.

The z-score of the data values that are below the mean will be negative.

The z-score of the data values that are above the mean will be positive.
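
The z-score formula and its rearrangement can be sketched in code as follows. The function names `z_score` and `value_from_z` and the numbers used (a mean of 50 and a standard deviation of 10) are hypothetical and serve only as an illustration.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

def value_from_z(z, mu, sigma):
    """Rearranged formula: the data value located z standard deviations from the mean."""
    return z * sigma + mu

# Hypothetical distribution with mean 50 and standard deviation 10
print(z_score(65, 50, 10))       # above the mean, so positive
print(z_score(50, 50, 10))       # the mean itself
print(value_from_z(-2, 50, 10))  # two standard deviations below the mean
```

The second function is the same equation solved for 𝑥, which is useful when a z-score is known and the data value is wanted.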

Probability

From the previous chapter, we learned that the area between two values 𝑎 and 𝑏 under the
curve of the probability distribution represents the probability, 𝑃(𝑎 < 𝑥 < 𝑏)

The Empirical Rule states that there is approximately a fixed amount of area within each
standard deviation from the mean.

Thus, approximately 68.2% of the data is within 1 standard deviation of the mean.

Approximately 95.4% of the data is within 2 standard deviations of the mean.

Approximately 99.7% of the data is within 3 standard deviations of the mean.

Translating to the concepts of probability, one can conclude that there is a 68.2% probability of
finding a data value that is within 1 standard deviation of the mean.
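
The Empirical Rule percentages are rounded values. As an illustration only, the sketch below uses `NormalDist` from Python's standard `statistics` module to compute the areas within 1, 2, and 3 standard deviations of the mean to more decimal places (about 68.27%, 95.45%, and 99.73%).

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)
areas = {}
for k in (1, 2, 3):
    # Area between -k and +k standard deviations from the mean
    areas[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(k, round(100 * areas[k], 2))
```

The same areas apply to any normal distribution, because z-scores measure distance from the mean in units of standard deviations.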

Example 1

Studies show that the score of a mathematics entrance test for college students in Canada is
normally distributed, with a mean of 68.0% and a standard deviation of 4.3%.

a) What is the z-score of the test score 90%?

\[ z = \frac{90 - 68}{4.3} \]

\[ z = \frac{22}{4.3} \]

\[ z = 5.11627907 \]

\[ z \approx 5.12 \]

z-scores are usually rounded to 2 decimal places.
The test score, 90%, is about 5.12 standard deviations above the mean.

b) What is the z-score of the test score 55%?

\[ z = \frac{55 - 68}{4.3} \]

\[ z = \frac{-13}{4.3} \]

\[ z = -3.023255814 \]

\[ z \approx -3.02 \]

A negative z-score means that the test score, 55%, is below the mean.
The test score, 55%, is about 3.02 standard deviations below the mean.

c) What test score is located 2.5 standard deviations above the mean?

\[ 2.5 = \frac{x - 68}{4.3} \]

\[ x = (2.5)(4.3) + 68 \]

\[ x = 78.75 \]

The test score 78.75% is located 2.5 standard deviations above the mean test score.

d) Which test score outperformed 84% of all the test scores?

Using the Empirical Rule, notice that the areas below 1𝜎 sum to about 84% of the total
area (50% below the mean plus about 34% between the mean and 1𝜎). That means the test
score located at 1𝜎 is higher than about 84% of all test scores.

We need to determine the test score that has a z-score of 1.

\[ 1 = \frac{x - 68}{4.3} \]

\[ x = (1)(4.3) + 68 \]

\[ x = 72.3 \]

The test score that outperforms 84% of all the test scores is 72.3%

The Z-Score Table

From the above example, it is evident that knowing the z-score of a data value and the
Empirical Rule is very useful. Knowing the area under the curve can help us determine the
percentage of data that is below or above a specific data value. What if the area above or
below a specific data value is of interest, but that data value is located at, say, 1.34 standard
deviations above the mean? Notice that the Empirical Rule only provides areas bounded by 1𝜎, 2𝜎, 3𝜎,
and −1𝜎, −2𝜎, −3𝜎. What if the data value of interest is not located at those exact points?

The Z-score Table is helpful in those situations. It provides the estimated area under the normal
distribution at various z-score values.

Observe the Z-score table provided in the following pages:

[Z-score table: areas under the normal distribution to the left of each z-score, for negative and positive z-scores]
The z-score values, to one decimal place, run down the first column on the left of the table.
Along the top row are the values of the hundredths digit of a z-score. The body of the table
provides the areas to the left of the z-score of interest.

For example, if you wish to determine the area to the left of the z-score 1.43, begin by using the
page with positive z-scores on the first column. Find 1.4. Move along that row and stop at the
column with the heading 0.03. The value located in that cell is 0.9236. The area to the left of
the z-score 1.43 is 0.9236. This also means that 92.36% of the data is below the data value that
is 1.43 standard deviations above the mean.
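
When a printed table is not at hand, the same lookup can be done with software. This is a hedged sketch using `NormalDist` from Python's standard `statistics` module, not the method the course prescribes. Its `cdf` method returns the area to the left of a z-score, exactly as the body of the table does.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, standard deviation 1

area_left = standard_normal.cdf(1.43)   # area to the left of z = 1.43
area_right = 1 - area_left              # area to the right of z = 1.43
print(round(area_left, 4), round(area_right, 4))
```

Because the total area is 1, any area to the right of a z-score is found by subtracting the area to the left from 1.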

Example 2

Determine the area under the normal distribution for the following z-scores.

a) 𝒛 < 𝟐. 𝟎𝟏

From the z-score table, the area to the left of 𝑧 = 2.01 is 0.9778

b) 𝒛 > −𝟏. 𝟐𝟐

From the z-score table, the area to the left of 𝑧 = −1.22 is 0.1112. The area to the right
is 1 − 0.1112 = 0.8888

c) −𝟐. 𝟎𝟐 < 𝒛 < 𝟏. 𝟕𝟒

Reading the larger area first, it is the area to the left of the z-score 1.74. The area is
0.9591

The smaller area is the area to the left of the z-score -2.02. The area is 0.0217.

Therefore, the area in between is the difference between the above areas.

0.9591 − 0.0217 = 0.9374

Example 3

The probability distribution of percentage chance of precipitation in a city is a normal
distribution with a mean of 34% and a standard deviation of 7.6%.

a) Determine the probability of having a chance of precipitation that is lower than 50%.

𝑃(𝑥 < 50) = ?

Find the z-score of 50.


\[ z = \frac{50 - 34}{7.6} \]

\[ z \approx 2.11 \]

Find 𝑃(𝑧 < 2.11) using the z-score table and reading the area to the left of the z-score

𝑃(𝑧 < 2.11) = 0.9826

The probability of having a chance of precipitation that is lower than 50% is 98.26%

b) Above which value of chance of precipitation will we find 80% of probability?

Let 𝑥𝑜 represent the value we are looking for.

𝑃(𝑥 > 𝑥𝑜 ) = 0.8

First, we use the z-score table to determine the z-score that has 80% of area above it.
Recall that the table provides areas to the left of a z-score. Thus, when we read the
table, we are looking for the z-score with 20% area below it.

The closest z-score is -0.84. It has an area of 0.2005 below it.


Using the z-score, we determine 𝑥𝑜

\[ -0.84 = \frac{x_o - 34}{7.6} \]

𝑥𝑜 = (−0.84)(7.6) + 34

𝑥𝑜 = 27.616

There is an 80% probability that the chance of precipitation is above 27.616%.

Because 20% of the data is below the value 27.616, this data value is also known as the
20th percentile.
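
Reading the table "backwards" (from an area to a data value) corresponds to an inverse lookup. The sketch below is illustrative only and uses the `inv_cdf` method of `NormalDist` from Python's standard `statistics` module. Its answer differs slightly from 27.616 because the table-based solution rounds the z-score to −0.84.

```python
from statistics import NormalDist

precipitation = NormalDist(mu=34, sigma=7.6)

# 80% of the area lies above x0, so 20% of the area lies below it
x0 = precipitation.inv_cdf(0.20)
area_above = 1 - precipitation.cdf(x0)
print(round(x0, 2), round(area_above, 2))
```

The second printed value is a check that 80% of the area really does lie above the value found.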

Practice

1) Using the z-score table, determine the following:

a) 𝑃(𝑧 < −3.29)


b) 𝑃(𝑧 > 0.11)
c) 𝑃(𝑧 > −1.49)
d) 𝑃(𝑧 < 2.86)
e) 𝑃(−2.00 < 𝑧 < 2.00)
f) 𝑃(1.34 < 𝑧 < 2.50)

2) Using the z-score table, determine the following 𝑧𝑜 values:


a) 𝑃(𝑧 < 𝑧𝑜 ) = 0.9929
b) 𝑃(𝑧 < 𝑧𝑜 ) = 0.0075
c) 𝑃(𝑧 > 𝑧0 ) = 0.1685
d) 𝑃(𝑧 > 𝑧𝑜 ) = 0.2500

3) The height of babies at the age of 6 months follows a normal distribution. From a study
that collected heights from a sample of babies, it was found that the mean is 26.7 inches
and the standard deviation is 0.45 inches.

a) What is the probability of finding a baby at the age of 6 months with a height that is
less than 27 inches?

b) What is the probability of finding a baby at the age of 6 months with a height that is
greater than 26.9 inches?

c) What is the probability of finding a baby at the age of 6 months with a height that is
within 1 inch of the mean?

d) What percentage of babies can be found to be taller than 25.5 inches at the age of 6
months?

4) At a company, the amount of retirement savings employees invest per month was
recorded. The data follows a normal distribution with a mean of $360 and a standard
deviation of $22.30.

a) What is the probability of finding an employee who invests more than $400 per
month?

b) What is the probability of finding an employee who invests between $350 and $420
per month?

c) Below what amount of investment will we find 75% of the employees’ investment?

d) What amount of investment is at the 90th percentile?

5.3 – Central Limit Theorem

The Central Limit Theorem (CLT) states that if random samples are taken from a population and
the mean in each random sample is calculated, the means tend to form a normal distribution.

This phenomenon can be observed using the following example:

Event: Rolling a die
Let 𝑥 = number face up on die, 𝑥 = {1, 2, 3, 4, 5, 6}
[Figure: Probability Distribution of Rolling One Die; horizontal axis: x; vertical axis: P(x)]

Event: Rolling two dice
Let 𝑥 = average number between the two rolls, 𝑥 = {1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6}
[Figure: Probability Distribution of Average Value when Rolling Two Dice; horizontal axis: Average Value of Two Dice]

Event: Rolling three dice
Let 𝑥 = average number between the three rolls
[Figure: Probability Distribution of Average Value when Rolling Three Dice; horizontal axis: Average Value of Three Dice]

In the first case, 𝑛 = 1 because one die is rolled and only one value is recorded. Notice that the
probability distribution is not a normal distribution.

In the second event, 𝑛 = 2 since two values are recorded when we roll two dice.

Each time two dice are rolled, it is considered as a random sample. The mean is found in each
random sample. The probability of obtaining each mean is calculated. Notice that the
probability distribution is now approximating the normal distribution.

In the third event, 𝑛 = 3 since three values are rolled in each random sample. Multiple random
samples are made and the mean is found between the three values that are rolled. The
probability of obtaining each mean is calculated. Notice that the probability distribution is even
closer to the normal distribution.

This pattern will continue as 𝑛 increases.
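
The three distributions described above can be reproduced exactly by listing every possible outcome. The following Python sketch is an illustration under the assumption of fair six-sided dice; the function name `sample_mean_distribution` is hypothetical. It prints a rough text histogram for 𝑛 = 1, 2, and 3.

```python
from fractions import Fraction
from itertools import product

def sample_mean_distribution(n):
    """Exact distribution of the mean of n fair dice, as {mean: probability}."""
    dist = {}
    for rolls in product(range(1, 7), repeat=n):   # all 6^n equally likely outcomes
        mean = Fraction(sum(rolls), n)
        dist[mean] = dist.get(mean, Fraction(0)) + Fraction(1, 6 ** n)
    return dict(sorted(dist.items()))

for n in (1, 2, 3):
    print(f"n = {n}")
    for mean, prob in sample_mean_distribution(n).items():
        bar = "#" * round(float(prob) * 100)       # one '#' per percentage point
        print(f"{float(mean):5.2f} {bar}")
```

For 𝑛 = 1 every bar has the same length, while for 𝑛 = 2 and 𝑛 = 3 the bars rise toward the middle value 3.5 and fall away on both sides, which is the pattern the figures illustrate.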

The mean calculated from each random sample is known as the sample mean.

When all the sample means are used to create a frequency distribution, the distribution is
known as a sampling distribution. In the second and the third event, the probability
distributions are both sampling distributions since the sample means are recorded and
graphed.

The Central Limit Theorem

If random samples of 𝑛 observations are taken from a population (even one that is not normally
distributed) with a mean of 𝜇 and a standard deviation of 𝜎, then as 𝑛 increases, the sampling
distribution of the sample mean 𝑥̅ is approximately normally distributed, with a mean of 𝜇 and
a standard deviation of \( \frac{\sigma}{\sqrt{n}} \).

In other words,
The mean of the sampling distribution is \( \mu_{\bar{x}} = \mu \)
The standard deviation of the sampling distribution is \( \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \)

The approximation becomes more accurate as 𝑛 increases.

If the population the random samples are drawn from is a normal distribution, the sampling
distribution of 𝑥̅ will also be normal.
If the population the random samples are drawn from is a uniform distribution, the sampling
distribution of 𝑥̅ will approximate the normal distribution as soon as 𝑛 = 3
If the population the random samples are drawn from is a skewed distribution, the sampling
distribution of 𝑥̅ will approximate the normal distribution with at least 𝑛 = 30.

In general, we will use 𝑛 = 30 as a benchmark to consider a sample to be large enough.

When 𝑛 ≥ 30, the sample size is considered sufficiently large.
When 𝑛 < 30, the sample is considered a small sample.

The example below demonstrates how the Central Limit Theorem (CLT) is applied.

Example

The weight of adult males in the population is a normal distribution that has a mean of 182.9
lb and a standard deviation of 40.8 lb. When designing elevators, one important
consideration is the weight capacity. Most of the current elevators have a weight capacity of
16 passengers with a total weight of 5000 lb. This assumes that the mean weight
of one passenger is at most 312.5 lb.
What is the probability that the mean weight of 16 passengers will exceed 180 lb?

If the capacity is 16 passengers, then any 16 people on the elevator can be considered
to be a random sample of the population.

𝑛 = 16

This is a small sample because 𝑛 is below 30; however, the sample is taken from a
population that has a normal distribution, so the sampling distribution will also be a
normal distribution.

The sampling distribution will have the following mean and standard deviation:

\[ \mu_{\bar{x}} = \mu = 182.9 \]

\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{40.8}{\sqrt{16}} = 10.2 \]

Since we are trying to find the probability that the mean of 16 passengers will exceed
180 lb, we need to determine the z-score and the area to the right of 180 lbs.

𝑥̅ = 180

$$z = \frac{180 - 182.9}{10.2}$$

𝑧 = −0.28

𝑃(𝑧 > −0.28) = 1 − 0.3897

𝑃(𝑧 > −0.28) = 0.6103

The probability that the mean of 16 passengers will exceed 180 lbs is 61.03%.
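The calculation above can be checked numerically. The following is a minimal sketch using only Python's standard library; the function name `prob_sample_mean_exceeds` is made up for illustration. It assumes the sampling distribution is normal, as argued in the example.

```python
from math import sqrt
from statistics import NormalDist

def prob_sample_mean_exceeds(threshold, mu, sigma, n):
    """P(sample mean > threshold) when the sampling distribution is normal."""
    standard_error = sigma / sqrt(n)           # sigma_xbar = sigma / sqrt(n)
    sampling_dist = NormalDist(mu=mu, sigma=standard_error)
    return 1 - sampling_dist.cdf(threshold)    # area to the right of the threshold

# Elevator example: mu = 182.9 lb, sigma = 40.8 lb, n = 16 passengers
p = prob_sample_mean_exceeds(180, mu=182.9, sigma=40.8, n=16)
print(f"P(mean > 180) is about {p:.2f}")
```

The result is close to 0.61. It can differ from the table answer in the third decimal place because the table method rounds the z-score to −0.28 before looking up the area.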

Practice

1) The population of quarters has a mean of 5.67 g, but the weights of individual quarters
vary. The distribution of its weight has a standard deviation of 0.1 g. Random samples of
32 quarters are taken.

a) What is the shape of the sampling distribution?

b) What is the probability that a random sample will have a mean greater than 5.68g?

c) What is the probability that a random sample will have a mean less than 5.65g?

2) Let 𝑥 represent the scores of a mathematics achievement test given to Grade 8
students. Suppose that for the population of all test scores, the mean is 76.2 and the
standard deviation is 15.2. If a random sample of 100 values of 𝑥 is to be obtained, find
the following probabilities:

a) The sample mean will be greater than 77

b) The sample mean will be between 75 and 80

c) The sample mean will be less than 74.5

3) The mean wage of the employees of a company is $31,450 and the standard deviation is
$980.

a) What is the probability that the mean wage of 35 randomly selected workers will
exceed $32,000?

b) What is the probability that the mean wage of 50 randomly selected workers will be
less than $31,100?

c) If 100 workers are selected in a random sample, what is the wage that exceeds 80% of
the sample means?

4) The normal daily human potassium requirement is in the range of 2000 to 6000 mg. The
potassium content in a glass of orange juice is normally distributed with a mean of 470
mg and a standard deviation of 22 mg.

a) If a person drinks 3 glasses of orange juice in a day, what is the shape of the
sampling distribution? Explain.

b) If a person drinks 3 glasses of orange juice in a day, what is the probability that the
person will have an amount of potassium that is greater than 500 mg?

c) If the potassium levels of 10 glasses of orange juice were measured to collect
nutritional information, what is the probability that the mean amount of potassium
will be between 450 and 550 mg?

Unit 6: Statistical Inference
6.1 – Estimating Population Mean for Large Samples

Making a statistical inference means estimating and forming judgements about the parameters of
a population based on a sample.

There are two types of statistical inference:

1) Estimating a population parameter

2) Hypothesis testing

Estimating a population parameter includes estimating a population mean or proportion.

For example, what is the average tuition that Ontario college students pay per year? The
average tuition of Ontario college students is the population mean being estimated. To make
this estimate, a survey could be conducted on a sample of college students.

Another example would be: what proportion of all Ontario college students rely on
taking the bus to school? Again, a survey could be conducted to take a sample of a group of
college students.

Before discussing the process of making an estimation of a population parameter, there are a
few definitions:

Point Estimate – a single value used to estimate a population parameter.

Confidence Interval – a range (interval) of values, associated with a specific confidence
level, used to estimate the true value of a population parameter.

Confidence Level – the probability, 1 − 𝛼, that the confidence interval actually contains the
population parameter.

E.g. A confidence level of 95% has an 𝛼 value of 5%. For an estimate made at a 95%
confidence level, we are 95% confident that the associated confidence interval
contains the true value of the population parameter.

Confidence Intervals
The most common confidence levels are 90%, 95%, and 99%.

A confidence level communicates how confident we are in our estimate. For example, if
the confidence level is 90%, then we are 90% confident that the interval we construct
contains the true value of the population parameter.
Every confidence level has a critical value, 𝑧𝛼/2 , associated with it. The critical value is the
𝑧-score that separates an area of 𝛼/2 in each tail of the standard normal distribution
from the central area of 1 − 𝛼.
Using the standard normal distribution, we can determine the critical values for the
following common confidence levels.

Confidence Level    𝜶       Critical Value 𝒛𝜶/𝟐

90%                 0.10    1.645
95%                 0.05    1.96
99%                 0.01    2.575
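The critical values in the table can be reproduced from the standard normal distribution. This is a small sketch using Python's standard library; the helper name `z_critical` is an illustrative choice, not part of the course material.

```python
from statistics import NormalDist

def z_critical(confidence_level):
    """z-score that leaves an area of alpha/2 in each tail of the standard normal."""
    alpha = 1 - confidence_level
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_critical(level):.3f}")
```

The computed 99% value is approximately 2.576; the table value 2.575 is a common textbook convention that sits halfway between the table entries 2.57 and 2.58.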

Finding Margin of Error using Confidence Intervals
The Margin of Error, 𝐸, is the maximum likely difference, at the chosen confidence level,
between the sample statistic and the population parameter.
When estimating a population mean, the margin of error is calculated using

$$E = z_{\alpha/2} \frac{s}{\sqrt{n}}$$

Constructing a Confidence Interval


1) Determine the critical value associated with the confidence level, 𝑧𝛼/2
2) Calculate the Margin of Error, 𝐸
3) Using the value of 𝐸, calculate the lower confidence limit 𝑥̅ − 𝐸 and the upper
confidence limit 𝑥̅ + 𝐸, which will define the confidence interval

The estimated population mean: 𝜇 = 𝑥̅ ± 𝐸

$$\mu = \bar{x} \pm z_{\alpha/2} \frac{s}{\sqrt{n}}$$

Example 1

A sample of 40 values was taken and it was found that the sample mean is 𝟏𝟔 and the sample
standard deviation is 2.5. Estimate the population mean at a 95% confidence level.

The critical value associated with a 95% confidence level is $z_{0.05/2} = 1.96$

$$\mu = \bar{x} \pm z_{\alpha/2} \frac{s}{\sqrt{n}}$$

$$\mu = 16 \pm 1.96 \cdot \frac{2.5}{\sqrt{40}}$$

𝜇 = 16 ± 0.774758026

The lower confidence limit is 16 − 0.774758026 =̇ 15.23

The upper confidence limit is 16 + 0.774758026 =̇ 16.77

We are 95% confident that the estimated population mean is between 15.23 and 16.77.
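The three steps for constructing a confidence interval can be sketched in a few lines of Python. The function name `z_confidence_interval` is hypothetical, and the critical value is passed in as read from the table of common confidence levels.

```python
from math import sqrt

def z_confidence_interval(x_bar, s, n, z_crit):
    """Return (lower, upper) confidence limits for a population mean, large sample."""
    margin = z_crit * s / sqrt(n)       # E = z_(alpha/2) * s / sqrt(n)
    return x_bar - margin, x_bar + margin

# Example 1: n = 40, sample mean 16, sample standard deviation 2.5, 95% level
lower, upper = z_confidence_interval(x_bar=16, s=2.5, n=40, z_crit=1.96)
print(f"{lower:.2f} to {upper:.2f}")    # 15.23 to 16.77
```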
Example 2

In a quality control department of a meat packaging company, a random sample of 35
packages gave a mean of 0.458 kg and a standard deviation of 0.082 kg. Construct a 99%
confidence interval for the mean weight of all packages.

The critical value associated with a 99% confidence level is $z_{0.01/2} = 2.575$

$$\mu = \bar{x} \pm z_{\alpha/2} \frac{s}{\sqrt{n}}$$

$$\mu = 0.458 \pm 2.575 \cdot \frac{0.082}{\sqrt{35}}$$

𝜇 = 0.458 ± 0.035690864

The lower confidence limit is 0.458 − 0.035690864 =̇ 0.42231

The upper confidence limit is 0.458 + 0.035690864 =̇ 0.49369

We are 99% confident that the population mean is between 0.42231 kg and
0.49369 kg.

Practice

1) A sample of 100 values was taken and it was found that the sample mean is 56 and the
sample standard deviation is 11.5. Estimate the population mean at a 99% confidence
level.

2) The level of acidity in a liquid is measured using pH. Water samples taken from 58
rainfalls were analyzed and found to have a mean pH of 3.9 and a standard deviation of 0.6.
Estimate the mean pH of all rainfalls using a 90% confidence level.

3) A survey was conducted in a random sample of 500 people in Toronto and it was found
that the mean amount of time spent commuting to work was 2.4 hours per day and the
standard deviation was 0.5 hours. Using a 95% confidence level, estimate the mean
amount of time spent commuting to work for all Torontonians.

4) A machine is designed to produce rubber gaskets with a mean thickness of 0.125 inches
and a standard deviation of 0.01 inches. This is a normal distribution. Over time, usage
of the machine changes the mean of the thickness in the gaskets produced by the
machine. How many gaskets should be measured to be able to estimate the new mean
with a maximum error of the estimate as 0.001 inches and a 90% level of confidence?

6.2 – Estimating Population Mean for Small Samples

Recall the following:

1) 𝑛 ≥ 30 is considered to be a large sample size.

2) The Central Limit Theorem can be used if:

𝑛 ≥ 30
or
the population distribution is a normal distribution

When a sample is small, the sampling distribution will approximate the normal distribution but
will not necessarily be an accurate bell-shaped curve. Because of this, 𝑧-scores are not
very appropriate for the calculations.

Student’s t distribution
The Student’s t distribution is a distribution published by W. S. Gosset in 1908. He worked at
the Guinness brewery at the time and used statistics in his work on the brewing process. He
published his work under the pseudonym “Student”. When a small sample (𝑛 < 30) is taken
from a population that is normally distributed, the sampling distribution tends to follow a
𝑡-distribution.

Notice it is very similar to the Standard Normal distribution. There are different 𝑡-distributions
because for a different sample size, the shape of the curve is slightly different.

Properties of the 𝒕-distribution:
1) It has a bell-shaped curve that is symmetrical about 𝑡 = 0
2) When 𝑛 is infinitely large, the 𝑡 and the normal distributions are identical
3) The shape of the 𝑡-distribution is determined by the degrees of freedom, 𝑑𝑓 = 𝑛 − 1

To determine the area under the curve at various 𝑡-values, the 𝑡-distribution table is used.

Observe the 𝑡-distribution table provided on the next page:

The values along the first column on the left are the Degrees of Freedom. The values along the
top row are the areas to the right of a specific 𝑡 value.
For example, for a sampling distribution when the sample size is 11, 𝑑𝑓 = 10. On the curve
with 𝑑𝑓 = 10, the 𝑡 value 0.700 has an area of 0.25 to its right.

Example 1

a) Determine the 𝒕 value such that the area under the curve to the right of the value is
0.15 for a Student’s 𝒕 distribution with 𝒅𝒇 = 𝟐𝟑

At the 23rd row and the column with the heading 0.15, the 𝑡 value reads 1.060.

b) Determine the 𝒕 value such that the area under the curve to the left of the value is
0.90 for a Student’s 𝒕 distribution with 𝒏 = 𝟏𝟐

If 0.90 is the area to the left of a 𝑡 value, then the area to the right of it is 0.10.
If 𝑛 = 12, then 𝑑𝑓 = 11
At the row where 𝑑𝑓 = 11 and the column with the heading 0.10, the 𝑡 value reads
1.363

c) Determine the 𝒕 value such that the area under the curve centered around the mean is
95% for a Student’s 𝒕 distribution with 𝒏 = 𝟏𝟓

If 95% is the area centered around the mean, then the remaining areas divided equally
into each of the tails is 2.5%.
In this case, 𝑑𝑓 = 14
At the row where 𝑑𝑓 = 14 and the column with the heading 0.025, the 𝑡 value reads
2.145.
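The table lookups in this example can also be approximated numerically. Python's standard library has no built-in Student's 𝑡 distribution, so the sketch below integrates the 𝑡 density with Simpson's rule and searches for the 𝑡 value by bisection. All function names are illustrative, and the approach is an assumption of this sketch rather than the method used to build the printed table.

```python
from math import gamma, pi, sqrt

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)

def right_tail_area(t, df, steps=2000):
    """Area to the right of t (t >= 0), via Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # By symmetry, half of the total area lies to the right of t = 0.
    return 0.5 - total * h / 3

def t_critical(area_right, df):
    """t value with the given right-tail area (0 < area_right < 0.5), by bisection."""
    low, high = 0.0, 100.0
    for _ in range(60):
        mid = (low + high) / 2
        if right_tail_area(mid, df) > area_right:
            low = mid       # too much area to the right, so move right
        else:
            high = mid
    return (low + high) / 2

print(round(t_critical(0.15, 23), 3))    # part a): df = 23
print(round(t_critical(0.10, 11), 3))    # part b): n = 12, area to the left 0.90
print(round(t_critical(0.025, 14), 3))   # part c): n = 15, central area 95%
```

The printed values should agree with the table readings 1.060, 1.363 and 2.145 to about three decimal places.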

Estimating a Population Mean

Depending on the sample size, the method of estimating a population mean would either be
using 𝑧-scores or 𝑡-values:

Conditions      Method                                  Calculating Margin of Error

When 𝑛 ≥ 30     Use normal distribution and 𝑧 scores    $E = z_{\alpha/2} \dfrac{\sigma}{\sqrt{n}}$

When 𝑛 < 30     Use Student 𝑡-distribution              $E = t_{\alpha/2} \dfrac{s}{\sqrt{n}}$

Depending on the sample size, the critical values associated with the following common
confidence levels would either be 𝑧𝛼/2 or 𝑡𝛼/2 :

Confidence Level    𝜶       Critical Value 𝒛𝜶/𝟐    Critical Value 𝒕𝜶/𝟐

90%                 0.10    1.645                  read from the 𝑡 table
95%                 0.05    1.960                  using 𝑑𝑓 = 𝑛 − 1
99%                 0.01    2.575

For the 𝑡-distribution, $t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$ with $df = n - 1$.

Example 2

The mean amount of sodium in a sample of 20 bottles of ketchup is 142 mg with a standard
deviation of 2.3 mg. The sample distribution is a normal distribution. Construct a 99%
confidence interval estimate of the mean amount of sodium in ketchup.

The sample size is 20, which means that the 𝑡-distribution is more appropriate for the
estimation of the population mean.

The degree of freedom, 𝑑𝑓 = 19

The critical value associated with a 99% confidence level for 𝑑𝑓 = 19 is $t_{0.01/2} = 2.861$

To calculate the Margin of Error to determine the upper and lower confidence limits, we
use 𝑡𝛼/2 , which is $t_{0.01/2} = 2.861$ in this case.

$$\mu = \bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}}$$

$$\mu = 142 \pm 2.861 \cdot \frac{2.3}{\sqrt{20}}$$

𝜇 = 142 ± 1.471399811

The lower confidence limit is 142 − 1.471399811 =̇ 140.53

The upper confidence limit is 142 + 1.471399811 =̇ 143.47

We are 99% confident that the estimated mean amount of sodium in all ketchup is
between 140.53 mg and 143.47 mg.
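As a quick check of this example, the sketch below first applies the sample-size rule (𝑧 for 𝑛 ≥ 30, otherwise 𝑡) and then computes the margin of error. The helper names are made up for illustration, and the critical value 2.861 is taken from the 𝑡 table rather than computed.

```python
from math import sqrt

def distribution_for(n):
    """Rule used in this section: z for n >= 30, otherwise Student's t."""
    return "z" if n >= 30 else "t"

def margin_of_error(s, n, critical_value):
    """E = critical value * s / sqrt(n)."""
    return critical_value * s / sqrt(n)

n = 20
print(distribution_for(n), "df =", n - 1)         # t df = 19
E = margin_of_error(s=2.3, n=n, critical_value=2.861)
print(f"{142 - E:.2f} mg to {142 + E:.2f} mg")    # 140.53 mg to 143.47 mg
```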

Practice
1) The mean number of hours of tutoring services used per week by college students was
found to be 3.8 in a sample of 24 students. The standard deviation from the sample is
0.31 hours. Construct a 99% confidence interval estimate of the mean number of hours
per week used by all college students.

2) In a test of weight loss programs, 14 adults were sampled. After 6 months, their
mean weight loss was found to be 22.4 lb with a standard deviation of 6.1
lb. Construct a 95% confidence interval estimate of the mean weight loss for all adults.

3) In a chemistry lab, experiments were performed to determine the impurities in
materials. In a sample of 27 solutions, the mean amount of calcium that precipitated
was found to be 0.137 moles with a standard deviation of 0.0061 moles (moles is a unit
to measure the amount of chemical substance). Construct a 90% confidence interval
estimate of the mean amount of calcium that would precipitate from this solution.

4) The amount of time it takes a customer to complete an order during online shopping
was measured. When sampling 18 customers, it was found that the mean amount of
time was 13.4 minutes and the standard deviation is 2.5 minutes.

a) Using a 95% confidence level, estimate the amount of time it takes a customer to
complete an order during online shopping in general.

b) Using a 90% confidence level, estimate the amount of time it takes a customer to
complete an order during online shopping in general.

c) Compare the confidence intervals in the above questions. Identify the interval that is
wider. Explain why that interval is wider.

d) If the company decided that they would need to improve their website when there
is a chance it will take customers longer than 15 minutes to complete an order
during online shopping, based on what was calculated, would the company need to
make improvements?

6.3 – Estimating Population Proportion for Large and Small Samples

Another population parameter that can be estimated is population proportion. Making this
estimation could be useful in situations where the following questions arise:

- What proportion of Ontario voters will vote for the Liberal Party?
- What proportion of grapes in the vineyard will survive through the winter?
- What proportion of college students will graduate this semester?

Constructing a Confidence Interval


Confidence intervals are constructed using the following:
- Sample proportion: 𝑝̂
- Margin of Error: 𝐸
- The critical value associated with the confidence level: 𝑧𝛼/2 or 𝑡𝛼/2

1) Determine the critical value associated with the confidence level, 𝑧𝛼/2 or 𝑡𝛼/2
2) Calculate the Margin of Error, 𝐸

$$E = z_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}} \quad \text{or} \quad E = t_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}}$$

where 𝑞̂ = 1 − 𝑝̂

3) Using the value of 𝐸, calculate the lower confidence limit 𝑝̂ − 𝐸 and the upper
confidence limit 𝑝̂ + 𝐸, which will define the confidence interval

The estimated population proportion: 𝑝 = 𝑝̂ ± 𝐸

$$p = \hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}} \quad \text{or} \quad p = \hat{p} \pm t_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}}$$

Example 1

The following is a headline from the Toronto Star in November, 2013.

Rob Ford: Mayor’s support at 42 per cent, poll finds


A new poll from Forum Research puts Mayor Rob Ford’s support at 42 per cent.

The automated phone survey of 1,049 Toronto residents was done Wednesday night…
A new poll from Forum Research puts the mayor-in-name-only holding steady at 42
per cent, which is down just slightly from two weeks ago after Ford admitted to
smoking crack cocaine.

In the above example, the sample size is 1049. The sample proportion is 42% or 0.42.
In other words, 𝑝̂ = 0.42 and 𝑞̂ = 0.58
The sample size is greater than 30, thus, the critical value to be used is
$z_{\alpha/2} = z_{0.10/2} = 1.645$ for a 90% confidence level.
If the City of Toronto would like to estimate the population proportion to determine the
proportion of all Toronto residents who are supporting Ford at a confidence level of
90%, then the Margin of Error is

$$E = z_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}}$$

$$E = 1.645 \sqrt{\frac{(0.42)(0.58)}{1049}}$$

𝐸 = 0.025067833
This means that the interval is defined by the following lower and upper confidence
limits:
𝑝̂ − 𝐸 = 0.42 − 0.025067833 =̇ 0.395
𝑝̂ + 𝐸 = 0.42 + 0.025067833 =̇ 0.445
In conclusion, the city can state that they are 90% confident that the proportion of all
Toronto residents who support Ford is between 39.5% and 44.5%.
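The same arithmetic can be written as a short Python sketch. The function name `proportion_interval` is hypothetical, and the critical value 1.645 is supplied from the table of common confidence levels.

```python
from math import sqrt

def proportion_interval(p_hat, n, critical_value):
    """Return (lower, upper, margin) for a population proportion estimate."""
    q_hat = 1 - p_hat
    margin = critical_value * sqrt(p_hat * q_hat / n)   # E = z * sqrt(p_hat*q_hat/n)
    return p_hat - margin, p_hat + margin, margin

# Poll example: 42% support among 1,049 residents, 90% confidence level
lower, upper, margin = proportion_interval(p_hat=0.42, n=1049, critical_value=1.645)
print(f"E = {margin:.4f}, interval = {lower:.3f} to {upper:.3f}")
# E = 0.0251, interval = 0.395 to 0.445
```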

Example 2
For quality control in the meat department, 25 samples of beef packages were taken and it
was found that 3 were contaminated. Using a 95% confidence level, determine the
proportion of all beef packages that might be contaminated.
The sample size is 25. So, $t_{0.05/2}$ will be used for a 95% confidence level.

The degrees of freedom, 𝑑𝑓 = 24

Using the 𝑡 distribution table, $t_{0.05/2} = 2.064$

The sample proportion, $\hat{p} = \dfrac{3}{25} = 0.12$

So, 𝑞̂ = 0.88

$$E = t_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}}$$

$$E = 2.064 \sqrt{\frac{(0.12)(0.88)}{25}}$$

𝐸 = 0.134144122
This means that the interval is defined by the following lower and upper confidence
limits:
𝑝̂ − 𝐸 = 0.12 − 0.134144122 =̇ − 0.014
𝑝̂ + 𝐸 = 0.12 + 0.134144122 =̇ 0.254
It is not possible for −1.4% of the beef packages to be contaminated. Thus, the
lower confidence limit will be taken as 0%.
In conclusion, we are 95% confident that the proportion of all beef packages that are
contaminated is between 0% and 25.4%.
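Because a proportion must lie between 0 and 1, the limits can be clipped to that range after the margin of error is applied. The sketch below illustrates this with a hypothetical helper, `clamped_proportion_interval`, using the 𝑡 table value 2.064 from the example.

```python
from math import sqrt

def clamped_proportion_interval(successes, n, critical_value):
    """Confidence limits for a proportion, clipped to the valid range [0, 1]."""
    p_hat = successes / n
    margin = critical_value * sqrt(p_hat * (1 - p_hat) / n)
    lower = max(0.0, p_hat - margin)   # a proportion cannot be negative
    upper = min(1.0, p_hat + margin)   # and cannot exceed 100%
    return lower, upper

# Beef example: 3 contaminated packages out of 25, t = 2.064 (df = 24)
lower, upper = clamped_proportion_interval(successes=3, n=25, critical_value=2.064)
print(f"{lower:.1%} to {upper:.1%}")   # 0.0% to 25.4%
```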

Practice

1) In a recent study conducted in May, 2016, it was reported that 85% of the 900 call
centre workers sampled are working from home. Using a 90% confidence level, estimate
the proportion of all call centre workers in Canada that are working from home.

2) From a sample of 20 youths, 14 stated that they will be voting in the Federal
Elections. Using a 95% confidence level, determine the confidence interval for the
proportion of all youths who will vote in the Federal Elections.

3) In a poll, respondents were asked whether they felt vulnerable to identity theft. Of the
981 surveyed, 505 stated “yes”. Using a 99% confidence level, determine the confidence
interval for the population proportion that feels vulnerable to identity theft.

4) In a small city, the number of car accidents recorded in the last 6 months was 28. Of
those accidents, 12 of them resulted in injuries. Construct a 90% confidence interval to
estimate the proportion of all accidents that do not result in injuries.

6.4 – Hypothesis Testing of Population Mean of a Large Sample

The second type of statistical inference is hypothesis testing.

Hypothesis testing refers to the process of using statistics to decide whether to accept or reject
hypotheses that are made. A hypothesis is a claim or statement about a property pertaining to
a population.

The following are examples of situations in which hypothesis testing would be useful:

1) A pilot project is conducted to determine which software program is more suitable for
students. The hypotheses might be:
o Software A is better
o Software B is better
2) Which pet food brand is more suitable for senior dogs?
The hypotheses might be:
o Pet food brand A is better
o Pet food brand B is better
3) Is a medication effective in lowering cholesterol for women?
The hypotheses might be:
o The medication is effective because cholesterol is found to be lowered
o The medication is not effective because cholesterol is not found to be lowered

Types of Hypotheses

There are 2 types of hypotheses.

1) The null hypothesis is a statement that the value of a population parameter is equal to
some claimed value. It is denoted by the symbol 𝐻0
2) The alternative hypothesis is a statement that the parameter has a value that differs
from the null hypothesis. It is denoted by the symbol 𝐻𝐴

Method of Hypothesis Testing

There are 6 main steps in the method of hypothesis testing:

1) Hypotheses

a) Identify the claim to be tested

b) Identify the Null Hypothesis and the Alternative Hypothesis
c) Express the statements using symbolic form

2) Identify the Significance Level

A significance level is represented by 𝛼. It is the probability of rejecting the null
hypothesis when it is actually true, and it equals the total area of the region where the
null hypothesis is rejected.
Common values of 𝛼 are 0.05, 0.01, and 0.10.

3) Sampling Distribution

Identify the sampling distribution relevant to the hypothesis testing.


When the sample size is sufficiently large, 𝑛 ≥ 30, the normal distribution is used.
When the sample size is small, 𝑛 < 30, the 𝑡 distribution is used.

4) Test Statistic

The test statistic is a value used in making a decision about the null hypothesis.
Calculate the test statistic using the relevant formula:

Sampling Distribution    Test Statistic

Normal (𝑧)               $z = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$

𝑡-distribution           $t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$

5) Critical Regions

Depending on the hypothesis testing being performed, there are three types of critical
regions
- Two-tailed test
- Left-tailed test
- Right-tailed test

a) Identify the correct type of test.
b) Identify the rejection and acceptance regions using 𝛼
c) Draw the graph and identify the critical value that separates the regions

6) Conclusion

Making a conclusion involves:

a) Identifying whether the test statistic is located in the rejection or acceptance region.
b) Making a decision to either reject the null hypothesis or fail to reject the null
hypothesis

Errors in Hypothesis Testing


During the process of hypothesis testing, we must be aware of the following possible errors:

1) Type I error: the mistake of rejecting the null hypothesis when it is actually true

2) Type II error: The mistake of failing to reject the null hypothesis when it is actually false

Decision                                  When the null hypothesis    When the null hypothesis
                                          is true                     is false

We decide to reject the null hypothesis   Type I Error                Correct decision

We fail to reject the null hypothesis     Correct decision            Type II Error
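To see what a Type I error looks like in practice, the sketch below simulates many samples from a population in which the null hypothesis is true and counts how often a two-tailed 𝑧 test rejects it anyway. The population values (mean 11, standard deviation 0.6), the sample size, the number of trials and the random seed are all illustrative assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def type_one_error_rate(mu_0, sigma, n, alpha, trials, seed=0):
    """Fraction of samples that reject a TRUE null hypothesis (two-tailed z test)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population whose mean really is mu_0.
        sample = [rng.gauss(mu_0, sigma) for _ in range(n)]
        z = (mean(sample) - mu_0) / (stdev(sample) / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1            # rejecting here is a Type I error
    return rejections / trials

rate = type_one_error_rate(mu_0=11, sigma=0.6, n=40, alpha=0.10, trials=2000)
print(f"Observed Type I error rate: {rate:.3f}")
```

The observed rate should land near the significance level 𝛼 = 0.10, which is why 𝛼 is described as the probability of a Type I error.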

Example 1

A food manufacturing company produces fruit jams with 11 g of sugar in each jar. The Health
& Nutrition Department took a sample of 40 jars of jam to ensure that the amount of sugar in
the jam matches the amount stated on the nutritional label. The sample mean amount of
sugar was recorded to be 11.4 g with a standard deviation of 0.6 g. Is there enough evidence
to conclude that there is a mismatch between the nutritional information and the product?
Perform the hypothesis test at a 10% level of significance.

Following the 6 steps in the method of hypothesis testing:


1) Hypotheses

𝐻0 : There is no mismatch between the nutritional information and the product, 𝜇 = 11


𝐻𝐴 : There is a mismatch between the nutritional information and the product, 𝜇 ≠ 11

Notice that the null hypothesis is written to state that the mean amount of sugar of all
jams is equal to the claimed value, 11 g
The alternative hypothesis is a statement that differs from the null hypothesis. If there is
a mismatch between the nutritional information and the product, then the mean would
deviate from 11 g significantly.

2) Identify the significance level

The hypothesis testing is performed at a 10% level of significance.


Thus, 𝛼 = 0.10

3) Sampling distribution
The sample size is 40 jars of jam, which is considered as a large sample. Therefore, the
sampling distribution will be a normal distribution.
Using the following table, we could record information about the population and the
sampling distribution:

Population Sample
𝜇 = 11 𝑛 = 40
𝑥̅ = 11.4
𝑠 = 0.6

4) Test Statistic

Since the sampling distribution is a normal distribution, the test statistic will be
calculated with

$$z = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

$$z = \frac{11.4 - 11}{0.6/\sqrt{40}}$$

$$z = \frac{0.4}{0.094868329}$$

𝑧 =̇ 4.22

5) Critical regions

Because the alternative hypothesis is that 𝜇 ≠ 11, we are testing whether 𝑥̅ is
significantly different from the value of 11. This means that we are testing whether 𝑥̅
deviates so far from the value of 11 that there would be a very low probability for that
deviation to occur if the null hypothesis were true.
This results in a two-tailed test, where the two tails will be the critical regions.
In Step 2, 𝛼 = 0.10
Thus, in a two-tailed test, the area of each tail will be 𝛼/2 = 0.05

The shaded region is known as the acceptance region and the non-shaded region is
known as the rejection region.

Each tail of the rejection region has an area of 5%, for a total of 𝛼 = 0.10
To locate whether the test statistic is in the acceptance or rejection region, determine
the critical value that separates the regions.
The z-score −1.645 separates the lower 5% from the remaining area, and the z-score
1.645 separates the upper 5% from the remaining area.
From Step 4, the test statistic is 4.22, which is a z-score that is located in the rejection
region.

6) Conclusion

Since the test statistic is in the rejection region, the null hypothesis is rejected.
We can conclude that there is a mismatch between the nutritional information and the
product because it was found that the mean amount of sugar is significantly different
from 11 g, which is the value on the nutritional label.
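Steps 4 to 6 of this example can be sketched in Python as follows. The function name `two_tailed_z_test` is hypothetical; the critical value is computed from the standard normal distribution instead of being read from a table.

```python
from math import sqrt
from statistics import NormalDist

def two_tailed_z_test(x_bar, mu_0, s, n, alpha):
    """Return (z, z_crit, reject) for H0: mu = mu_0 against HA: mu != mu_0."""
    z = (x_bar - mu_0) / (s / sqrt(n))             # test statistic
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # area alpha/2 in each tail
    reject = abs(z) > z_crit                       # is z in either tail?
    return z, z_crit, reject

# Fruit jam example: claimed mean 11 g, sample of 40 jars, 10% significance level
z, z_crit, reject = two_tailed_z_test(x_bar=11.4, mu_0=11, s=0.6, n=40, alpha=0.10)
print(f"z = {z:.2f}, critical values = ±{z_crit:.3f}, reject H0: {reject}")
# z = 4.22, critical values = ±1.645, reject H0: True
```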

Example 2

Do female students perform better than male students in math?

A research study was performed in an attempt to answer the above question. The mean math
score of male students is 76.0%. From a random sample of 34 female students, it was found
that the mean math score was 78.4% with a standard deviation of 14.2%. Perform a
hypothesis test with a 10% level of significance.

1) Hypotheses
𝐻0 : Female students do not perform better than male students in math, 𝜇 = 76.0
𝐻𝐴 : Female students perform better than male students in math, 𝜇 > 76.0

2) Identify the significance level


𝛼 = 0.10

3) Sampling distribution

Population Sample
𝜇 = 76.0 𝑛 = 34
𝑥̅ = 78.4
𝑠 = 14.2

The sample is considered to be a large sample.

The sampling distribution will be a normal distribution.

4) Test Statistic

$$z = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

$$z = \frac{78.4 - 76.0}{14.2/\sqrt{34}}$$

$$z = \frac{2.4}{2.435279909}$$

𝑧 =̇ 0.99

5) Critical Regions

Because the alternative hypothesis is that 𝜇 > 76.0, we are testing whether 𝑥̅ is
significantly greater than the value of 76.0. This means that we are testing whether 𝑥̅
deviates so far above the value of 76.0 that there is a very low probability for that
deviation to occur.

This results in an upper-tail test, where the upper tail will be the critical region.

In Step 2, 𝛼 = 0.10

Thus, the area of the upper tail will be 10%

The shaded region represents the rejection region. The critical value is 𝑧 = 1.28

From Step 4, the test statistic is 0.99, which is a z-score that is located below the critical
value. The test statistic is located in the acceptance region.

6) Conclusion

Since the test statistic is in the acceptance region, it is concluded that we fail to reject
the null hypothesis.
We can conclude that there is not enough evidence to state that female students
perform significantly better than male students in math.
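A one-tailed test places the whole area 𝛼 in a single tail. The sketch below, with an illustrative function name and a `tail` parameter introduced only for this example, reproduces the decision for the math score data.

```python
from math import sqrt
from statistics import NormalDist

def one_tailed_z_test(x_bar, mu_0, s, n, alpha, tail):
    """Return (z, z_crit, reject) for a right-tailed or left-tailed z test."""
    z = (x_bar - mu_0) / (s / sqrt(n))
    if tail == "right":                        # HA: mu > mu_0
        z_crit = NormalDist().inv_cdf(1 - alpha)
        reject = z > z_crit
    elif tail == "left":                       # HA: mu < mu_0
        z_crit = NormalDist().inv_cdf(alpha)
        reject = z < z_crit
    else:
        raise ValueError("tail must be 'right' or 'left'")
    return z, z_crit, reject

# Math score example: HA is mu > 76.0, so the test is right-tailed
z, z_crit, reject = one_tailed_z_test(78.4, 76.0, 14.2, 34, alpha=0.10, tail="right")
print(f"z = {z:.2f}, critical value = {z_crit:.2f}, reject H0: {reject}")
# z = 0.99, critical value = 1.28, reject H0: False
```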

Practice

1) The mean daily yield of a chemical produced by a pharmaceutical company is 880 metric
tons. The quality control department would like to determine whether this average has
changed recently. A random sample of 50 daily yields was taken and the sample mean
was 872 metric tons. The standard deviation was 23 metric tons. Perform a hypothesis
test using a 5% level of significance.

2) It is recommended that the daily sodium intake should not exceed 2300 mg. To
determine whether Canadians are exceeding this limit, a random sample of 300
Canadians were surveyed. It was found that the sample mean was 2550 mg and sample
standard deviation was 923 mg. Using a 10% level of significance, test the hypothesis to
determine whether Canadians are exceeding the daily sodium intake limit.

3) A herbal supplement was tested to determine its effectiveness in lowering cholesterol. A
random sample of 49 participants were prescribed the supplement. Cholesterol levels
were measured before and after the treatment. Before taking the supplement, the
mean cholesterol level was recorded to be 141 mg/dL. After taking the supplement, the
mean cholesterol level was recorded to be 139 mg/dL with a standard deviation of 11.7
mg/dL. Perform a hypothesis test, with a 10% level of significance, to determine
whether the supplement was effective in lowering cholesterol level.

4) The marketing department wished to determine the effectiveness of a new
advertisement they used. Prior to the new advertisement, sales were at $64 000. After
the new advertisement was launched, 100 store locations were sampled and the mean
sales was found to be $66 000 with a sample standard deviation of $10 750. Did sales
increase significantly, indicating that the advertisement was effective? Perform a
hypothesis test at a 2% level of significance.

6.5 – Hypothesis Testing of Population Mean of a Small Sample

Recall that when a small sample (𝑛 < 30) is taken from a population, the sampling distribution
tends to follow a 𝑡-distribution.

When performing a hypothesis test in a situation where a small sample is collected, the
method of hypothesis testing consists of the same 6 steps:
1) Hypotheses
2) Identify the Significance Level
3) Sampling Distribution
4) Test Statistic
5) Critical Regions
6) Conclusion
When calculating the test statistic, the relevant formula for a small sample would be the 𝑡
distribution formula:

Sampling Distribution    Test Statistic

Normal (𝑧)               $z = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$

𝑡-distribution           $t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$

To determine the critical values that define the critical regions, 𝑡 values would be used.

Example 1
An ice cream company claimed that its product contained on average 500 calories per pint. To
test this claim, 24 one-pint containers were analyzed, giving a mean of 507 calories and a
standard deviation of 21 calories. Test the claim at a 2% level of significance.
1) Hypotheses
𝐻0 : The claim is true, there is on average 500 calories per pint of ice cream 𝜇 = 500
𝐻𝐴 : The claim is not true, the average calories per pint of ice cream is not 500 calories,
𝜇 ≠ 500

2) Identify the significance level


𝛼 = 0.02

3) Sampling distribution
The sample size is 24, which is a small sample. Therefore, the sampling distribution will
be a 𝑡 distribution

Population Sample
𝜇 = 500 𝑛 = 24
𝑥̅ = 507
𝑠 = 21

4) Test statistic

Since the sampling distribution is a 𝑡 distribution with 𝑑𝑓 = 23, the test statistic will be
calculated with

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

$$t = \frac{507 - 500}{21/\sqrt{24}}$$

$$t = \frac{7}{4.28660705}$$

𝑡 =̇ 1.633

5) Critical regions

This is a two-tailed test. Since 𝛼 = 0.02, the area in each tail is 0.01 or 1%.
With 𝑑𝑓 = 23, the critical values from the 𝑡 table are 𝑡 = ±2.500.

The test statistic, 𝑡 =̇ 1.633, lies between the critical values, so it is located in the
acceptance region.

6) Conclusion
Since the test statistic is in the acceptance region, we fail to reject the null hypothesis.
We can conclude that there is not enough evidence to say that the claim that the ice
cream contained on average 500 calories per pint is not true.
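A small-sample test follows the same pattern as the large-sample test, with the critical value read from the 𝑡 table. In the sketch below the function name is hypothetical, and the critical value 2.500 (𝑑𝑓 = 23, area 0.01 in each tail) is supplied as a table lookup rather than computed.

```python
from math import sqrt

def two_tailed_t_test(x_bar, mu_0, s, n, t_crit):
    """Return (t, reject) for H0: mu = mu_0 against HA: mu != mu_0."""
    t = (x_bar - mu_0) / (s / sqrt(n))   # test statistic with df = n - 1
    return t, abs(t) > t_crit            # reject only if t falls in either tail

# Ice cream example: claimed mean 500 calories, n = 24, alpha = 0.02
t, reject = two_tailed_t_test(x_bar=507, mu_0=500, s=21, n=24, t_crit=2.500)
print(f"t = {t:.3f}, reject H0: {reject}")   # t = 1.633, reject H0: False
```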

Example 2
A clinical trial was conducted to test the effectiveness of an antibiotic. Before treatment with the drug,
16 subjects had a mean duration of bacterial infection of 5.6 days. After treatment with the drug, the 16
subjects had a mean duration of bacterial infection of 4.0 days and a standard deviation of 1.6 days.
Perform a hypothesis test of the claim that the antibiotic is effective in lowering the duration of
bacterial infection. Test the claim at a 10% level of significance.

1) Hypotheses
𝐻0 : The antibiotic is not effective in lowering the mean duration of infection, 𝜇 = 5.6
𝐻𝐴 : The antibiotic is effective in lowering the mean duration of infection, 𝜇 < 5.6

2) Identify the significance level


𝛼 = 0.10
3) Sampling distribution

Population Sample
𝜇 = 5.6 𝑛 = 16
𝑥̅ = 4.0
𝑠 = 1.6

4) Test statistic

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

$$t = \frac{4.0 - 5.6}{1.6/\sqrt{16}}$$

$$t = \frac{-1.6}{0.4}$$

𝑡 =̇− 4.000

5) Critical Regions
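This is a lower-tailed test, with 𝛼 = 0.10, or 10% of the area in the lower tail.
With 𝑑𝑓 = 15, the critical value from the 𝑡 table is 𝑡 = −1.341.

The test statistic, 𝑡 =̇− 4.000, is below the critical value, so it is located in the rejection region.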


6) Conclusion

Since the test statistic is in the rejection region, we reject the null hypothesis.
We conclude that there is evidence that the antibiotic is effective because it significantly
lowered the duration of bacterial infection.
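For a lower-tailed test the comparison is made against a negative critical value. The sketch below uses a made-up function name and takes the table value for 𝑑𝑓 = 15 with an area of 0.10 in one tail, which is 1.341, as an input.

```python
from math import sqrt

def lower_tailed_t_test(x_bar, mu_0, s, n, t_table_value):
    """Return (t, reject) for H0: mu = mu_0 against HA: mu < mu_0."""
    t = (x_bar - mu_0) / (s / sqrt(n))
    critical_value = -t_table_value      # the rejection region is the lower tail
    return t, t < critical_value

# Antibiotic example: mean duration 5.6 days before, 4.0 days after, n = 16
t, reject = lower_tailed_t_test(x_bar=4.0, mu_0=5.6, s=1.6, n=16, t_table_value=1.341)
print(f"t = {t:.3f}, reject H0: {reject}")   # t = -4.000, reject H0: True
```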

Practice

1) A simple random sample of the weights of 19 green M&Ms has a mean of 0.8635 g and
a standard deviation of 0.0570 g.
Use a 5% significance level to test the claim that the mean weight of all green M&Ms is
equal to 0.8535 g, which is the mean weight required so that M&Ms have the weight
printed on the package label.
Do green M&Ms appear to have weights consistent with the package label?

2) The director of admissions at a university claimed that families with an income of
$30,000 a year contributed an average of $6000 per family toward a child’s education. A
sample of 20 such families whose children attended the university revealed a mean
contribution of $6200 and a standard deviation of $300. Test the claim at a 5% level
of significance.

3) A plastic has a mean breaking strength of 27 pounds per square inch and a standard
deviation of 6 pounds per square inch. A new process is developed and will replace the
old one, provided that there is substantial evidence that it improves the strength of the
product. A random sample of 15 pieces made with the new process gives a sample
mean of 30 pounds per square inch and a standard deviation of 5.3 pounds per square
inch. Is there sufficient evidence to suggest that the strength of the product has
increased at a 1% level of significance?

4) A cereal is packaged to contain 16 ounces, on average. A consumer agency has
received complaints claiming that packages of the cereal contain less than 16 ounces. To
test the claim that the mean content is less than 16 ounces, the agency randomly selects
25 packages and finds a sample mean of 15.1 ounces and a standard deviation of 3
ounces. Perform a hypothesis test of the claim at a 10% level of significance.

5) If the null hypothesis in the above question was rejected, what type of error could
have been made in reaching the conclusion?

6.6 – Hypothesis Testing of Population Proportion of a Large Sample

When testing a claim that involves a population proportion, 𝑝, the same steps of hypothesis
testing are used. However, the test statistic is calculated using a formula that is specific
to proportions.
Recall that the sample proportion is represented by 𝑝̂, and that 𝑞̂ = 1 − 𝑝̂. Similarly,
𝑞 = 1 − 𝑝 for the population proportion.
In the case of a large sample, the sampling distribution of the sample proportion, 𝑝̂, is
approximately normal.
The test statistic is calculated by:

$$z = \frac{\hat{p} - p}{\sqrt{\dfrac{pq}{n}}}$$

Example 1
In the 1970s, 15 out of 405 (3.7%) teachers hired in St. Louis County, Missouri, were African
American. In St. Louis County and a nearby city, 15.4% of teachers were African American.
The Equal Employment Opportunity Commission (EEOC) filed a lawsuit against the school
boards for discrimination against African Americans and won the case. The case depended
largely on a test of hypotheses.
If 15.4% is the proportion of African American teachers hired in the general population,
perform a hypothesis test at a 2.5% level of significance to determine whether this is a case of
discrimination in St. Louis County.

1) Hypotheses
𝐻0 : There is no discrimination against African American teachers in the hiring process, 𝑝 = 0.154
𝐻𝐴 : There is discrimination against African American teachers in the hiring process, 𝑝 < 0.154

2) Identify the significance level
𝛼 = 0.025

3) Sampling distribution

Population: 𝑝 = 0.154, 𝑞 = 0.846
Sample: 𝑛 = 405, 𝑝̂ = 0.037, 𝑞̂ = 0.963

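4) Test statistic

$$
\begin{aligned}
z &= \frac{\hat{p} - p}{\sqrt{\dfrac{pq}{n}}} \\
  &= \frac{0.037 - 0.154}{\sqrt{\dfrac{(0.154)(0.846)}{405}}} \\
  &= \frac{-0.117}{0.017935687} \\
  &\doteq -6.52
\end{aligned}
$$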

5) Critical regions

Since this is a lower-tailed test, the area in the lower tail is 2.5%. The critical value
separating the acceptance and rejection regions is 𝑧 = −1.96.

The test statistic is located in the rejection region.

6) Conclusion

Since the test statistic is in the rejection region, we reject the null hypothesis.
We conclude that the proportion of African American teachers hired is significantly
lower than 15.4%. There is evidence of discrimination against African American teachers
in the hiring process.
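
The calculation in this example can be reproduced with a few lines of code. The following is a minimal Python sketch rather than part of the original solution: the helper name `proportion_z` is hypothetical, and the critical value is obtained from the standard library's `statistics.NormalDist` instead of a printed z-table.

```python
import math
from statistics import NormalDist

def proportion_z(p_hat, p, n):
    """z test statistic for a proportion: (p-hat - p) / sqrt(p*q / n)."""
    q = 1 - p
    return (p_hat - p) / math.sqrt(p * q / n)

# Values from Example 1 (p-hat rounded to 0.037, as in the worked solution).
z_stat = proportion_z(p_hat=0.037, p=0.154, n=405)

# Lower-tailed critical value: the z score with 2.5% of the area below it.
alpha = 0.025
critical_value = NormalDist().inv_cdf(alpha)

# Lower-tailed test: reject H0 when the statistic falls below the critical value.
reject_null = z_stat < critical_value

print(f"z = {z_stat:.2f}, critical value = {critical_value:.2f}, reject H0: {reject_null}")
```

Note that the standard error uses the hypothesized 𝑝 and 𝑞, not the sample values 𝑝̂ and 𝑞̂, because the test statistic is computed under the assumption that the null hypothesis is true.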

Practice

1) A survey was conducted to determine the proportion of the population that meets the
minimum daily fiber intake requirement, which is 25 g for women and 38 g for men.
Among the 514 people surveyed, 90% claimed that they do not meet the minimum daily
fiber intake requirement. Use a 5% significance level to test the claim that 85% of the
population does consume enough fiber per day.

2) In a previous educational research study, it was found that 14% of college students drop
out within the first year of their college studies. It was suspected that the dropout rate
has increased over the years. In a sample of 32 colleges, it was found that the dropout
rate is 17.3%. Use a 10% significance level to test the claim that the dropout rate has
increased significantly.

3) One organization claimed that about 20% of Canadian adults have a physically active
lifestyle. A random sample of 100 Canadian adults was surveyed. It was found that 79 adults
stated that they do not have a physically active lifestyle. Using a 10% level of
significance, perform a hypothesis test of the claim that 20% of Canadian adults
have a physically active lifestyle.

4) A machine that manufactures machine parts produces 4.5% defectives. A technician
made an adjustment and believed that it would help decrease the proportion of
defectives produced. A sample of 60 parts was taken to test whether the proportion of
defectives has decreased. It was found that 2 of the 60 parts were defective. Use a 2%
level of significance to test whether there is a decrease in the proportion of
defectives produced.

Lab Activities

Unit 1
Activity: Graphical Representation of Data

The following data set shows the amount of energy consumption, measured in quadrillion BTUs
(British Thermal Units), in the U.S. in 2009 in the following areas:
Electricity – 4.388
Natural Gas – 4.694
Propane – 0.492
Fuel Oil – 0.584

a) Determine the most appropriate graph to visually represent this data set.
b) Using Microsoft Excel, graph the data.
c) Can this data set be represented using a frequency distribution? Explain.
d) Using Microsoft Excel, create a stem-and-leaf plot of the data set.

Unit 2
Activity: Descriptive Statistics

The following data set gives the number of manatee deaths caused each year by collisions with
watercraft. Manatees are large mammals that live underwater and near waterways.
78 81 95 73 69 79 92 73 90 97

a) Using Excel, construct a histogram
b) Using Excel, determine the mean
c) Using Excel, determine the median
d) Using Excel, determine the mode
e) Using Excel, determine the midrange
f) Describe the distribution
g) Where are the mean and median located?
h) Using Excel, determine the range, variance, and standard deviation

Unit 3
Activity 1: Linear Regression

Given the following data set:

Years of experience    Salary in thousands
12                     39
16                     41
6                      33
23                     44
27                     48
8                      34
5                      32
19                     44
23                     46
13                     37
16                     43
8                      37

[Figure: chart titled "Salary of Employees"; horizontal axis: Years of Experience; vertical axis: Salary in thousands]

a) Using Excel, create a table to calculate the equation of the linear regression line.
b) Graph the data using a scatterplot.
c) Use Excel’s trend line feature to add a trend line to the scatterplot.
d) Verify that your calculated regression line is the same as the trend line provided by Excel.

Table for part a):

   𝑥      𝑦      𝑥𝑦      𝑥²

   Sum

e) Using Excel, determine the linear correlation coefficient of the above set of data using
this formula:

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}$$
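
One way to cross-check the Excel results is to repeat the calculation in code. The following is a minimal Python sketch, offered as an illustration only: the column sums mirror the 𝑥, 𝑦, 𝑥𝑦, 𝑥² table, the slope and intercept use the standard least-squares formulas for a line 𝑦 = 𝑎 + 𝑏𝑥 (assumed here, since they may be written differently in Section 3.1), and 𝑟 follows the formula in part e).

```python
from statistics import mean, stdev

# Data from the activity: years of experience (x) and salary in thousands (y).
x = [12, 16, 6, 23, 27, 8, 5, 19, 23, 13, 16, 8]
y = [39, 41, 33, 44, 48, 34, 32, 44, 46, 37, 43, 37]
n = len(x)

# Column sums matching the x, y, xy, x^2 table.
sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Assumption: standard least-squares formulas for the line y = a + b*x.
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
intercept = mean(y) - slope * mean(x)

# Linear correlation coefficient, using sample standard deviations s_x and s_y.
x_bar, y_bar = mean(x), mean(y)
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (
    (n - 1) * stdev(x) * stdev(y)
)

print(f"y = {intercept:.3f} + {slope:.3f}x, r = {r:.3f}")
```

`statistics.stdev` divides by 𝑛 − 1, which matches the (𝑛 − 1) factor in the formula for 𝑟.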

Unit 3
Activity 2: Linear Regression

Police investigators and forensic scientists can often predict the height of a suspect just from
a footprint that was left behind at a crime scene. This is because the size of a person’s foot
tends to relate to the height of a person. As you might already know, this is a positive
relationship. The taller the person, the larger their feet tend to be.

Given the bivariate data below,

Length of footprint (cm)   29.8    29.3    27.5    31.1    32.6    22.5
Height (cm)                176.1   179.4   172.2   184.2   179.0   157.3

Use Microsoft Excel for the following:

a) Create a scatter plot to display the data
b) Determine the least squares regression line
c) Using the equation of the line, predict the foot length of a person 155 cm tall.
d) If a police investigator found a footprint with a length of 34 cm, estimate the height of
the suspect.
e) Determine the linear correlation coefficient
f) Describe the relationship between the length of a footprint and the height of a
person.

Unit 4
Activity: Binomial Probability Distribution

Adults are randomly selected, with replacement, from a group. A survey of that group
showed that 26% of the adults are smokers.

a) Use Excel to create a probability distribution table and histogram to represent the
experiment of selecting 6 adults. Let 𝑥 represent the number of smokers selected.

b) Create a formula in Excel to calculate the probability of selecting 5 non-smokers
when 10 adults are randomly selected.
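
The Excel results for this activity can be cross-checked in code. The following is a minimal Python sketch, not a prescribed solution: the helper name `binomial_pmf` is hypothetical, and it assumes that selecting with replacement makes each selection an independent trial with the same probability of success.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

p_smoker = 0.26

# Part a): distribution of x = number of smokers among 6 selected adults.
distribution = {k: binomial_pmf(k, 6, p_smoker) for k in range(7)}
for k, prob in distribution.items():
    print(f"P(x = {k}) = {prob:.4f}")

# Part b): exactly 5 non-smokers among 10 adults (success = non-smoker).
p_five_non_smokers = binomial_pmf(5, 10, 1 - p_smoker)
print(f"P(5 non-smokers out of 10) = {p_five_non_smokers:.4f}")
```

In part b) a "success" is redefined as selecting a non-smoker, so the probability of success becomes 1 − 0.26 = 0.74.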

Unit 5
Activity: Normal Distribution

Using Excel, create a formula that allows you to input 𝑥, 𝜇, and 𝜎, to calculate the 𝑧 score.
Using the formula you have created, answer the following questions:

a) If human body temperature is normally distributed with a mean of 37.1 degrees
Celsius and a standard deviation of 0.2 degrees Celsius, then what is the 𝑧 score of a
person with a body temperature of 38 degrees Celsius?

b) Referring to the above question, what is the 90th percentile for body temperatures?
Use the 𝑧 score to determine the value of body temperature.

c) If the birth weight of a baby is normally distributed with a mean of 3.4 kg and a
standard deviation of 800 g, then what is the 𝑧 score of a baby with a birth weight of
3.6 kg?

d) Referring to the above question, what is the probability for a baby to have a birth
weight of less than 3.6 kg?

e) What is the birth weight of a baby at the 40th percentile?

f) Above what weight do about 87.70% of the weights occur?

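
The formula this activity asks for can also be written as a small function, which is useful for checking the Excel version. The following is a minimal Python sketch under stated assumptions: the function names `z_score` and `value_from_z` are hypothetical, and the percentile lookup uses the standard library's `statistics.NormalDist` in place of a z-table.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Standard score: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma

def value_from_z(z, mu, sigma):
    """Invert the z-score formula to recover the original value."""
    return mu + z * sigma

# Part a): body temperature of 38 with mean 37.1 and standard deviation 0.2.
z_temp = z_score(38, 37.1, 0.2)

# Part b): z score with 90% of the area below it, converted back to a temperature.
z_90 = NormalDist().inv_cdf(0.90)
temp_90 = value_from_z(z_90, 37.1, 0.2)

print(f"z = {z_temp:.2f}, 90th percentile = {temp_90:.2f} degrees Celsius")
```

When applying the same functions to the birth weight questions, the mean and standard deviation must be expressed in the same unit (for example, 3.4 kg and 0.8 kg).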

Unit 6
Activity 1: Estimating a Population Parameter

Twelve different computer games that involved violent activities were sampled randomly.
The duration times (in seconds) of the violent activities were recorded and are listed
below. Use the sample data to construct a 95% confidence interval estimate of the mean
duration time of violence in all computer games of this type.

84 14 583 50 3 57 207 43 178 0 2 57

Unit 6
Activity 2: Hypothesis Testing

Listed below are speeds in km/h measured from southbound traffic on a Toronto city street.
This simple random sample was obtained at 3:30pm on a weekday. Use a 5% significance
level to test the claim that city drivers are driving less than the speed limit of 70 km/h on that
city street.

62 61 61 57 61 54 59 58 59 69 60 67

Answers

Chapter Answers
