What is Data?
Data is a collection of facts, such as numbers, words, measurements, observations or even just
descriptions of things.
Qualitative vs Quantitative
Data can be qualitative or quantitative.
Discrete data can only take certain values (like whole numbers)
Continuous data can take any value (within a range)
Qualitative:
Quantitative:
Discrete:
He has 4 legs
He has 2 brothers
Continuous:
He weighs 25.5 kg
He is 565 mm tall
More Examples
Qualitative:
Quantitative:
Height (Continuous)
Weight (Continuous)
Petals on a flower (Discrete)
Customers in a shop (Discrete)
Collecting
Data can be collected in many ways. The simplest way is direct observation.
Example: you want to find how many cars pass by a certain point on a road in a 10-minute
interval.
So: stand at that point on the road, and count the cars that pass by in that interval.
Census or Sample
A Census is when we collect data for every member of the group (the whole "population").
A Sample is when we collect data just for selected members of the group.
You can ask everyone (all 120) what their age is. That is a census.
Or you could just choose the people that are there this afternoon. That is a sample.
A census is accurate, but hard to do. A sample is not as accurate, but may be good enough, and
is a lot easier.
Language
Data or Datum?
"Data" is the plural so we say "the data are available", but it is also a collection of facts, so "the
data is available" is fine too.
Discrete and Continuous Data
Data can be Descriptive (like "high" or "fast") or Numerical (numbers).
Discrete Data
Discrete Data can only take certain values.
Example: the number of students in a class (you can't have half a student).
Continuous Data
Examples:
A person's height: could be any value (within the range of human heights), not just certain
fixed heights,
Time in a race: you could even measure it to fractions of a second,
A dog's weight,
The length of a leaf,
Lots more!
Analog and Digital
Arrow Barks!
Let's record him barking:
Arrow's bark is analog. It is actual pressure waves in the air, so it is physical with continuous
change.
And the microphone converts that pressure into an electrical signal. It is still analog (the
electricity is physical, and has continuous change).
So the "sound" is now "12, 25, 39, 52, 68, 71, 78, 82, 82, 79, 70, 59, ..." (in fact it would be
in binary, so would be something like "000011000001100100100111...")
It is now digital!
Notice the digital data has sudden jumps up and down ... it does not change continuously.
It is Discrete Data: that means it can only be certain values (such as 1, 2, 3, etc).
Digital data is very easy for computers and phones to use. It can be saved, shared electronically,
sent all over the world quickly and more.
It should sound very much like the original bark (but not perfectly so!)
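The step from analog to digital can be sketched in a few lines of Python. This is only an illustration: the function names `digitize` and `tone`, the 440 Hz tone standing in for the bark, the sample rate and the 256 levels are all assumptions made for the example, not details from the recording described above.

```python
import math

def digitize(signal, duration, sample_rate, levels=256):
    """Sample a continuous signal and round each sample to a whole-number level."""
    samples = []
    n = round(duration * sample_rate)      # how many samples to take
    for i in range(n):
        t = i / sample_rate                # time of this sample, in seconds
        value = signal(t)                  # "analog" value between -1.0 and 1.0
        # Map -1..1 onto 0..levels-1, then round: this is where the data becomes discrete.
        level = round((value + 1) / 2 * (levels - 1))
        samples.append(level)
    return samples

def tone(t):
    # Hypothetical stand-in for the bark: a smooth 440 Hz wave.
    return math.sin(2 * math.pi * 440 * t)

samples = digitize(tone, duration=0.001, sample_rate=8000)
print(samples)                                # whole numbers only
print([format(s, "08b") for s in samples])    # the same numbers written in binary
```

Every value in `samples` is a whole number from 0 to 255, so the smooth wave has turned into discrete data with small jumps between samples.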
Digital Pictures
A similar thing happens when you take a picture.
Light (which is analog) gets projected onto a grid of millions of little sensors inside the camera:
The camera measures the light at each point and produces numbers.
So the "picture" is now "A1DDF9, ADE3FF, B5E7FE, AFE4F8, ...", which are hexadecimal color
numbers, (that are used internally in binary, so would be something like
"101000011101110111111001...")
Look really closely at a digital picture ... it is made up of millions of little squares called "pixels":
Digital IS Numbers
So digital pictures, music, videos etc are actually stored on your device as numbers.
Numbers rule!
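As a small illustration of "digital is numbers", the hexadecimal colors quoted above can be turned into red, green and blue values and into binary. The helper names `hex_to_rgb` and `hex_to_bits` are made up for this sketch.

```python
def hex_to_rgb(color):
    """Split a 6-digit hexadecimal color into (red, green, blue), each 0 to 255."""
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))

def hex_to_bits(color):
    """Write the same color as a 24-digit binary string."""
    return format(int(color, 16), "024b")

for color in ["A1DDF9", "ADE3FF", "B5E7FE", "AFE4F8"]:
    print(color, hex_to_rgb(color), hex_to_bits(color))
```

The first color, A1DDF9, comes out as "101000011101110111111001", the binary string shown in the text.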
Bar Graphs
A Bar Graph (also called Bar Chart) is a graphical display of data using bars of different heights.
Imagine you just did a survey of your friends to find which kind of movie they liked best:
We can use bar graphs to show the relative sizes of many things, such as what type of car
people have, how many customers a shop has on different days and so on.
People: 35 30 10 25 40 5
Grade: A B C D
Students: 4 12 10 2
You can create graphs like that using our Data Graphs (Bar, Line and Pie) page.
But when you have continuous data (such as a person's height) then use a Histogram.
It is best to leave gaps between the bars of a Bar Graph, so it doesn't look like a Histogram.
Pie Chart
Pie Chart: a special chart that uses "pie slices" to show relative sizes of data.
Imagine you survey your friends to find the kind of movie they like best:
It is a really good way to show relative sizes: it is easy to see which movie types are most liked,
and which are least liked, at a glance.
You can create graphs like that using our Data Graphs (Bar, Line and Pie) page.
Next, divide each value by the total and multiply by 100 to get a percent:
Now to figure out how many degrees for each "pie slice" (correctly called a sector).
Draw a circle.
Finish up by coloring each sector and giving it a label like "Comedy: 4 (20%)", etc.
Another Example
You can use pie charts to show the relative sizes of many things, such as:
Here is how many students got each grade in the recent test:
A B C D
4 12 10 2
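The percent and degree calculations for a pie chart can be sketched in Python using the grade counts above. The function name `pie_sectors` is an assumption for this example.

```python
def pie_sectors(counts):
    """Return {label: (percent, degrees)} for each slice of a pie chart."""
    total = sum(counts.values())
    sectors = {}
    for label, count in counts.items():
        percent = count / total * 100    # share of the whole, as a percent
        degrees = count / total * 360    # a full circle has 360 degrees
        sectors[label] = (percent, degrees)
    return sectors

grades = {"A": 4, "B": 12, "C": 10, "D": 2}
for label, (percent, degrees) in pie_sectors(grades).items():
    print(f"{label}: {grades[label]} ({percent:.1f}%), {degrees:.1f} degrees")
```

The degrees always add up to 360, which is a handy check before drawing the sectors.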
Dot Plots
A Dot Plot is a graphical display of data using dots.
A survey of "How long does it take you to eat breakfast?" has these results:
Minutes: 0 1 2 3 4 5 6 7 8 9 10 11 12
People: 6 2 3 5 2 5 0 0 2 3 7 4 1
Which means that 6 people take 0 minutes to eat breakfast (they probably had no breakfast!), 2
people say they only spend 1 minute having breakfast, etc.
And here is the dot plot:
Another version of the dot plot has just one dot for each data point like this:
Example: (continued)
But notice that we need to have lines and numbers on the side so we can see what the dots
mean.
Grouping
Example: Access to Electricity across the World
Some people don't have access to electricity (they live in remote or poorly served areas). A
survey of many countries had these results:
Country        Access to Electricity (% of population)
Algeria        99.4
Angola         37.8
Argentina      97.2
Bahrain        99.4
Bangladesh     59.6
...            ... etc
But hang on! How do we make a dot plot of that? There might be only one "59.6" and one
"37.8", etc. Nearly all values will have just one dot.
In this case let's try rounding every value to the nearest 10%:
Country        Access to Electricity (% of population, nearest 10%)
Algeria        100
Angola         40
Argentina      100
Bahrain        100
Bangladesh     60
...            ... etc
Now we count how many of each 10% grouping and these are the results:
Access to Electricity (% of population, nearest 10%)    Number of Countries
10     5
20     6
30     12
40     5
50     4
60     5
70     6
80     10
90     15
100    34
So there were 5 countries where only 10% of the people had access to electricity, 6 countries
where 20% of the people had access to electricity, etc
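The rounding and counting can be sketched in Python. Only the five countries listed above are used, so the counts here are much smaller than in the full survey; the helper name `nearest_ten` is an assumption.

```python
import math
from collections import Counter

def nearest_ten(value):
    # floor(x/10 + 0.5) rounds halves up (Python's round() rounds halves to even).
    return math.floor(value / 10 + 0.5) * 10

access = {"Algeria": 99.4, "Angola": 37.8, "Argentina": 97.2,
          "Bahrain": 99.4, "Bangladesh": 59.6}

rounded = {country: nearest_ten(pct) for country, pct in access.items()}
counts = Counter(rounded.values())          # how many countries in each 10% group

for group in sorted(counts):
    print(f"{group:3d} {'*' * counts[group]}")   # a simple text dot plot
```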
Line Graphs
Line Graph: a graph that shows information that is connected in some way (such as change
over time)
You are learning facts about dogs, and each day you do a short test to see how good you are.
These are the results:
Let's make the vertical scale go from $0 to $800, with tick marks every $200
A Title
Vertical scale with tick marks and labels
Horizontal scale with tick marks and labels
Data points connected by lines
Scatter Plots
A Scatter (XY) Plot has points that show the relationship between two sets of
data.
In this example, each dot shows one person's weight versus their height.
(The data is plotted on the graph as " Cartesian (x,y) Coordinates ")
Example:
The local ice cream shop keeps track of how much ice cream they sell versus the noon
temperature on that day. Here are their figures for the last 12 days:
Temperature (°C)    Ice Cream Sales
14.2    $215
16.4    $325
11.9    $185
15.2    $332
18.5    $406
22.1    $522
19.4    $412
25.1    $614
23.4    $544
18.1    $421
22.6    $445
17.2    $408
It is now easy to see that warmer weather leads to more sales, but the relationship is not
perfect.
Line of Best Fit
We can also draw a "Line of Best Fit" (also called a "Trend Line") on our scatter plot:
Try to have the line as close as possible to all points, and as many points above the line as
below.
Here we use linear extrapolation to estimate the sales at 29°C (which is higher than any
value we have).
Careful: Extrapolation can give misleading results because we are in "uncharted territory".
As well as using a graph (like above) we can create a formula to help us.
We can estimate a straight line equation from two points from the graph above
Let's estimate two points on the line near actual values: (12, $180) and (25, $610)
Slope m = ($610 − $180) / (25 − 12)
= $430 / 13
= 33 (rounded)
Now put the slope and the point (12, $180) into the "point-slope" formula:
y − y1 = m(x − x1)
y − 180 = 33(x − 12)
y = 33x − 216
INTERPOLATING
EXTRAPOLATING
The values are close to what we got on the graph. But that doesn't mean they are more (or less)
accurate. They are all just estimates.
Don't use extrapolation too far! What sales would you expect at 0°C?
Note: we used linear (based on a line) interpolation and extrapolation, but there are many
other types, for example we could use polynomials to make curvy lines, etc.
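The straight-line estimate can be sketched in Python from the two estimated points (12, $180) and (25, $610). The function names are made up for this example, and the temperature of 21 used for interpolation is just an illustrative value inside the measured range.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the line through two points, slope rounded as in the text."""
    (x1, y1), (x2, y2) = p1, p2
    slope = round((y2 - y1) / (x2 - x1))   # 430 / 13 = 33.07..., rounded to 33
    intercept = y1 - slope * x1            # from y - y1 = m(x - x1)
    return slope, intercept

slope, intercept = line_through((12, 180), (25, 610))

def estimate_sales(temperature):
    """Estimated sales in dollars at a given noon temperature."""
    return slope * temperature + intercept

print(slope, intercept)        # the line y = 33x - 216
print(estimate_sales(21))      # interpolating (inside the data)
print(estimate_sales(29))      # extrapolating (outside the data)
print(estimate_sales(0))       # extrapolating too far
```

At 0 the line gives a negative amount of sales, which shows why extrapolating a long way from the data can be misleading.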
Correlation
When the two sets of data are strongly linked together we say they have a High Correlation.
Like this:
Negative Correlation
Correlations can be negative, which means there is a correlation but one value goes down as the
other value increases.
Country    Yearly Production per Person    Birth Rate
Note: I tried to fit a straight line to the data, but maybe a curve would work better, what do you
think?
Pictographs
A Pictograph is a way of showing data using images.
Each image stands for a certain number of things.
Here is a pictograph of how many apples were sold at the local shop over 4 months:
Note that each picture of an apple means 10 apples (and the half-apple picture means 5
apples).
But it is not very accurate: in the example above we can't show just 1 apple sold, or 2 apples
sold etc.
Four friends play a lot of tennis. Here is how many games they played this year:
Each tennis ball means 20 games played. A tennis ball can be cut to show part of 20.
Histograms
Histogram: a graphical display of data using bars of different heights.
You measure the height of every tree in the orchard in centimeters (cm)
You can see (for example) that there are 30 trees from 150 cm to just below 200 cm tall
Each month you measure how much weight your pup has gained and get these results:
They vary from −0.2 (the pup lost weight that month) to 1.6
(There are no values from 1 to just below 1.5, but we still show the space.)
Histograms are a great way to show results of continuous data, such as:
weight
height
how much time
etc.
But when the data is in categories (such as Country or Favorite Movie), we should use
a Bar Chart.
Frequency Histogram
A Frequency Histogram is a special histogram that uses vertical columns to show frequencies
(how many times each score occurs):
Frequency Distribution
Frequency
Frequency is how often something occurs.
Saturday Morning,
Saturday Afternoon
Thursday Afternoon
The frequency was 2 on Saturday, 1 on Thursday and 3 for the whole week.
Frequency Distribution
By counting frequencies we can make a Frequency Distribution table.
Example: Goals
Sam's team has scored the following numbers of goals in recent games:
2, 3, 1, 2, 1, 3, 2, 3, 4, 5, 4, 2, 2, 3
Frequency Distribution: values and their frequency (how often each value occurs).
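Counting the frequencies for Sam's team can be sketched in Python. `Counter` from the standard library does the counting; the variable names are only for illustration.

```python
from collections import Counter

goals = [2, 3, 1, 2, 1, 3, 2, 3, 4, 5, 4, 2, 2, 3]

frequency = Counter(goals)      # maps each value to how often it occurs

print("Goals  Frequency")
for value in sorted(frequency):
    print(f"{value:5d}  {frequency[value]:9d}")
```

The frequencies add up to 14, the number of games in the list.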
Example: Newspapers
These are the numbers of newspapers sold at a local shop over the last 10 days:
It is also possible to group the values. Here they are grouped in 5s:
Graphs
After creating a Frequency Distribution table you might like to make a Bar Graph or a Pie
Chart using the Data Graphs (Bar, Line and Pie) page.
Example:
More Examples:
Stem "1" Leaf "5" means 15
Stem "1" Leaf "6" means 16
Stem "2" Leaf "1" means 21
etc
The "stem" values are listed down, and the "leaf" values go right (or left) from the stem values.
The "stem" is used to group the scores and each "leaf" shows the individual scores within each
group.
Sam got his friends to do a long jump and got these results:
2.3, 2.5, 2.5, 2.7, 2.8, 3.2, 3.6, 3.6, 4.5, 5.0
Stem | Leaf
2 | 3 5 5 7 8
3 | 2 6 6
4 | 5
5 | 0
Note:
Say what the stem and leaf mean (Stem "2" Leaf "3" means 2.3)
In this case each leaf is a decimal
It is OK to repeat a leaf value
5.0 has a leaf of "0"
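Building the long jump plot can be sketched in Python. The function name `stem_and_leaf` is an assumption, and the sketch only handles values with one decimal place, like the ones above.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group one-decimal values by whole-number stem, with the tenths digit as the leaf."""
    plot = defaultdict(list)
    for value in sorted(values):
        tenths = round(value * 10)          # 2.3 becomes 23
        stem, leaf = divmod(tenths, 10)     # 23 becomes stem 2, leaf 3
        plot[stem].append(leaf)
    return dict(plot)

jumps = [2.3, 2.5, 2.5, 2.7, 2.8, 3.2, 3.6, 3.6, 4.5, 5.0]
plot = stem_and_leaf(jumps)

for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```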
Cumulative Tables and Graphs
Cumulative
Cumulative means "how much so far".
Month Earned
March $120
April $50
May $110
June $100
July $50
August $20
The first line is easy, the total earned so far is the same as Jamie earned that month:
But for April, the total earned so far is $120 + $50 = $170 :
Do you see how we add the previous month's cumulative total to this month's earnings?
The last cumulative total should match the total of all earnings:
$120+$50+$110+$100+$50+$20 = $450
So we got it right.
So that's how to do it, add up as you go down the list and you will have cumulative totals.
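Adding up as you go can be sketched in Python with `itertools.accumulate`, using Jamie's earnings from the table above.

```python
from itertools import accumulate

earnings = {"March": 120, "April": 50, "May": 110,
            "June": 100, "July": 50, "August": 20}

# Each cumulative total is the previous total plus this month's earnings.
cumulative = list(accumulate(earnings.values()))

for month, earned, total in zip(earnings, earnings.values(), cumulative):
    print(f"{month:7s} ${earned:4d} ${total:4d}")
```

The last cumulative total equals the sum of all the earnings, which is the same check used above.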
Graphs
We can make cumulative graphs, too. Just plot each cumulative total:
How to Do a Survey
Surveys can help decide what needs changing, where money should be spent, what
products to buy, what problems there might be, or lots of other questions you may
have at any time.
The best part about surveys is that they can be used to answer any question about any
topic.
You can survey people (through questionnaires, opinion polls, etc) or things (like pollution
levels in a river, or traffic flow).
Four Steps
Here are four steps to a successful survey:
When doing a simple survey, you can use tally marks to show each person's answer:
Sometimes, it is helpful to be creative in how the people can respond. It makes it more fun for
both you and your respondents (the people answering the question).
Have them write down their favorite color on a piece of paper and drop it in a fish bowl.
Then, put all of the pieces of paper into piles and count them.
To help you make a good Questionnaire read our page Survey Questions.
If you survey a small group you can ask everybody (called a Census)
If you want to survey a large group, you may not be able to ask everybody so you should ask a
sample of the population (called a Sample)
When you are Sampling you should be careful who you ask.
And the surveys where people are asked to ring a number to vote are not
very accurate, because only certain types of people actually ring up!
Example: You want to know the favorite colors for people at your school, but don't have the time
to ask everyone.
stand at the gate and choose "the next person to arrive" each time
or choose people randomly from a list and then go and find them!
or you could choose every 5th person
If you choose a person and they do not want to answer, record "no answer" on the survey
form and mention how many people did not answer in your report.
After completing a sampling survey you can use the information to make a prediction as to how
the rest of the population might respond.
And your results are better when you ask more people.
Example: nationwide opinion polls survey up to 2,000 people, and the results are nearly as good
(within about 1%) as asking everyone.
By "tally" I mean add up. This usually involves lots of paperwork and computer work
(spreadsheets are useful!)
Example: For "favorite colors of my class" you can simply write tally marks like this (every fifth
mark crosses the previous 4 marks, so you can easily see groups of 5):
We have written a special page called Showing the Results of a Survey, but here is a quick
summary:
Tables
A table is a very simple way to show others the results. A table should have a title, so those
looking at it understand what results the table shows:
4 5 6 1 4
Statistics
You can also summarize the results using statistics, such as mean or standard deviation
Example: you have lots of information about how long it takes people to get to school but it may
be simpler just to present a summary such as:
Graphs
But nothing makes a report look better than a nice graph or chart.
Survey Questions
How to make a good Questionnaire!
This defines your objectives: what you want to achieve by the end of
the survey.
Example: you want to clean up the local river. You feel that with some help and some money you
could make it really beautiful again.
Questions
Now you know why you are doing a survey, start writing down the questions you will ask!
Just write down any questions you think may be useful. Don't worry about quality at this stage,
we will improve your list of questions later.
You can also ask the person about themselves (not too personal!), such as age group, male or
female, etc, so that you know the kind of people that you have been surveying.
Your Turn: Go ahead and write down the questions for your own survey!
Types of Questions
A survey question can be:
Closed ended questions are much easier to total up later on, but may stop people giving an
answer they really want.
Open-ended: Someone may answer "dark fuchsia", in which case you will need to have a
category "dark fuchsia" in your results.
Closed-ended: With a choice of only 12 colors your work is easier, but they may not be able to
pick their exact favorite color.
Example: "What do you think is the best way to clean up the river?"
Make it Open-ended: the answers won't be easy to put in a table or graph, but you may get
some good ideas, and there may be some good quotes for your report.
Example: "How often do you visit the river?"
Question Sequence
It is important that the questions don't "lead" people to the answer
Example: people may say "yes" to donate money if you ask the questions this way
But probably will say "no" if you ask the questions this way:
Go through your questions and put them in the best sequence possible
Example: I will ask people how often they visit the river (a fact) before I ask them what they feel
about pollution (an opinion)
I will ask people their general feelings about the environment before I ask them their feelings
about the river.
Neutral Questions
Your questions should also be neutral ... allowing the person to
think their own thoughts about the question.
The question "Do you love nature?" (in the example above) is a bad question as it almost
forces the person to say "Yes, of course."
Not Important
Some Importance
Very Important
But you can also make statements and see if people agree:
Possible Answers
For each "closed-ended" question try to think:
If you are not sure what people might answer, you could always try a small open ended survey
(maybe ask your friends or people in the street) to find common answers.
Trick: try to avoid neutral answers (such as "don't care") because people may choose them so
they don't have to think about the answer!
It is also helpful to have an other category in case none of the choices are satisfactory for the
person answering the question.
SCALED ANSWERS
Sometimes you could have a scale on which they can rate their feelings about the question.
Have "opposite" words at either end and a scale in between like this:
Examples:
For this type of answer the person gets to rate or rank each option.
Don't have too many items though, as that makes it too hard to answer.
Example: Please rank the following activities from 1 to 5, putting 1 next to your favorite through
to 5 for your least favorite.
___ Fishing
___ Football
___ Golf
___ Shopping
___ Sleeping
NUMBER ANSWERS
Example: "How many times did you visit the river during the past year?"
____ times
Look at each "closed-end" question and choose the best answer options.
Type your questions and answer options into a word processor or spreadsheet,
and format it neatly.
in a table,
a bar graph,
a pie chart,
or just explained in words.
Make sure each question is set up so you can present the answers in your
chosen style.
Example: you decide to have six options for "How many times do you visit the river" so the bar
graph looks best.
Test It Out
You should test your questionnaire on a few people.
It is also a good idea to time how long it takes so you can tell people "this survey only
takes __ minutes" (put in your time). Use the Stopwatch.
Take notes of any difficulties your friends have with the questionnaire, and see
what you can do to improve it.
will the questions really help you find out what you want to know?
are there some questions you can remove? (smaller surveys are easier!)
This is your last chance to make sure your questionnaire is a good one!
A table is a very simple way to show others the results. A table should have a title, so those
looking at it understand what it shows:
4 5 6 1 4
Statistics
You can also summarize the results using statistics, such as Mean, Median, Mode, Standard
Deviation and Quartiles
Example: you have lots of information about how long it takes people to get to school but it may
be simpler just to present a summary such as:
Graphs
But nothing makes a report look better than a nice graph or chart
There are many different types of graphs. Three of the most common are:
Line Graph - shows information that is somehow connected (such as change over
time)
Pie Chart - shows sizes as part of a whole (good for showing percentages).
You can create graphs like those using our Data Graphs (Bar, Line and Pie) page
People's Comments
If people have given their opinions or comments in the survey, you can present the more
interesting ones:
Example: In response to the question "How can we best clean up the river?" we received these
interesting replies:
Report
Put it all together into a report, with a nice introduction, and conclusions at the end, and you are
done!
Accuracy
Accuracy is how close a measured value is to the actual (true) value.
Precision
Precision is how close the measured values are to each other.
Examples of Accuracy and Precision:
So, if you are playing soccer and you always hit the left goal post instead of scoring, then you
are not accurate, but you are precise!
How to Remember?
aCcurate is Correct (a bullseye).
pRecise is Repeating (hitting the same spot, but maybe not the correct spot)
Bias is a systematic (built-in) error which makes all measurements wrong by a certain amount.
Examples of Bias
In each case all measurements are wrong by the same amount. That is bias.
Degree of Accuracy
Accuracy depends on the instrument we are measuring with. But as a general rule:
The degree of accuracy is half a unit each side of the unit of measure
Examples:
When an instrument measures in "1"s
any value between 6½ and 7½ is measured as "7"
When an instrument measures in "2"s
any value between 7 and 9 is measured as "8"
(Notice that the arrow points to the same spot, but the measured values are different!
Read more at Errors in Measurement. )
Examples:
Question
4
5
An estimate is fine (example: cutting grass along a street, estimate the area of grass and use
the internet to find how fast grass is usually cut).
Is it possible to answer?
Can we answer it exactly (or close enough)?
Do we know what each part of it means?
Ask "does that depend on ... "
When you start counting the trees you may find lots of tiny ones ... should they be counted?
Maybe the question could be changed to
Example: How long would it take to cut the grass along the street?
What are you using to cut the grass: A lawn mower? One you sit on?
How long would it take to cut the grass along the street using our lawn mower?
Also: what does "along the street" mean? Just the grass alongside the road? Maybe you need a
map!
Finding a Central Value
When you have two or more numbers it is nice to find a value for the "center".
2 Numbers
With just 2 numbers the answer is easy: go half-way between.
You can calculate it by adding 3 and 7 and then dividing the result by 2:
(3+7) / 2 = 10/2 = 5
3 or More Numbers
You can use the same idea when you have 3 or more numbers:
Answer: You calculate it by adding 3, 7 and 8 and then dividing the result by 3 (because there
are 3 numbers):
(3+7+8) / 3 = 18/3 = 6
Notice that we divide by 3 because we have 3 numbers ... very important!
The Mean
So far we have been calculating the Mean (or the Average):
Uncle Bob wants to know the average age at the party, to choose an activity.
Add up all the ages, and divide by 11 (because there are 11 numbers):
(13+13+13+13+13+13+1+1+1+1+1) / 11 = 7.5...
The Mean was accurate, but in this case it was not useful.
The Median
But you could also use the Median: simply list all numbers in order and choose the middle one:
3, 4, 7, 9, 12, 15
So we average them:
(7+9) / 2 = 16/2 = 8
The Median is 8
The Mode
The Mode is the value that occurs most often:
"13" occurs 6 times, "1" occurs only 5 times, so the mode is 13.
But Mode can be tricky, there can sometimes be more than one Mode.
When there are two modes it is called "bimodal", when there are three or more modes we call it
"multimodal".
Outliers
Outliers are values that "lie outside" the other values.
They can change the mean a lot, so we can either not use them (and say so) or use the median
or mode instead.
Mean: Add them up, and divide by 5 (as there are 5 numbers):
(3+4+4+5+104) / 5 = 24
(3+4+4+5) / 4 = 4
But please tell people you are not including the outlier.
Median: They are in order, so just choose the middle number, which is 4:
3, 4, 4, 5, 104
3, 4, 4, 5, 104
Conclusion
There are other ways of measuring central values, but Mean, Median and Mode are the most
common.
Use the one that best suits your data. Or better still, use all three!
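As a quick check of the outlier example (3, 4, 4, 5, 104), here is a minimal Python sketch using the standard `statistics` module. The variable names are only for illustration.

```python
import statistics

values = [3, 4, 4, 5, 104]
without_outlier = values[:-1]           # leave out 104 (and say so!)

mean_all = statistics.mean(values)
mean_without = statistics.mean(without_outlier)
median = statistics.median(values)
mode = statistics.mode(values)

print("Mean:", mean_all)
print("Mean without the outlier:", mean_without)
print("Median:", median)
print("Mode:", mode)
```

The outlier pulls the mean far away from the other values, while the median and mode stay at 4.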
How to Find the Mean
The mean is the average of the numbers.
It is easy to calculate: add up all the numbers, then divide by how many numbers there are.
6, 11, 7
The Mean is 8
3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29
3, −7, 5, 13, −2
Mean = (3 − 7 + 5 + 13 − 2) / 5 = 12 / 5 = 2.4
Advanced Topic: the mean we have just looked at is also called the "Arithmetic Mean", because
there are other means such as the Geometric Mean.
Median Value
The Median is the "middle" of a sorted list of numbers.
3, 5, 12
Example:
3, 13, 7, 5, 21, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29
3, 5, 7, 12, 13, 14, 21, 23, 23, 23, 23, 29, 39, 40, 56
3, 5, 7, 12, 13, 14, 21, 23, 23, 23, 23, 29, 39, 40, 56
(It doesn't matter that some numbers are the same in the list.)
Two Numbers in the Middle
BUT, with an even amount of numbers things are slightly different.
In that case we find the middle pair of numbers, and then find the value that is half
way between them. This is easily done by adding them together and dividing by two.
Example:
3, 13, 7, 5, 21, 23, 23, 40, 23, 14, 12, 56, 23, 29
3, 5, 7, 12, 13, 14, 21, 23, 23, 23, 23, 29, 40, 56
There are now fourteen numbers and so we don't have just one middle number, we have a pair
of middle numbers:
3, 5, 7, 12, 13, 14, 21, 23, 23, 23, 23, 29, 40, 56
To find the value halfway between them, add them together and divide by 2:
21 + 23 = 44
then 44 ÷ 2 = 22
(Note that 22 was not in the list of numbers ... but that is OK because half the numbers in the
list are less, and half the numbers are greater.)
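Both cases (one middle number, or a middle pair) can be sketched in Python. The function is written out by hand to show the steps; Python's own `statistics.median` does the same job.

```python
def median(values):
    """Middle value of the sorted list, or halfway between the middle pair."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]                           # odd count: one middle number
    return (ordered[middle - 1] + ordered[middle]) / 2   # even count: average the pair

fifteen = [3, 13, 7, 5, 21, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29]
fourteen = [3, 13, 7, 5, 21, 23, 23, 40, 23, 14, 12, 56, 23, 29]

print(median(fifteen))    # the 8th of 15 sorted numbers
print(median(fourteen))   # halfway between 21 and 23
```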
33 and a half? That means that the 33rd and 34th numbers in the sorted list are the two
middle numbers.
So to find the median: add the 33rd and 34th numbers together and divide by 2.
Example:
3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29
In order these numbers are:
3, 5, 7, 12, 13, 14, 20, 23, 23, 23, 23, 29, 39, 40, 56
Arrange them in order: {8, 15, 19, 19, 28, 29, 35}
19 appears twice, all the rest appear only once, so 19 is the mode.
Example: {1, 3, 3, 3, 4, 4, 6, 6, 6, 9}
Grouping
When all values appear the same number of times the idea of a mode is not useful. But we could
group them to see if one group has more than the others.
In groups of 10, the "20s" appear most often, so we could choose 25 (the middle of the 20s
group) as the mode.
Grouping also helps to find what the typical values are when the real world messes things up!
{35, 36, 32, 42, 58, 56, 35, 39, 46, 47, 34, 37}
It takes longer if there is break time or lunch so an average is not very useful.
30-34: 2
35-39: 5
40-44: 1
45-49: 2
50-54: 0
55-59: 2
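The grouping into 5s can be sketched in Python as follows. The variable names are assumptions, and each group is labeled by its lowest value (so 35 stands for 35 to 39).

```python
from collections import Counter

minutes = [35, 36, 32, 42, 58, 56, 35, 39, 46, 47, 34, 37]
width = 5

# Label each value by the start of its group: 37 -> 35, 42 -> 40, and so on.
groups = Counter((m // width) * width for m in minutes)

for start in range(30, 60, width):
    print(f"{start}-{start + width - 1}: {groups[start]}")

start, count = groups.most_common(1)[0]     # the group that appears most often
midpoint = start + (width - 1) / 2          # middle of that group
print("Modal group starts at", start, "with midpoint", midpoint)
```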
"35-39" appear most often, so we can say it normally takes about 37 minutes to fill a pallet.
Sam is better!
This week:
Sam was better last week and this week ... but you are better over both weeks?
Please explain.
...
Maybe make a table with all the data and do the calculations yourself
Sam You
Last Week
This Week
Both Weeks
....
...
It is All True
Because you had SO MANY shots at goal this week, and did well at them, you lifted your two-
week average above Sam's.
This week:
To be fair, we should really compare the averages when you and Sam have made roughly the
same number of attempts at goal.
If Sam had attempted 100 shots this week, he may have scored 60 out of 100, and his two-week
average would have been about 57%, better than you.
So be careful when comparing two sets of data with widely different counts.
6, 11, 7
The Mean is 8
But sometimes we don't have a simple list of numbers, it might be a frequency table like this
(the "frequency" says how often they occur):
Score Frequency
1 2
2 5
3 4
4 2
5 1
(it says that score 1 occurred 2 times, score 2 occurred 5 times, etc)
But rather than do lots of adds (like 3+3+3+3) it is easier to use multiplication:
And rather than count how many numbers there are, we can add up the frequencies:
Mean = (2×1 + 5×2 + 4×3 + 2×4 + 1×5) / (2 + 5 + 4 + 2 + 1)
Mean = (2 + 10 + 12 + 8 + 5) / 14
= 37 / 14 = 2.64...
Isabella went up and down the street to find out how many parking spaces each house has. Here
are her results:
Parking Spaces    Frequency
1    15
2    27
3    8
4    5
Answer:
Mean = (15×1 + 27×2 + 8×3 + 5×4) / (15 + 27 + 8 + 5)
= (15 + 54 + 24 + 20) / 55
= 113 / 55
= 2.05...
Notation
Now you know how to do it, let's do that last example again, but using formulas.
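The symbol Σ (called Sigma) means "sum up"
(read more at Sigma Notation)
So Σf means "sum up the frequencies" and Σfx means "sum up frequency times value"
(where f is frequency and x is the matching value)
And the formula for calculating the mean from a frequency table is:
Mean = Σfx / Σf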
So now we are ready to do our example above, but with correct notation.
1 15
2 27
3 8
4 5
Example: (continued)
From the previous example, calculate f × x in the right-hand column and then do totals:
x f fx
1 15 15
2 27 54
3 8 24
4 5 20
TOTALS: 55 113
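The totals in the table give the mean as 113 / 55. The same calculation can be sketched in Python; the function name `mean_from_frequencies` is an assumption for this example.

```python
def mean_from_frequencies(table):
    """Mean of a frequency table given as {value x: frequency f}."""
    total_fx = sum(x * f for x, f in table.items())   # sum of f times x
    total_f = sum(table.values())                     # sum of the frequencies
    return total_fx / total_f

parking = {1: 15, 2: 27, 3: 8, 4: 5}      # parking spaces: frequency
scores = {1: 2, 2: 5, 3: 4, 4: 2, 5: 1}   # score: frequency

print(round(mean_from_frequencies(parking), 2))
print(round(mean_from_frequencies(scores), 2))
```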
59, 65, 61, 62, 53, 55, 60, 70, 64, 56, 58, 58, 62, 62, 68, 65, 56, 59, 68, 61, 67
To find the Mean Alex adds up all the numbers, then divides by how many numbers:
Mean = (59+65+61+62+53+55+60+70+64+56+58+58+62+62+68+65+56+59+68+61+67) / 21
= 61.38095...
To find the Median Alex places the numbers in value order and finds the middle number.
53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61 , 62, 62, 62, 64, 65, 65, 67, 68, 68, 70
Median = 61
To find the Mode , or modal value, Alex places the numbers in value order then counts how
many of each number. The Mode is the number which appears most often (there can be more
than one mode):
53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61, 62, 62, 62 , 64, 65, 65, 67, 68, 68, 70
62 appears three times, more often than the other values, so Mode = 62
Seconds Frequency
51 - 55 2
56 - 60 7
61 - 65 8
66 - 70 4
So 2 runners took between 51 and 55 seconds, 7 took between 56 and 60 seconds, etc
Oh No!
Suddenly all the original data gets lost (naughty pup!)
... can we help Alex calculate the Mean, Median and Mode from just that table?
The answer is ... no we can't. Not accurately anyway. But, we can make estimates.
Estimating the Mean from Grouped Data
So all we have left is:
Seconds Frequency
51 - 55 2
56 - 60 7
61 - 65 8
66 - 70 4
The groups (51-55, 56-60, etc), also called class intervals, are of width 5
The midpoints are in the middle of each class: 53, 58, 63 and 68
Think about the 7 runners in the group 56 - 60: all we know is that they ran somewhere
between 56 and 60 seconds:
So we take an average and assume that all seven of them took 58 seconds.
Midpoint Frequency
53 2
58 7
63 8
68 4
Our thinking is: "2 people took 53 sec, 7 people took 58 sec, 8 people took 63 sec and 4 took 68
sec". In other words we imagine the data looks like this:
53, 53, 58, 58, 58, 58, 58, 58, 58, 63, 63, 63, 63, 63, 63, 63, 63, 68, 68, 68, 68
Then we add them all up and divide by 21. The quick way to do it is to multiply each midpoint by
each frequency:
Midpoint x  Frequency f  Midpoint × Frequency fx
53 2 106
58 7 406
63 8 504
68 4 272
Totals: 21 1288
And then our estimate of the mean time to complete the race is:
Estimated Mean = 1288 / 21 = 61.333...
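The midpoint method for the mean can be sketched as follows. The function name estimated_mean and the (lowest, highest, frequency) tuple layout are assumptions for this illustration.

```python
# Grouped frequency table: (lowest value, highest value, frequency)
race_groups = [(51, 55, 2), (56, 60, 7), (61, 65, 8), (66, 70, 4)]

def estimated_mean(groups):
    """Estimate the mean by pretending every value sits at its group's midpoint."""
    total_fx = 0.0
    total_f = 0
    for low, high, freq in groups:
        midpoint = (low + high) / 2   # 53, 58, 63 and 68 for the race data
        total_fx += midpoint * freq
        total_f += freq
    return total_fx / total_f

print(estimated_mean(race_groups))  # 1288 / 21
```

For data such as ages, where the group "0 - 9" really runs up to (but not including) 10, the same function works if the upper limit is passed as 10 so that the midpoint is 5.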
Estimating the Median from Grouped Data
Seconds Frequency
51 - 55 2
56 - 60 7
61 - 65 8
66 - 70 4
The median is the middle value, which in our case is the 11th one, which is in the 61 - 65 group:
But if we want an estimated Median value we need to look more closely at the 61 - 65 group.
We call it "61 - 65", but it really includes values from 60.5 up to (but not including) 65.5.
Why? Well, the values are in whole seconds, so a real time of 60.5 is measured as 61. Likewise
65.4 is measured as 65.
At 60.5 we already have 9 runners, and by the next boundary at 65.5 we have 17 runners. By
drawing a straight line in between we can pick out where the median frequency of n/2 runners
is:
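And this handy formula does the calculation:
Estimated Median = L + ((n/2) - B) / G × w
where:
L is the lower class boundary of the group containing the median
n is the total number of values
B is the cumulative frequency of the groups before the median group
G is the frequency of the median group
w is the group width
For our example:
L = 60.5
n = 21
B = 2 + 7 = 9
G = 8
w = 5
Estimated Median = 60.5 + ((21/2) - 9) / 8 × 5
= 60.5 + 0.9375
= 61.4375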
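The median formula can be sketched in Python like this. The function name estimated_median and the (lower class boundary, frequency) layout are assumptions made for this example; the code finds the median group itself rather than being told which one it is.

```python
def estimated_median(groups, width):
    """groups: list of (lower_class_boundary, frequency) in ascending order."""
    n = sum(freq for _, freq in groups)
    if n == 0:
        raise ValueError("the table has no values")
    cumulative = 0  # B: cumulative frequency of the groups before the current one
    for lower, freq in groups:
        if cumulative + freq >= n / 2:
            # The n/2 point falls in this group: go part-way across it
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq

race = [(50.5, 2), (55.5, 7), (60.5, 8), (65.5, 4)]
print(estimated_median(race, 5))  # 60.5 + 0.9375
```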
Estimating the Mode from Grouped Data
Again, looking at our data:
Seconds Frequency
51 - 55 2
56 - 60 7
61 - 65 8
66 - 70 4
We can easily find the modal group (the group with the highest frequency), which is 61 - 65
But the actual Mode may not even be in that group! Or there may be more than one mode.
Without the raw data we don't really know.
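But we can estimate the Mode using this formula:
Estimated Mode = L + (fₘ - fₘ₋₁) / ((fₘ - fₘ₋₁) + (fₘ - fₘ₊₁)) × w
where:
L is the lower class boundary of the modal group
fₘ₋₁ is the frequency of the group before the modal group
fₘ is the frequency of the modal group
fₘ₊₁ is the frequency of the group after the modal group
w is the group width
In this example:
L = 60.5
fₘ₋₁ = 7
fₘ = 8
fₘ₊₁ = 4
w = 5
Estimated Mode = 60.5 + (8 - 7) / ((8 - 7) + (8 - 4)) × 5
= 60.5 + (1/5) × 5
= 61.5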
(Compare that with the true Mean, Median and Mode of 61.38..., 61 and 62 that we got at the
very start.)
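A sketch of the mode formula in Python is shown below. The function name estimated_mode is an assumption, as is the choice to treat a missing neighbouring group (when the modal group is first or last) as having frequency 0 and to pick the first group when several share the highest frequency.

```python
def estimated_mode(groups, width):
    """groups: list of (lower_class_boundary, frequency) in ascending order."""
    freqs = [freq for _, freq in groups]
    m = freqs.index(max(freqs))                         # first group with the highest frequency
    f_m = freqs[m]
    f_prev = freqs[m - 1] if m > 0 else 0               # assumption: no group before counts as 0
    f_next = freqs[m + 1] if m + 1 < len(freqs) else 0  # assumption: no group after counts as 0
    lower = groups[m][0]
    # Note: this divides by zero if the modal group and both neighbours are equal
    return lower + (f_m - f_prev) / ((f_m - f_prev) + (f_m - f_next)) * width

race = [(50.5, 2), (55.5, 7), (60.5, 8), (65.5, 4)]
print(estimated_mode(race, 5))  # 60.5 + (1/5) * 5
```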
Now let us look at two more examples, and get some more practice along the way!
Example: You grew fifty baby carrots using special soil. You dig them up and measure
their lengths (to the nearest mm) and group the results:
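Length (mm)  Frequency
150 - 154 5
155 - 159 2
160 - 164 6
165 - 169 8
170 - 174 9
175 - 179 11
180 - 184 6
185 - 189 3
Mean
Length (mm)  Midpoint x  Frequency f  fx
150 - 154 152 5 760
155 - 159 157 2 314
160 - 164 162 6 972
165 - 169 167 8 1336
170 - 174 172 9 1548
175 - 179 177 11 1947
180 - 184 182 6 1092
185 - 189 187 3 561
Totals: 50 8530
Estimated Mean = 8530 / 50 = 170.6 mm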
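Median
The Median is the mean of the 25th and the 26th length, so is in the 170 - 174 group:
L = 169.5 (the lower class boundary of the 170 - 174 group)
n = 50
B = 21
G = 9
w = 5
Estimated Median = 169.5 + ((50/2) - 21) / 9 × 5
= 169.5 + 2.22...
= 171.7 mm (to 1 decimal)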
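Mode
The Modal group is the one with the highest frequency, which is 175 - 179:
L = 174.5 (the lower class boundary of the 175 - 179 group)
fₘ₋₁ = 9
fₘ = 11
fₘ₊₁ = 6
w = 5
Estimated Mode = 174.5 + (11 - 9) / ((11 - 9) + (11 - 6)) × 5
= 174.5 + 1.42...
= 175.9 mm (to 1 decimal)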
Age Example
Age is a special case.
When we say "Sarah is 17" she stays "17" up until her eighteenth birthday.
She might be 17 years and 364 days old and still be called "17".
Example: The ages of the 112 people who live on a tropical island are grouped as
follows:
Age Number
0-9 20
10 - 19 21
20 - 29 23
30 - 39 16
40 - 49 11
50 - 59 10
60 - 69 7
70 - 79 3
80 - 89 1
A child in the first group 0 - 9 could be almost 10 years old. So the midpoint for this group
is 5 not 4.5
The midpoints are 5, 15, 25, 35, 45, 55, 65, 75 and 85
Similarly, in the calculations of Median and Mode, we will use the class boundaries 0, 10, 20 etc
Mean
Age  Midpoint x  Number f  fx
0 - 9 5 20 100
10 - 19 15 21 315
20 - 29 25 23 575
30 - 39 35 16 560
40 - 49 45 11 495
50 - 59 55 10 550
60 - 69 65 7 455
70 - 79 75 3 225
80 - 89 85 1 85
Totals: 112 3360
Estimated Mean = 3360 / 112 = 30
Median
The Median is the mean of the ages of the 56th and the 57th people, so is in the 20 - 29 group:
L = 20 (the lower class boundary of the class interval containing the median)
n = 112
B = 20 + 21 = 41
G = 23
w = 10
Estimated Median = 20 + ((112/2) - 41) / 23 × 10
= 20 + 6.52...
= 26.5 (to 1 decimal)
Mode
The Modal group is the one with the highest frequency, which is 20 - 29:
Estimated Mode = 20 + (23 - 21) / ((23 - 21) + (23 - 16)) × 10
= 20 + 2.22...
= 22.2 (to 1 decimal)
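Estimated Median = L + ((n/2) - B) / G × w
where:
L is the lower class boundary of the group containing the median
n is the total number of values
B is the cumulative frequency of the groups before the median group
G is the frequency of the median group
w is the group width
Estimated Mode = L + (fₘ - fₘ₋₁) / ((fₘ - fₘ₋₁) + (fₘ - fₘ₊₁)) × w
where:
L is the lower class boundary of the modal group
fₘ₋₁ is the frequency of the group before the modal group
fₘ is the frequency of the modal group
fₘ₊₁ is the frequency of the group after the modal group
w is the group width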
Weighted Mean
Also called Weighted Average
Mean
When we do a simple mean (or average), we give equal weight to each number.
Mean = (1 + 2 + 3 + 4) / 4 = 10 / 4 = 2.5
Weights
We could think that each of those numbers has a "weight" of ¼ (because there are 4 numbers):
Mean = ¼ × 1 + ¼ × 2 + ¼ × 3 + ¼ × 4
= 0.25 + 0.5 + 0.75 + 1 = 2.5
Same answer.
Now let's change the weight of 3 to 0.7, and the weights of the other numbers to 0.1 so the
total of the weights is still 1:
Mean = 0.1 × 1 + 0.1 × 2 + 0.7 × 3 + 0.1 × 4
= 0.1 + 0.2 + 2.1 + 0.4 = 2.8
This weighted mean is now a little higher ("pulled" there by the weight of 3).
Example: Sam wants to buy a new camera, and decides on the following rating
system:
The Sany camera gets 8 (out of 10) for Image Quality, 6 for Battery Life and 7 for Zoom Range
The Conan camera gets 9 for Image Quality, 4 for Battery Life and 6 for Zoom Range
Example: Alex usually works 7 days a week, but sometimes just 1, 2, or 5 days.
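Alex worked:
on 2 weeks: 1 day each week
on 14 weeks: 2 days each week
on 8 weeks: 5 days each week
on 32 weeks: 7 days each week
Use the number of weeks as the weights:
Weeks × Days = 2 × 1 + 14 × 2 + 8 × 5 + 32 × 7
= 2 + 28 + 40 + 224 = 294
Also add up the weeks:
Weeks = 2 + 14 + 8 + 32 = 56
Divide:
Mean = 294 / 56 = 5.25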
But it is often better to use a table to make sure you have all the numbers correct:
Example (continued):
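Let's use:
w for the number of weeks (the weight)
x for the days worked (the value we want the mean of)
Weight w  Days x  wx
2 1 2
14 2 28
8 5 40
32 7 224
Σw = 56  Σwx = 294
Divide Σwx by Σw:
Mean = 294 / 56 = 5.25
And that leads us to our formula:
Weighted Mean = Σwx / Σw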
In other words: multiply each weight w by its matching value x, sum that all up, and divide by
the sum of weights.
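The weighted mean rule can be sketched as a short Python function. The name weighted_mean and its argument order are assumptions for this illustration. Dividing by the sum of the weights means the same function also covers the case where the weights already add to 1.

```python
def weighted_mean(values, weights):
    """Multiply each weight by its matching value, sum, then divide by the sum of weights."""
    if len(values) != len(weights):
        raise ValueError("each value needs exactly one weight")
    total_wx = sum(w * x for w, x in zip(weights, values))
    return total_wx / sum(weights)

# Weights that add to 1 (the 1, 2, 3, 4 example with 3 given a weight of 0.7)
print(weighted_mean([1, 2, 3, 4], [0.1, 0.1, 0.7, 0.1]))

# Alex's working weeks: days worked, weighted by the number of weeks
print(weighted_mean([1, 2, 5, 7], [2, 14, 8, 32]))
```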
Summary
Weighted Mean: A mean where some values contribute more than others.
When the weights add to 1: just multiply each weight by the matching value and sum it
all up
Otherwise, multiply each weight w by its matching value x, sum that all up, and divide
by the sum of weights:
Weighted Mean = Σwx / Σw
So the range is 9 - 3 = 6.
It is that simple!
The single value of 3616 makes the range large, but most values are around 10.
Quartiles
Quartiles are the values that divide a list of numbers into quarters:
Like this:
Example: 5, 7, 4, 4, 6, 2, 8
Quartile 1 (Q1) = 4
Quartile 2 (Q2), which is also the Median, = 5
Quartile 3 (Q3) = 7
Sometimes a "cut" is between two numbers ... the Quartile is the average of the two numbers.
Example: 1, 3, 3, 4, 5, 6, 6, 7, 8, 8
Q2 = (5+6)/2 = 5.5
Quartile 1 (Q1) = 3
Quartile 2 (Q2) = 5.5
Quartile 3 (Q3) = 7
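The "cut into quarters" method used in these two examples can be sketched in Python. The function name quartiles is an assumption, and so is the convention: the median is left out of both halves when the list has an odd length. Other conventions exist and can give slightly different quartiles.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median of each half of the sorted data."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # values below the middle cut
    upper = data[half + (n % 2):]  # values above it (skips the median itself when n is odd)
    return statistics.median(lower), statistics.median(data), statistics.median(upper)

print(quartiles([5, 7, 4, 4, 6, 2, 8]))
print(quartiles([1, 3, 3, 4, 5, 6, 6, 7, 8, 8]))
```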
Interquartile Range
The "Interquartile Range" is from Q1 to Q3:
Example:
Q3 - Q1 = 7 - 4 = 3
Also:
So now we have enough data for the Box and Whisker Plot:
Q3 - Q1 = 15 - 4 = 11
Percentiles
Percentile: the value below which a percentage of data falls.
If your height is 1.85m then "1.85m" is the 80th percentile height in that group.
In Order
Have the data in order, so you know which values are above and below.
To calculate percentiles of height: have the data in height order (sorted by height).
To calculate percentiles of age: have the data in age order.
And so on.
Grouped Data
When the data is grouped:
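In the test 12% got D, 50% got C, 30% got B and 8% got A
You got a B, so add up all the 12% that got D, all the 50% that got C and half of the 30% that got B:
12% + 50% + 15% = 77%
In other words you did "as well as or better than 77% of the class"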
(Why take half of B? Because you shouldn't imagine you got the "Best B", or the "Worst B", just
an average B.)
Deciles
Deciles are similar to Percentiles (sounds like decimal and percentile together), as they split the
data into 10% groups:
The 1st decile is the 10th percentile (the value that divides the data so that 10% is below it)
The 2nd decile is the 20th percentile (the value that divides the data so that 20% is below it)
etc!
Example: (continued)
Quartiles
Another related idea is Quartiles, which split the data into quarters:
Example: 1, 3, 3, 4, 5, 6, 6, 7, 8, 8
Q2 = (5+6)/2 = 5.5
Quartile 1 (Q1) = 3
Quartile 2 (Q2) = 5.5
Quartile 3 (Q3) = 7
The Quartiles also divide the data into divisions of 25%, so:
For 1, 3, 3, 4, 5, 6, 6, 7, 8, 8:
Estimating Percentiles
We can estimate percentiles from a line graph.
Example: Shopping
Time (hours)  Visitors so far
0 0
2 350
4 1100
6 2400
8 6500
10 8850
12 10,000
a) Estimate the 30th percentile (when 30% of the visitors had arrived).
b) Estimate what percentile of visitors had arrived after 11 hours.
First draw a line graph of the data: plot the points and join them with a smooth curve:
a) The 30th percentile is when 30% of the 10,000 visitors (3,000 people) had arrived.
Draw a line horizontally across from 3,000 until you hit the curve, then draw a line vertically
downwards to read off the time on the horizontal axis:
b) To estimate the percentile of visits after 11 hours: draw a line vertically up from 11 until you
hit the curve, then draw a line horizontally across to read off the number of visitors on the vertical
axis:
So the visits at 11 hours were about 9,500, which is the 95th percentile.
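The graph reading can be approximated numerically. The sketch below joins the points with straight lines instead of the smooth curve used above (an assumption made to keep the code simple), so its answers differ a little from the values read off the graph. The helper name interpolate is also an assumption.

```python
hours = [0, 2, 4, 6, 8, 10, 12]
visitors = [0, 350, 1100, 2400, 6500, 8850, 10000]

def interpolate(x, xs, ys):
    """Straight-line interpolation of y at x; xs must be in increasing order."""
    points = list(zip(xs, ys))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (x - x0) * (y1 - y0) / (x1 - x0)
    raise ValueError("x is outside the range of the data")

total = visitors[-1]

# a) time by which 30% of the visitors had arrived
time_30th = interpolate(0.30 * total, visitors, hours)

# b) percentage of visitors who had arrived after 11 hours
percent_at_11 = 100 * interpolate(11, hours, visitors) / total

print(round(time_30th, 2), round(percent_at_11, 2))
```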
Mean Deviation
How far, on average, all values are from the middle.
Calculating It
Find the mean of all values ... use it to work out distances ... then find the mean of those
distances!
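In three steps:
1. Find the mean of all values
2. Find the distance of each value from that mean (subtract the mean from each value, ignore minus signs)
3. Then find the mean of those distances
Like this:
Example: the Mean Deviation of 3, 6, 6, 7, 8, 11, 15, 16
Step 1: Find the mean:
Mean = (3 + 6 + 6 + 7 + 8 + 11 + 15 + 16) / 8 = 72 / 8 = 9
Step 2: Find the distance of each value from that mean:
Value  Distance from 9
3 6
6 3
6 3
7 2
8 1
11 2
15 6
16 7
Step 3: Find the mean of those distances:
Mean Deviation = (6 + 3 + 3 + 2 + 1 + 2 + 6 + 7) / 8 = 30 / 8 = 3.75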
It tells us how far, on average, all values are from the middle.
In that example the values are, on average, 3.75 away from the middle.
Formula
The formula is:
Mean Deviation = Σ|x - μ| / N
Σ is Sigma, which means to sum up
|| (the vertical bars) mean Absolute Value, basically to ignore minus signs
x is each value (such as 3 or 16)
μ is the mean (in our example μ = 9)
N is the number of values (in our example N = 8)
Absolute Deviation
Each distance we calculate is called an Absolute Deviation, because it is the Absolute Value of
the deviation (how far from the mean).
To show "Absolute Value" we put "|" marks either side like this:
|-3| = 3
For any value x:
Absolute Deviation = |x - μ|
From our example, the value 16 has Absolute Deviation = |x - μ| = |16 - 9| = |7| = 7
Sigma
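The symbol for "Sum Up" is Σ (called Sigma Notation), so we have:
Mean Deviation = Σ|x - μ| / N
Step 1: Find the mean:
μ = (3 + 6 + 6 + 7 + 8 + 11 + 15 + 16) / 8 = 72 / 8 = 9
Step 2: Find the Absolute Deviations:
x  |x - μ|
3 6
6 3
6 3
7 2
8 1
11 2
15 6
16 7
Σ|x - μ| = 30
Step 3: Find the Mean Deviation:
Mean Deviation = Σ|x - μ| / N = 30 / 8 = 3.75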
Note: the mean deviation is sometimes called the Mean Absolute Deviation (MAD) because it is
the mean of the absolute deviations.
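The three steps translate directly into Python. This is a minimal sketch; the function name mean_deviation is an assumption for this example.

```python
def mean_deviation(values):
    """Mean of the absolute deviations from the mean."""
    n = len(values)
    mean = sum(values) / n                         # Step 1: the mean
    distances = [abs(x - mean) for x in values]    # Step 2: absolute deviations
    return sum(distances) / n                      # Step 3: mean of those distances

print(mean_deviation([3, 6, 6, 7, 8, 11, 15, 16]))
print(mean_deviation([600, 470, 170, 430, 300]))   # the dog heights used on this page
```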
Here is an example (using the same data as on the Standard Deviation page):
Example: You and your friends have just measured the heights of your dogs (in
millimeters):
The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm.
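Step 1: Find the mean:
μ = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394
Step 2: Find the Absolute Deviations:
x  |x - μ|
600 206
470 76
170 224
430 36
300 94
Σ|x - μ| = 636
Step 3: Find the Mean Deviation:
Mean Deviation = Σ|x - μ| / N = 636 / 5 = 127.2
So, on average, the dogs' heights are 127.2 mm from the mean.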
A Useful Check
The deviations on one side of the mean should equal the deviations on the other side.
6+3+3+2+1 = 2+6+7
15 = 15
Likewise:
Example: Dogs
206 + 76 + 36 = 224 + 94
318 = 318
If they are not equal ... you may have made a mistake!
Standard Deviation
The Standard Deviation is a measure of how spread out numbers are.
The formula is easy: it is the square root of the Variance. So now you ask, "What is the
Variance?"
Variance
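The Variance is defined as:
The average of the squared differences from the Mean.
To calculate the Variance follow these steps:
Work out the Mean (the simple average of the numbers)
Then for each number: subtract the Mean and square the result (the squared
difference).
Then work out the average of those squared differences. (Why Square?)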
Example
You and your friends have just measured the heights of your dogs (in millimeters):
The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm.
Find out the Mean, the Variance, and the Standard Deviation.
Answer:
Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394
so the mean (average) height is 394 mm. Let's plot this on the chart:
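To calculate the Variance, take each difference, square it, and then average the result:
Variance = (206² + 76² + (-224)² + 36² + (-94)²) / 5
= (42,436 + 5,776 + 50,176 + 1,296 + 8,836) / 5
= 108,520 / 5
= 21,704
So the Variance is 21,704
And the Standard Deviation is just the square root of Variance, so:
Standard Deviation
= √21,704
= 147.32...
= 147 (to the nearest mm)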
And the good thing about the Standard Deviation is that it is useful. Now we can show which
heights are within one Standard Deviation (147mm) of the Mean:
So, using the Standard Deviation we have a "standard" way of knowing what is normal, and what
is extra large or extra small.
Rottweilers are tall dogs. And Dachshunds are a bit short ... but don't tell them!
Now try the Standard Deviation Calculator.
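But if the data is a Sample (a selection taken from a bigger Population), then the calculation
changes!
When you have N data values that are:
The Population: divide by N when calculating the Variance (like we did)
A Sample: divide by N-1 when calculating the Variance
All other calculations stay the same, including how we calculated the mean.
Example: if our 5 dogs are just a sample of a bigger population of dogs, we divide by 4 instead
of 5 like this:
Sample Variance = 108,520 / 4 = 27,130 (108,520 is the sum of the squared differences)
Sample Standard Deviation = √27,130 = 164.7... = 165 (to the nearest mm)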
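Python's standard statistics module has separate functions for the two cases, which makes a handy check on the dog example. The variable names below are chosen for this illustration.

```python
import statistics

heights = [600, 470, 170, 430, 300]  # dog heights in mm

# Treating the 5 dogs as the whole population (divide by N)
pop_variance = statistics.pvariance(heights)
pop_sd = statistics.pstdev(heights)

# Treating the 5 dogs as a sample of a bigger population (divide by N-1)
sample_variance = statistics.variance(heights)
sample_sd = statistics.stdev(heights)

print(pop_variance, round(pop_sd, 2))
print(sample_variance, round(sample_sd, 2))
```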
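Formulas
Here are the two formulas, explained at Standard Deviation Formulas if you want to know
more:
The "Population Standard Deviation": σ = √( (1/N) Σ (xi - μ)² )
The "Sample Standard Deviation": s = √( (1/(N-1)) Σ (xi - x̄)² )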
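If we just add up the differences from the mean ... the negatives cancel the positives:
(4 + 4 - 4 - 4) / 4 = 0
So that won't work. How about we use absolute values?
(|4| + |4| + |-4| + |-4|) / 4 = (4 + 4 + 4 + 4) / 4 = 4
That looks good (and is the Mean Deviation), but what about this case:
(|7| + |1| + |-6| + |-2|) / 4 = (7 + 1 + 6 + 2) / 4 = 4
Oh No! It also gives a value of 4, even though the differences are more spread out.
So let us try squaring each difference (and taking the square root at the end):
√( (4² + 4² + 4² + 4²) / 4 ) = √(64 / 4) = 4
√( (7² + 1² + 6² + 2²) / 4 ) = √(90 / 4) = 4.74...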
That is nice! The Standard Deviation is bigger when the differences are more spread out ... just
what we want.
In fact this method is a similar idea to distance between points, just applied in a different way.
And it is easier to use algebra on squares and square roots than absolute values, which makes
the standard deviation easy to use in other areas of mathematics.
Standard Deviation
The Standard Deviation is a measure of how spread out numbers are.
You might like to read this simpler page on Standard Deviation first.
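The formula for the Standard Deviation of a whole population is:
σ = √( (1/N) Σ (xi - μ)² )
It describes these steps:
1. Work out the Mean (the simple average of the numbers)
2. Then for each number: subtract the Mean and square the result
3. Then work out the mean of those squared differences
4. Take the square root of that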
The formula actually says all of that, and I will show you how.
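Step 1. Work out the mean
In the formula above μ (the Greek letter "mu") is the mean of all our values ...
Example: 9, 2, 5, 4, 12, 7, 8, 11, 9, 3, 7, 4, 12, 5, 4, 10, 9, 6, 9, 4
The mean is:
(9+2+5+4+12+7+8+11+9+3+7+4+12+5+4+10+9+6+9+4) / 20
= 140 / 20 = 7
So:
μ = 7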
Step 2. Then for each number: subtract the Mean and square the result
So it says "for each value, subtract the mean and square the result", like this
Example (continued):
(9 - 7)² = (2)² = 4
(2 - 7)² = (-5)² = 25
(5 - 7)² = (-2)² = 4
(4 - 7)² = (-3)² = 9
(12 - 7)² = (5)² = 25
(7 - 7)² = (0)² = 0
(8 - 7)² = (1)² = 1
... etc ...
But how do we say "add them all up" in mathematics? We use "Sigma":
Sigma Notation
We want to add up all the values from 1 to N, where N=20 in our case because there are 20
values:
Example (continued):
We already calculated (x1 - 7)² = 4 etc. in the previous step, so just sum them up:
Σ(xi - μ)² = 4+25+4+9+25+0+1+16+4+16+0+9+25+4+9+9+4+1+4+9 = 178
But that isn't the mean yet, we need to divide by how many, which is simply done by
multiplying by "1/N":
Example (continued):
Mean of squared differences = (1/20) × 178 = 8.9
Example (concluded):
σ = √(8.9) = 2.983...
DONE!
Example: Sam has 20 rose bushes, but only counted the flowers on 6 of them!
The "population" is all 20 rose bushes,
and the "sample" is the 6 bushes that Sam counted the flowers of.
9, 2, 5, 4, 12, 7
But when we use the sample as an estimate of the whole population, the Standard Deviation
formula changes to this:
The formula for Sample Standard Deviation:
s = √( (1/(N-1)) Σ (xi - x̄)² )
The important change is "N-1" instead of "N" (which is called "Bessel's correction").
The symbols also change to reflect that we are working on a sample instead of the whole
population:
The mean is now x̄ (for sample mean) instead of μ (the population mean),
And the answer is s (for Sample Standard Deviation) instead of σ.
But that does not affect the calculations. Only N-1 instead of N changes the calculations.
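Example 2: using the sampled values 9, 2, 5, 4, 12, 7
The mean is (9+2+5+4+12+7) / 6 = 39 / 6 = 6.5
So:
x̄ = 6.5
Step 2. Then for each number: subtract the Mean and square the result
Example 2 (continued):
(9 - 6.5)² = (2.5)² = 6.25
(2 - 6.5)² = (-4.5)² = 20.25
(5 - 6.5)² = (-1.5)² = 2.25
(4 - 6.5)² = (-2.5)² = 6.25
(12 - 6.5)² = (5.5)² = 30.25
(7 - 6.5)² = (0.5)² = 0.25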
To work out the mean, add up all the values then divide by how many.
But hang on ... we are calculating the Sample Standard Deviation, so instead of dividing by how
many (N), we will divide by N-1
Example 2 (continued):
Sum = 6.25 + 20.25 + 2.25 + 6.25 + 30.25 + 0.25 = 65.5
Divide by N-1: (1/5) × 65.5 = 13.1
Example 2 (concluded):
s = √(13.1) = 3.619...
DONE!
Comparing
When we used the whole population we got: Mean = 7, Standard Deviation = 2.983...
When we used the sample we got: Sample Mean = 6.5, Sample Standard Deviation = 3.619...
Our Sample Mean was wrong by 7%, and our Sample Standard Deviation was wrong by 21%.
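The step-by-step method, including the switch from N to N-1, can be sketched as one Python function. The function name standard_deviation and its sample flag are assumptions made for this example.

```python
import math

def standard_deviation(values, sample=False):
    """Population standard deviation by default; divides by N-1 when sample=True."""
    n = len(values)
    mean = sum(values) / n                             # Step 1: work out the mean
    squared_diffs = [(x - mean) ** 2 for x in values]  # Step 2: subtract the mean, square
    divisor = n - 1 if sample else n                   # Step 3: N-1 for a sample, N otherwise
    return math.sqrt(sum(squared_diffs) / divisor)     # Step 4: take the square root

roses = [9, 2, 5, 4, 12, 7, 8, 11, 9, 3, 7, 4, 12, 5, 4, 10, 9, 6, 9, 4]
counted = roses[:6]  # the 6 bushes Sam counted: 9, 2, 5, 4, 12, 7

sigma = standard_deviation(roses)             # whole population
s = standard_deviation(counted, sample=True)  # sample estimate
error_percent = 100 * (s - sigma) / sigma     # how far the sample estimate is off

print(round(sigma, 3), round(s, 3), round(error_percent))
```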
Why Take a Sample?
Mostly because it is easier and cheaper.
Imagine you want to know what the whole country thinks ... you can't ask millions of people, so
instead you ask maybe 1,000 people.
"You don't have to eat the whole ox to know that the meat is tough."
This is the essential idea of sampling. To find out information about the population (such as
mean and standard deviation), we do not need to look at all members of the population; we only
need a sample.
Summary
Example: Travel Time (minutes): 15, 29, 8, 42, 35, 21, 18, 42, 26
The variable is Travel Time
Bivariate means "two variables", in other words there are two types of data
With bivariate data you have two sets of related data that you want to compare:
Example:
An ice cream shop keeps track of how much ice cream they sell versus the temperature on that
day.
Temperature (°C)  Ice Cream Sales
16.4 $325
11.9 $185
15.2 $332
18.5 $406
22.1 $522
19.4 $412
25.1 $614
23.4 $544
18.1 $421
22.6 $445
17.2 $408
Now we can easily see that warmer weather and more ice cream sales are linked, but the
relationship is not perfect.
So with bivariate data we are interested in comparing the two sets of data and finding any
relationships.
Scatter Plots
A Scatter (XY) Plot has points that show the relationship between two sets of
data.
In this example, each dot shows one person's weight versus their height.
(The data is plotted on the graph as "Cartesian (x,y) Coordinates")
Example:
The local ice cream shop keeps track of how much ice cream they sell versus the noon
temperature on that day. Here are their figures for the last 12 days:
Temperature (°C)  Ice Cream Sales
14.2 $215
16.4 $325
11.9 $185
15.2 $332
18.5 $406
22.1 $522
19.4 $412
25.1 $614
23.4 $544
18.1 $421
22.6 $445
17.2 $408
It is now easy to see that warmer weather leads to more sales, but the relationship is not
perfect.
Try to have the line as close as possible to all points, and as many points above the line as
below.
Here we use linear extrapolation to estimate the sales at 29 °C (which is higher than any
value we have).
Careful: Extrapolation can give misleading results because we are in "uncharted territory".
As well as using a graph (like above) we can create a formula to help us.
We can estimate a straight line equation from two points from the graph above
Let's estimate two points on the line near actual values: (12, $180) and (25, $610)
First, find the slope:
slope m = (change in y) / (change in x)
= ($610 - $180) / (25 - 12)
= $430 / 13
= 33 (rounded)
Now put the slope and the point (12, $180) into the "point-slope" formula:
y - y1 = m(x - x1)
y - 180 = 33(x - 12)
y = 33(x - 12) + 180
y = 33x - 396 + 180
y = 33x - 216
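Interpolating: use the equation for a temperature inside the range of our data.
Extrapolating: use the equation for a temperature outside that range, for example at 29 °C:
y = 33 × 29 - 216 = $741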
The values are close to what we got on the graph. But that doesn't mean they are more (or less)
accurate. They are all just estimates.
Don't use extrapolation too far! What sales would you expect at 0 °C?
Note: we used linear (based on a line) interpolation and extrapolation, but there are many
other types, for example we could use polynomials to make curvy lines, etc.
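The straight-line estimate can be sketched in Python using the two points read off the graph, (12, $180) and (25, $610). The function name estimated_sales is an assumption for this illustration, and the slope is rounded to 33 to match the working above.

```python
x1, y1 = 12, 180
x2, y2 = 25, 610

m = round((y2 - y1) / (x2 - x1))  # slope: 430 / 13 = 33.07..., rounded to 33
c = y1 - m * x1                   # intercept: 180 - 33 * 12 = -216

def estimated_sales(temperature):
    """Sales estimate in dollars from the line y = m*x + c."""
    return m * temperature + c

print(estimated_sales(29))  # extrapolation: hotter than any day in the data
print(estimated_sales(0))   # extrapolating too far gives a negative, meaningless value
```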
Correlation
When the two sets of data are strongly linked together we say they have a High Correlation.
Like this:
(Learn More About Correlation)
Negative Correlation
Correlations can be negative, which means there is a correlation but one value goes down as the
other value increases.
Country  Yearly Production per Person  Birth Rate
Note: I tried to fit a straight line to the data, but maybe a curve would work better, what do you
think?
Outliers
"Outliers" are values that "lie outside" the other values.
When we collect data, sometimes there are values that are "far away" from the main group
of data ... what do we do with them?
A new coach has been working with the Long Jump team this month, and the athletes'
performance has changed.
Augustus can now jump 0.15m further, June and Carol can jump 0.06m further.
Augustus: +0.15m
Tom: +0.11m
June: +0.06m
Carol: +0.06m
Bob: +0.12m
Sam: -0.56m
But is that fair? Can we just get rid of values we don't like?
What To Do?
You need to think "why is that value over there?"
We find out that Sam was feeling sick that day. Not the coach's fault at all.
When we remove outliers we are changing the data, it is no longer "pure", so we shouldn't just
get rid of the outliers without a good reason!
And when we do get rid of them, we should explain what we are doing and why.
So it seems that outliers have the biggest effect on the mean, and not so much on the median or
mode.
Hint: calculate the median and mode when you have outliers.
Correlation
When two sets of data are strongly linked together we say they have a High Correlation.
The value shows how good the correlation is (not how steep the line is), and if it is positive or
negative.
Temperature (°C)  Ice Cream Sales
14.2 $215
16.4 $325
11.9 $185
15.2 $332
18.5 $406
22.1 $522
19.4 $412
25.1 $614
23.4 $544
18.1 $421
22.6 $445
17.2 $408
We can easily see that warmer weather and higher sales go together. The relationship is good
but not perfect.
In fact the correlation is 0.9575 ... see at the end how I calculated it.
It gets so hot that people aren't going near the shop, and sales start dropping.
The calculated correlation value is 0 (I worked it out), which means "no correlation".
But we can see the data does have a correlation: it follows a nice curve that reaches a peak
around 25 °C.
But the linear correlation calculation is not "smart" enough to see this.
Our Ice Cream shop finds how many sunglasses were sold by a big store for each day and
compares them to their ice cream sales:
Does this mean that sunglasses make people want ice cream?
A few years ago a survey of employees found a strong positive correlation between "Studying
an external course" and Sick Days.
How To Calculate
How did I calculate the value 0.9575 at the top?
I used "Pearson's Correlation". There is software that can calculate it, such as the CORREL()
function in Excel or LibreOffice Calc ...
... but here is how to calculate it yourself:
Let us call the two sets of data "x" and "y" (in our case Temperature is x and Ice Cream Sales
is y):
Here is how I calculated the first Ice Cream example (values rounded to 1 or 0 decimal places):
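As a formula it is:
r = Σ(x - x̄)(y - ȳ) / √( Σ(x - x̄)² × Σ(y - ȳ)² )
Where:
Σ is Sigma, the symbol for "sum up"
(x - x̄) is each x-value minus the mean of x
(y - ȳ) is each y-value minus the mean of y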
You probably won't have to calculate it like that, but at least you know it is not "magic", but
simply a routine set of calculations.
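The routine set of calculations for Pearson's Correlation can be sketched in Python. The function name pearson is an assumption for this example. It is run on the twelve ice cream figures listed earlier.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation: sum(a*b) / sqrt(sum(a^2) * sum(b^2))."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = [x - mean_x for x in xs]  # each x-value minus the mean of x
    b = [y - mean_y for y in ys]  # each y-value minus the mean of y
    sum_ab = sum(p * q for p, q in zip(a, b))
    sum_a2 = sum(p * p for p in a)
    sum_b2 = sum(q * q for q in b)
    return sum_ab / math.sqrt(sum_a2 * sum_b2)

temperature = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]

r = pearson(temperature, sales)
print(round(r, 4))
```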
Other Methods
There are other ways to calculate a correlation coefficient, such as "Spearman's rank correlation
coefficient", but I prefer using a spreadsheet like above.
Probability
How likely something is to happen.
Many events can't be predicted with total certainty. The best we can say is how likely they are
to happen, using the idea of probability.
Tossing a Coin
heads (H) or
tails (T)
Throwing Dice
Probability
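In general:
Probability of an event happening = Number of ways it can happen / Total number of outcomes
Example: the chances of rolling a "4" with a die
Number of ways it can happen: 1 (there is only 1 face with a "4" on it)
Total number of outcomes: 6 (there are 6 faces altogether)
So the probability = 1/6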
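Example: there are 5 marbles in a bag: 4 are blue, and 1 is red. What is the
probability that a blue marble gets picked?
Number of ways it can happen: 4 (there are 4 blues)
Total number of outcomes: 5 (there are 5 marbles in total)
So the probability = 4/5 = 0.8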
Probability Line
We can show probability on a Probability Line:
Example: toss a coin 100 times, how many Heads will come up?
Words
Some words have special meaning in Probability:
Tossing a coin, throwing dice, seeing what pizza people choose are all examples of experiments.
So the Sample Space is all 52 possible cards: {Ace of Hearts, 2 of Hearts, etc... }
"King" is not a sample point. As there are 4 Kings that is 4 different sample points.
Example Events:
Example: Alex wants to see how many times a "double" comes up when
throwing 2 dice.
The Event Alex is looking for is a "double", where both dice have the same number. It is made
up of these 6 Sample Points:
{1,1} {2,2} {3,3} {4,4} {5,5} and {6,6}
These are Alex's results:
Experiment  Is it a Double?
{3,4} No
{5,1} No
{2,2} Yes
{6,3} No
... ...
After 100 Experiments, Alex has 19 "double" Events ... is that close to what you would expect?
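To see what Alex should expect, the sample space of two dice can be listed and the doubles counted. The sketch below uses exact fractions; the variable names are chosen for this illustration.

```python
from fractions import Fraction
from itertools import product

# Every sample point for two dice: (1,1), (1,2), ... (6,6)
sample_space = list(product(range(1, 7), repeat=2))

# The event "double": both dice show the same number
doubles = [(a, b) for a, b in sample_space if a == b]

p_double = Fraction(len(doubles), len(sample_space))
expected_in_100 = 100 * p_double  # expected number of doubles in 100 experiments

print(p_double, float(expected_in_100))
```

So 19 doubles in 100 experiments is a little above the expected value of about 16.7, which is quite plausible for a chance experiment.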
Probability Line
Probability is the chance that something will happen. It can be shown on a line:
As well as words, we can use numbers (such as fractions or decimals) to show the probability of
something happening:
Impossible is zero
Certain is one.
Between 0 and 1
The probability of an event will not be less than 0.
This is because 0 is impossible (sure that something will not happen).
The probability of an event will not be more than 1.
This is because 1 is certain that something will happen.
sedan or hatchback
There are 5 colors available:
Total Choices = 2 × 5 × 3 = 30
Independent or Dependent?
But it only works when all choices are independent of each other.
If one choice affects another choice (i.e. depends on another choice), then a simple
multiplication is not right.
the salesman says "You can't choose black for the hatchback" ... well then things change!
But you can still make your life easier with this calculation:
Choices = 5×3 + 4×3 = 15 + 12 = 27
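Both counts can be checked by listing every combination. In this sketch the colour names other than black, and the three entries of third_choice, are hypothetical placeholders, because the text only says there are 5 colours and a third choice with 3 options.

```python
from itertools import product

body_styles = ["sedan", "hatchback"]
colors = ["black", "white", "red", "blue", "silver"]  # hypothetical names, apart from black
third_choice = ["option 1", "option 2", "option 3"]   # hypothetical: the 3-way choice

# Independent choices: simply multiply (2 x 5 x 3)
all_choices = list(product(body_styles, colors, third_choice))

# Dependent choices: black is not available for the hatchback
allowed = [c for c in all_choices if not (c[0] == "hatchback" and c[1] == "black")]

print(len(all_choices), len(allowed))
```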
Relative Frequency
How often something happens divided by all outcomes.
Example: Your team has won 9 games from a total of 12 games played:
the Relative Frequency of winning is 9/12 = 0.75 = 75%
All the Relative Frequencies add up to 1 (except for any rounding error).
35 used a car
42 took public transport
8 rode a bicycle
7 walked
0.38+0.46+0.09+0.08 = 1.01
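The relative frequencies for the travel figures above can be sketched like this. The dictionary keys are labels chosen for this example.

```python
counts = {"car": 35, "public transport": 42, "bicycle": 8, "walking": 7}

total = sum(counts.values())  # all outcomes
relative = {name: count / total for name, count in counts.items()}
rounded = {name: round(value, 2) for name, value in relative.items()}

print(rounded)
print(sum(relative.values()))           # the exact values add up to 1
print(round(sum(rounded.values()), 2))  # the rounded values show the rounding error
```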
A single die
Interesting point
Many people think that one of these cubes is called "a dice". But no!
The plural is dice, but the singular is die. (i.e. 1 die, 2 dice.)
Are they all just as likely? Or will some happen more often?
The Experiment
Throw a die 60 times,
record the scores in a tally table.
You can record the results in this table using tally marks:
Total Frequency = 60
OK, Go!
... ...
Finished ...?
60 Throws
OK, why did I ask you to make 60 throws? Well, 6 throws is not enough for good results. 600
will give good results but is a lot of work. So 60 seems OK, and is also 10 lots of 6.
This graph and your graph should be similar, but they are not likely to be exactly the same, as
your experiment relied on chance, and the number of times you did it was fairly small.
If you did the experiment a very large number of times, you would get results much closer to the
theoretical ones.
Questions
Which face came up most often? ____
Do you think you would get the same results if you did this again? Yes / No
Probability
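On the page Probability you will find a formula:
Probability of an event happening = Number of ways it can happen / Total number of outcomes
Example: Probability of a 2
There is 1 way to get a 2, and there are 6 possible outcomes, so:
Probability of a 2 = 1/6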
Score Probability
1 1/6
2 1/6
3 1/6
4 1/6
5 1/6
6 1/6
Total = 1
Two dice
Example: imagine one die is colored red and the other is colored blue.
Question: If you throw 2 dice together and add the two scores:
Are they all just as likely? Or will some happen more often?
The Experiment
Throw two dice together 108 times,
add the scores together each time,
record the scores in a tally table.
Why 108? That seems a strange number to choose. I will explain later.
You can record the results in this table using tally marks:
Added Scores  Tally  Frequency
2
3
4
5
6
7
8
9
10
11
12
OK, Go!
...
...
Finished ...?
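Here is a table of all possible outcomes, and the totals. The totals that add to 7 run along
the diagonal from bottom-left to top-right.
Score on one die:  1 2 3 4 5 6
Score on the other die (first number in each row), followed by the totals:
1: 2 3 4 5 6 7
2: 3 4 5 6 7 8
3: 4 5 6 7 8 9
4: 5 6 7 8 9 10
5: 6 7 8 9 10 11
6: 7 8 9 10 11 12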
You can see there is only 1 way to get 2, there are 2 ways to get 3, and so on.
Let us count the ways of getting each total and put them in a table:
Total Score  Number of Ways to Get Score
2 1
3 2
4 3
5 4
6 5
7 6
8 5
9 4
10 3
11 2
12 1
Total = 36
Can you see the Symmetry in this table?
108 Throws
OK, why 108 throws? Well, 36 throws isn't enough for good results, 360 throws would be great but
takes a long time. So 108 (which is 3 lots of 36) seems just right.
So multiply the number of ways by 3 to get how often we expect each total in 108 throws:
Total Score  Expected Frequency (in 108 throws)
2 3
3 6
4 9
5 12
6 15
7 18
8 15
9 12
10 9
11 6
12 3
Total = 108
Those are the theoretical values, as opposed to the experimental ones you got from your
experiment.
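Both theoretical tables can be generated by listing all 36 outcomes of two dice. This is a small sketch; the variable names are chosen for this illustration.

```python
from collections import Counter
from itertools import product

# Count how many of the 36 outcomes give each total
ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# 108 throws is 3 lots of 36, so each count is multiplied by 3
expected_108 = {total: 108 * count // 36 for total, count in ways.items()}

for total in sorted(ways):
    print(total, ways[total], expected_108[total])
```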
This graph and your graph should be quite similar, but they are not likely to be exactly the same,
as your experiment relied on chance, and the number of times you did it was fairly small.
If you did the experiment a very large number of times, you should get results much closer to
the theoretical ones.
And, by the way, we've now answered the question from near the beginning of the experiment:
Probability
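On the page Probability you will find a formula:
Probability of an event happening = Number of ways it can happen / Total number of outcomes
Example: Probability of a total of 2
There is 1 way to get a total of 2, and there are 36 possible outcomes, so:
Probability of a 2 = 1/36
Total Score  Probability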
2 1/36
3 2/36
4 3/36
5 4/36
6 5/36
7 6/36
8 5/36
9 4/36
10 3/36
11 2/36
12 1/36
Total = 1
A man (Georges-Louis Leclerc, the Count of Buffon, see "Buffon's Needle") started thinking
about this and worked out how to calculate the probability.
Steps
Measure the diameter of your coin: ____ mm
Also measure the spacing of your grid (it may not print at exactly 30mm): ____ mm
Put your sheet of paper on a flat surface such as a table top or the floor.
From a height of about 5cm, drop the coin onto the paper and record whether it lands:
A: Completely inside a square (not touching any grid lines)
B: Touching or crossing one or more grid lines
The exact height from which you drop the coin is not important, but don't drop it so close to the
paper that you are cheating!
If the coin rolls completely off the paper, then do not count that turn.
100 Times
Now we will drop the coin 100 times, but first ...
OK let's begin.
Drop the coin 100 times and record A (does not touch a line) or B (touches a line) using Tally
Marks:
Place your coin on your grid (like above), and then put a mark on the paper where the center of
the coin is (just a rough estimate will do).
See how the coin's center is one radius r away from a line.
Make lots of "center marks" then draw a box connecting them all like below:
d = Coin's diameter (2 r)
When a coin's center is within the yellow box it won't touch any line.
The yellow box is smaller than the grid by two radiuses (= one diameter) of the coin.
So what are the areas?
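The above calculation was for a 30 mm grid, but we can use S for grid size:
Grid area = S²
Yellow Box area = (S - d)²
Probability of landing completely inside a square (A) = (S - d)² / S²
For example, with a 29 mm grid and a 16.25 mm coin:
Yellow Box = (29 - 16.25)² = 12.75² = 162.56... mm² (163 mm² to the nearest mm²)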
So you should expect the coin to land not crossing a line of the grid approximately:
Now do the calculations for your own grid size and coin size.
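The area calculation can be sketched as a Python function so that any grid size S and coin diameter d can be tried. The function name probability_inside is an assumption, and the 29 mm grid with a 16.25 mm coin is taken from the yellow box calculation above.

```python
def probability_inside(grid_size, coin_diameter):
    """Chance that a dropped coin lands completely inside a square (case A)."""
    if coin_diameter >= grid_size:
        return 0.0  # the coin is too big to fit without touching a line
    yellow_box = (grid_size - coin_diameter) ** 2  # where the coin's center may land
    grid_area = grid_size ** 2
    return yellow_box / grid_area

p_a = probability_inside(29, 16.25)
print(round(100 * p_a, 1), round(100 * (1 - p_a), 1))  # "A" (%) and "B" (%)
```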
"A" (%):
"B" (%):
First calculate the theoretical value ... how does this affect the values for A and B?
Then do the experiment to see how close it gets.
You have done some geometry, and had some experience calculating areas and probabilities.
And you have seen the relationship between theory and reality.
A few hundred years ago people enjoyed betting on coins tossed on to the floor: would the coin
cross a line or not?
A man (Georges-Louis Leclerc, the Count of Buffon) started thinking about this and worked out
the probability.
Steps
Measure the spacing of your lines (it may not print at exactly 50mm): ____ mm
Measure the length of your match (must be less than the line spacing): ____ mm
Make sure your sheet of paper is on a flat surface such as a table top or the floor.
From a height of about 5cm, drop the match onto the paper and record whether it lands:
A: Not touching any line
B: Touching or crossing a line
The exact height from which you drop the match is not important, but don't drop it so close to
the paper that you are cheating!
If the match rolls completely off the paper, then do not count that turn.
100 Times
Now we will drop the match 100 times, but first ...
OK let's begin.
Drop the match 100 times and record A (does not touch a grid line) or B (touches or crosses a
grid line) using Tally Marks:
A (no touch): ____
B (crosses): ____
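Now draw a Bar Graph to illustrate your results. You can create one at Data Graphs (Bar, Line
and Pie).
The results of this experiment can be used to estimate the value of π (Pi), using:
π ≈ 2L / (xp)
Where
L is the length of the match
x is the line spacing
p is the proportion of drops that cross a line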
We can do it too!
Example: Sam had a match of length 31 mm, and a 40 mm line spacing and 49
of 100 drops crossed the line
So Sam had:
L = 31
x = 40
p = 49/100 = 0.49
So: π ≈ (2 × 31) / (40 × 0.49) = 62 / 19.6 = 3.16...
Now it's your turn. Fill in the following table using your own results:
And we get:
p ≈ 2L / (πx)
Example: Alex had a match of length 36 mm, and a 50 mm line spacing.
So Alex had:
L = 36
x = 50
p ≈ (2 × 36) / (π × 50) ≈ 0.46...
So Alex should expect the match to cross the line (case B) 46 times out of 100
And you have seen the relationship between theory and reality.
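Both directions of the match-dropping formula can be sketched in Python. The function names estimate_pi and crossing_probability are assumptions for this example. As in the steps above, the match must be shorter than the line spacing.

```python
import math

def estimate_pi(match_length, line_spacing, crossing_proportion):
    """Estimate pi from an experiment: pi is roughly 2L / (x * p)."""
    return 2 * match_length / (line_spacing * crossing_proportion)

def crossing_probability(match_length, line_spacing):
    """Theoretical chance that the match crosses a line: p = 2L / (pi * x)."""
    return 2 * match_length / (math.pi * line_spacing)

print(estimate_pi(31, 40, 0.49))     # Sam: 49 crossings in 100 drops
print(crossing_probability(36, 50))  # Alex: what to expect before dropping
```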
Random Words
Probability and English ... what a mix!
Random Letters
You would think it was easy to create random words ... just pick letters randomly and put them
together, and voila! a random word.
tldkl oewkx dmwol vuptg hvwjk naqid avypr zwtip zgnzs bvdhd
muyfd ighgd xhlng oyecn vjnsl ssjrx gxald tukxj rvfoq yxzxq
It turns out that the words are not only nonsense, but quite hard to pronounce!
You see, making a real word this way is very unlikely ... you would have to try lots of random combinations
before getting lucky.
Why? Well, English has around 200,000 words (228,000 in the Oxford English Dictionary
including many words no longer used) ... but how many different words can be made with just 5
letters?
Let us guess that there are 40,000 words in English that have 5 letters. So the probability of
making a real word just randomly would be:
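As a rough check of that probability, the sketch below assumes a 26-letter alphabet and treats every 5-letter string as equally likely. The 40,000 figure is the guess from the text, not a measured count.

```python
letters_in_alphabet = 26
word_length = 5
possible_strings = letters_in_alphabet ** word_length  # every 5-letter combination
real_words_guess = 40_000  # the guess used in the text, not a measured count

probability = real_words_guess / possible_strings
print(possible_strings)       # 11881376
print(round(probability, 4))  # 0.0034
```

So under these assumptions only about 1 random string in 300 would be a real word.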
Vowels
We can improve our success by insisting that a word have at least one vowel, since nearly every
word in English has one (except fly, by and a few others). Like this:
ectot gjaqv kuifg vzicu zspsu pdidb wqdis uerrs ucgej okimw
fnevz ewxko ljgew aglgo jpfoq dcytu uwkcj dzioy wekdx xuybk
But there are still lots of strange words like "zspsu" and "xuybk"
Letter Frequency
So, our next improvement is to use fewer of the letters like j, x, z and q and more of the letters like
e, t and s.
In fact the frequency of letters in the English Language is well known. Here is how many times
you would expect to see a letter in every 1,000 letters:
a b c d e f g h i j k l m n o p q r s t u v w x y z
82 15 28 42 127 22 20 61 70 2 8 40 24 67 75 19 1 60 63 90 27 10 24 2 20 1
"e" is likely to occur 127 times in every 1,000, or as a ratio 127/1000 = 0.127 (= 12.7%)
"z" is likely to occur only 1 time in every 1,000, or as a ratio 1/1000 = 0.001 (= 0.1%)
So, by selecting letters based on that frequency (a bit like rolling a 1,000-sided die
that has 82 a's, 15 b's ... and only one z), we can get output like this:
elnao etgov segty laast aessn siuon oenha eaoas ncoot ctwka
dmswo dpuoh eewis ebdni laarm syucs idvos lhina igahh soyie
Still no real words, but some are close. And most of them can be pronounced. (Great names if
you are writing a science fiction novel!)
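Frequency-based selection can be sketched with a weighted random choice. The weights below are the per-1,000 counts from the table; the function name and the seed are illustrative assumptions.

```python
import random

# Occurrences per 1,000 letters, copied from the table above (a to z)
LETTERS = "abcdefghijklmnopqrstuvwxyz"
PER_THOUSAND = [82, 15, 28, 42, 127, 22, 20, 61, 70, 2, 8, 40, 24,
                67, 75, 19, 1, 60, 63, 90, 27, 10, 24, 2, 20, 1]

def random_word(length, rng):
    """Pick each letter independently, weighted by its frequency."""
    return "".join(rng.choices(LETTERS, weights=PER_THOUSAND, k=length))

rng = random.Random(2024)  # fixed seed so the run can be repeated
print(" ".join(random_word(5, rng) for _ in range(10)))
```

Each letter is still chosen independently of its neighbours, which is why the output is pronounceable but rarely a real word.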
2-Letter Frequencies
We can take the idea of Letter Frequency one step further by asking
For example, if we already have a "t", the next letter is very likely to be an "h" (making "th").
Freq a b c d e f g h i j k l m n o p q r s t u v w x y z
t 238 41 727 11 3197 459 275 18 12 990 149 153 333 125 65 54
So, "h" occurred 3197 times after a "t" ("th") ... but "b" never followed a "t"
OK, let us start with a "t", and let us say we choose an "h" to make "th", then next we would use
the "h"-row to choose another letter (maybe an "e" to make "the"), and so on ... well, here is a
sample:
The results are remarkable ... nonsense, but almost like some strange language.
In fact we are not just making random words now, we are making random sentences!
3 Letter Frequencies
How do 3 Letter Frequencies work?
Well, say I already have two letters (like "ei") ... we then:
look through the sample text for every time "ei" appears,
randomly choose one of those
look for the letter following "ei" (possibly "t").
then add the "t" to make "eit"
and start again using "it" (... always the last two letters)
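The steps above can be sketched as follows. This is an illustration under assumptions: the sample text is made up for the demo, and the function stops early if the last two letters never appear with a following letter.

```python
import random

def generate(sample_text, start_pair, total_length, rng):
    """Grow text one letter at a time from the last two letters."""
    result = start_pair
    while len(result) < total_length:
        pair = result[-2:]
        # Every letter that follows this pair somewhere in the sample text
        followers = [sample_text[i + 2]
                     for i in range(len(sample_text) - 2)
                     if sample_text[i:i + 2] == pair]
        if not followers:  # the pair has no following letter in the sample
            break
        # Picking a random occurrence is the same as picking from this list
        result += rng.choice(followers)
    return result

sample = "either the weight of their eight reindeer or neither"  # made-up sample text
print(generate(sample, "ei", 30, random.Random(7)))
```

A longer sample text gives more varied output, because each pair of letters then has more possible followers.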
Here is a sample:
Now, that looks good! By sampling from a real source we can get good results.
4 Letter Frequencies
Using the same method I used groups of 3 Letters to decide on the 4th letter and got:
5 Letter Frequencies
Rules
Different Lotteries have different rules.
Here we will use a typical lottery where the player chooses 6 different numbers out of 49.
Example:
You enter the lottery by buying a ticket and selecting your six numbers.
On Saturday they draw the lottery, and the winning numbers are:
Choosing Numbers
Seriously!
Instead of numbers they could be symbols, or colors, the lottery would still work.
So it doesn't matter what numbers you choose, the chances are all the same.
The people who run lotteries have strict rules to stop the "rigging" of results. But random chance
can sometimes produce strange results.
For example, using The Spinner I did 1000 spins for 10 numbers and came up with this:
Does this mean 7 will now come up more often, or less often? In fact it doesn't mean anything: 7
is just as likely as any other number to get chosen.
Popular Numbers
But there is a trick! People have favorite numbers, so when popular numbers come up you are
sharing the winnings with lots of people.
Birthdays are popular choices, so people choose 1-12 and 1-31 more often. Also lucky numbers.
So maybe you should choose unpopular numbers so when you DO win you get lots of money.
(This assumes your lottery is one where prizes are shared among winners.)
Regret
Don't choose the same numbers every week. It's a trap! If you forget a week you then
worry that your numbers will come up, and this forces you to buy a ticket every week (even
when you have other more important things to do).
My advice:
Syndicates
A "Syndicate" is a group of people who all put in a little money so the group can buy lots of
tickets. The chance of winning goes up, but your payout each time is less (because you are
sharing).
Syndicates can be fun because they are sociable ... a way of making and keeping friendships.
Plus some syndicates like to spend small winnings on everyone going out for a meal together.
Another good reason for joining a syndicate is that your chances of winning go up (but what you
win goes down).
Think about it ... winning Ten Million would totally change your life, but winning One Million
would also improve your life a lot. You might prefer ten times the chance of winning the million.
You can use the Combinations and Permutations Calculator to work it out (use n=49, r=6, 'No'
for Is Order Important? and 'No' for Is Repetition allowed?)
49C6 = 49! / (43! × 6!) = 13,983,816
1 Week
50 Years
(1 - 1/13983816)^2600 = 0.999814...
Which means the probability of winning (after 50 Years) is: 1 - 0.999814... = 0.000186...
And you would have spent thousands for that small chance.
You could have bought a new TV, a computer and phone for that money.
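These figures can be reproduced in a few lines. The sketch assumes one ticket per weekly draw for 50 years (2,600 draws) in a 6-from-49 lottery.

```python
import math

tickets = math.comb(49, 6)       # ways to choose 6 numbers from 49
draws = 50 * 52                  # one ticket a week for 50 years

p_never_win = (1 - 1 / tickets) ** draws
p_win_at_least_once = 1 - p_never_win

print(tickets)                        # 13983816
print(round(p_never_win, 6))          # 0.999814
print(round(p_win_at_least_once, 6))  # 0.000186
```

Working with the complement ("never win") is easier here than adding up the chance of winning in each separate week.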
Your Turn
Now your turn:
Probability: Complement
Complement of an Event: All outcomes that are NOT the event.
So the Complement of an event is all the other outcomes (not the ones we want).
And together the Event and its Complement make all possible outcomes.
Probability
Probability of an event happening = (Number of ways it can happen) / (Total number of outcomes)
Number of ways it can happen: 1 (there is only 1 face with a "4" on it)
Total number of outcomes: 6 (there are 6 faces altogether)
So the probability = 1/6
The complement is shown by a little mark after the letter such as A' (or sometimes A^c or Ā):
P(A) + P(A') = 1
Event A is {5, 6}
P(A) = 2/6 = 1/3
P(A') = 4/6 = 2/3
P(A) + P(A') = 1/3 + 2/3 = 3/3 = 1
Example. Throw two dice. What is the probability the two scores
are different?
Different scores are like getting a 2 and 3, or a 6 and 1. It is quite a long list:
But the complement (which is when the two scores are the same) is only 6 outcomes:
P(A) = 1 - P(A')
= 1 - 1/6
= 5/6
So in this case (and many others) it's easier to work out P(A') first, then find P(A)
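The two-dice example can be checked by listing every outcome. A minimal sketch, assuming two fair six-sided dice:

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))   # all 36 ordered outcomes
same = [r for r in rolls if r[0] == r[1]]      # the complement: matching scores

p_same = Fraction(len(same), len(rolls))
p_different = 1 - p_same

print(p_same)       # 1/6
print(p_different)  # 5/6
```

Counting the 6 matching outcomes is much quicker than counting the 30 that differ, which is the point of using the complement.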
The toss of a coin, throw of a die and lottery draws are all examples of random events.
Events
When we say "Event" we mean one (or more) outcomes.
Example Events:
Choosing a "King" from a deck of cards (any of the 4 Kings) is also an event
Rolling an "even number" (2, 4 or 6) is an event
Independent Events
Events can be "Independent", meaning each event is not affected by any other events.
This is an important idea! A coin does not "know" that it came up heads before ... each toss of a
coin is a perfect isolated thing.
Example: You toss a coin three times and it comes up "Heads" each time ... what is the chance
that the next toss will also be a "Head"?
The chance is simply 1/2, or 50%, just like ANY OTHER toss of the coin.
What it did in the past will not affect the current toss!
Some people think "it is overdue for a Tail", but really truly the next toss of the coin is totally
independent of any previous tosses.
Saying "a Tail is due", or "just one more go, my luck is due" is called The Gambler's
Fallacy
Dependent Events
But some events can be "dependent" ... which means they can be affected by previous
events.
After taking one card from the deck there are fewer cards available, so the probabilities change!
If the 1st card was a King, then the 2nd card is less likely to be a King, as only 3 of the 51
cards left are Kings.
If the 1st card was not a King, then the 2nd card is slightly more likely to be a King, as 4 of
the 51 cards left are Kings.
Replacement: When we put each card back after drawing it the chances don't change, as the
events are independent.
Without Replacement: The chances will change, and the events are dependent.
Tree Diagrams
When we have Dependent Events it helps to make a "Tree Diagram"
You are off to soccer, and love being the Goalkeeper, but that depends who is the Coach today:
Sam is Coach more often ... about 6 of every 10 games (a probability of 0.6).
Start with the Coaches. We know 0.6 for Sam, so it must be 0.4 for Alex (the probabilities must
add to 1):
Then fill out the branches for Sam (0.5 Yes and 0.5 No), and then for Alex (0.3 Yes and 0.7 No):
Now it is neatly laid out we can calculate probabilities (read more at Tree Diagrams ).
Mutually Exclusive
Mutually Exclusive means we can't get both events at the same time.
Examples:
Turning left or right are Mutually Exclusive (you can't do both at the same time)
Heads and Tails are Mutually Exclusive
Kings and Aces are Mutually Exclusive
Like here:
You need to get a "feel" for them to be a smart and successful person.
The toss of a coin, throwing dice and lottery draws are all examples of random events.
Example: taking colored marbles from a bag: as you take each marble there are fewer marbles left
in the bag, so the probabilities change.
We call those Dependent Events, because what happens depends on what happened before
(learn more about this at Conditional probability ).
Example: You toss a coin and it comes up "Heads" three times ... what is the
chance that the next toss will also be a "Head"?
The chance is simply 1/2 (or 0.5), just like ANY toss of the coin.
What it did in the past will not affect the current toss!
Some people think "it is overdue for a Tail", but really truly the next toss of the coin is totally
independent of any previous tosses.
Saying "a Tail is due", or "just one more go, my luck is due" is called The Gambler's
Fallacy
Of course your luck may change, because each toss of the coin has an equal chance.
Example: what is the probability of getting a "4" or "6" when rolling a die?
Total number of outcomes: 6 ("1", "2", "3", "4", "5" and "6")
As a decimal: 0.5
As a fraction: 1/2
As a percentage: 50%
Or sometimes like this: 1-in-2
So each toss of a coin has a 1/2 chance of being Heads, but lots of Heads in a row is unlikely.
Example: Why is it unlikely to get, say, 7 heads in a row, when each toss of a
coin has a 1/2 chance of being Heads?
Question 2: Given that we have just got 6 heads in a row, what is the probability that the
next toss is also a head?
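The two questions have different answers. A short sketch, assuming a fair coin and independent tosses:

```python
from fractions import Fraction

p_heads = Fraction(1, 2)  # assumes a fair coin

# Chance of 7 heads in a row, decided before any toss is made
p_seven_in_a_row = p_heads ** 7

# Chance that the next toss is a head, given 6 heads already happened:
# the earlier tosses do not matter, so it is still 1/2
p_next_given_six = p_heads

print(p_seven_in_a_row)   # 1/128
print(p_next_given_six)   # 1/2
```

The run of 7 is unlikely as a whole, but once 6 heads are already on the table only one more independent toss remains.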
You can have a play with the Quincunx to see how lots of independent effects can still have a
pattern.
Notation
We use "P" to mean "Probability Of",
Example: your boss (to be fair) randomly assigns everyone an extra 2 hours
work on weekend evenings between 4 and midnight.
Time: you want the 2 hours of 6-to-8, out of the 8 hours of 4-to-midnight:
And:
= 0.5 × 0.25
= 0.125
Or a 12.5% Chance
(Note: we could ALSO have worked out that you wanted 2 hours out of a total possible 16 hours,
which is 2/16 = 0.125. Both methods work here.)
Another Example
Imagine there are two groups:
A member of each group gets randomly chosen for the winners circle,
then one of those gets randomly chosen to get the big money prize:
So you have a 1/5 chance followed by a 1/2 chance ... which makes a 1/10 chance overall:
1/5 × 1/2 = 1/(5 × 2) = 1/10
So your chance of winning the big money is 0.1 (which is the same as 1/10).
Coincidence!
Many "Coincidences" are, in fact, likely.
Example: you are in a room with 30 people, and find that Zach and Anna
celebrate their birthday on the same day.
Do you say:
Because you are comparing everyone to everyone else (not just one to many).
Example: Snap!
Did you ever say something at exactly the same time as someone else?
Wow, how amazing!
But you were probably sharing an experience (movie, journey, whatever) and so your thoughts
were similar.
... if you speak enough words together, they will eventually match up.
Can you think of other cases where a "coincidence" was simply a likely thing?
Conclusion
Probability is: (Number of ways it can happen) / (Total number of outcomes)
Dependent Events (such as removing marbles from a bag) are affected by previous
events
Independent events (such as a coin toss) are not affected by previous events
Not all coincidences are really unlikely (when you think about them).
Conditional Probability
How to handle Dependent Events
Life is full of random events! You need to get a "feel" for them to be a smart and successful
person.
Independent Events
Events can be " Independent ", meaning each event is not affected by any other events.
Example: Tossing a coin.
What it did in the past will not affect the current toss.
The chance is simply 1-in-2, or 50%, just like ANY toss of the coin.
Dependent Events
But events can also be "dependent" ... which means they can be affected by previous
events ...
The chance is 2 in 5
See how the chances change each time? Each event depends on what happened in the previous
event, and is called dependent.
"Replacement"
Note: if we replace the marbles in the bag each time, then the chances do not change and the
events are independent :
With Replacement: the events are Independent (the chances don't change)
Without Replacement: the events are Dependent (the chances change)
Tree Diagram
A Tree Diagram is a wonderful way to picture what is going on, so let's build one for our
marbles example.
There is a 2/5 chance of pulling out a Blue marble, and a 3/5 chance for Red:
We can go one step further and see what happens when we pick a second marble:
If a blue marble was selected first there is now a 1/4 chance of getting a blue marble and a 3/4
chance of getting a red marble.
If a red marble was selected first there is now a 2/4 chance of getting a blue marble and a 2/4
chance of getting a red marble.
Now we can answer questions like "What are the chances of drawing 2 blue marbles?"
Did you see how we multiplied the chances? And got 1/10 as a result.
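The multiplication can be cross-checked by listing every possible pair of draws. The sketch assumes a bag of 2 blue and 3 red marbles, which is what the 2/5 and 3/5 chances imply.

```python
from fractions import Fraction
from itertools import permutations

bag = ["blue", "blue", "red", "red", "red"]   # 2 blue and 3 red marbles

# Multiply along the branch: P(blue first) x P(blue second | blue first)
p_two_blue = Fraction(2, 5) * Fraction(1, 4)

# Cross-check by listing every ordered pair of draws without replacement
draws = list(permutations(range(len(bag)), 2))
both_blue = [d for d in draws if bag[d[0]] == "blue" and bag[d[1]] == "blue"]

print(p_two_blue)                            # 1/10
print(Fraction(len(both_blue), len(draws)))  # 1/10
```

Both routes agree: 2 of the 20 equally likely ordered pairs are blue followed by blue.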
Notation
We love notation in mathematics! It means we can then use the power of algebra to play
around with the ideas. So here is the notation for probability:
In our marbles example Event A is "get a Blue Marble first" with a probability of 2/5:
P(A) = 2/5
And Event B is "get a Blue Marble second" ... but for that we have 2 choices:
So we have to say which one we want, and use the symbol "|" to mean "given":
In other words, event A has already happened, now what is the chance of event B?
P(B|A) = 1/4
And we write it as
For the first card the chance of drawing a King is 4 out of 52 (there are 4 Kings in a deck of 52
cards):
P(A) = 4/52
But after removing a King from the deck the 2nd card drawn is less likely to be
a King (only 3 of the 51 cards left are Kings):
P(B|A) = 3/51
And so:
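Multiplying the two values gives the chance of drawing two Kings in a row. A minimal check, using the rule P(A and B) = P(A) × P(B|A) and assuming a standard 52-card deck:

```python
from fractions import Fraction

p_first_king = Fraction(4, 52)                # P(A)
p_second_king_given_first = Fraction(3, 51)   # P(B|A)

p_two_kings = p_first_king * p_second_king_given_first
print(p_two_kings)         # 1/221
print(float(p_two_kings))  # roughly 0.0045
```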
70% of your friends like Chocolate, and 35% like Chocolate AND like Strawberry.
Sam is Coach more often ... about 6 out of every 10 games (a probability of 0.6).
Let's build a tree diagram . First we show the two possible coaches: Sam or Alex:
The probability of getting Sam is 0.6, so the probability of Alex must be 0.4 (together the
probability is 1)
Now, if you get Sam, there is 0.5 probability of being Goalie (and 0.5 of not being Goalie):
If you get Alex, there is 0.3 probability of being Goalie (and 0.7 not):
The tree diagram is complete, now let's calculate the overall probabilities. Remember that:
P(A and B) = P(A) x P(B|A)
(When we take the 0.6 chance of Sam being coach times the 0.5 chance that Sam will let you be
Goalkeeper we end up with a 0.3 chance.)
Check
One final step: complete the calculations and make sure they add to 1:
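The whole tree can be computed and checked in one pass. This sketch uses the probabilities given above; the dictionary layout is just one convenient way to hold the branches.

```python
# Probabilities taken from the tree: coach first, then goalkeeper yes/no
coach = {"Sam": 0.6, "Alex": 0.4}
goalie_given_coach = {"Sam": 0.5, "Alex": 0.3}

branches = {}
for name, p_coach in coach.items():
    p_yes = goalie_given_coach[name]
    branches[(name, "Yes")] = p_coach * p_yes        # multiply along the branch
    branches[(name, "No")] = p_coach * (1 - p_yes)

p_goalie = branches[("Sam", "Yes")] + branches[("Alex", "Yes")]
total = sum(branches.values())

print(round(p_goalie, 2))  # 0.42
print(round(total, 2))     # 1.0
```

The four branch values are 0.3, 0.3, 0.12 and 0.28, so the overall chance of being Goalkeeper is 0.42 and everything adds to 1.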
4 friends (Alex, Blake, Chris and Dusty) each choose a random number between 1 and
5. What is the chance that any of them chose the same number?
As a tree diagram :
If Alex and Blake did match, then Chris has only one number to compare to.
But if Alex and Blake did not match then Chris has two numbers to compare to.
For the top line (Alex and Blake did match) we already have a match (a chance of 1/5).
But for the "Alex and Blake did not match" there is now a 2/5 chance of Chris matching
(because Chris gets to match his number against both Alex and Blake).
And we can work out the combined chance by multiplying the chances it took to get there:
Following the "No, Yes" path ... there is a 4/5 chance of No, followed by a 2/5 chance of
Yes:
Following the "No, No" path ... there is a 4/5 chance of No, followed by a 3/5 chance of
No:
OK, that is all 4 friends, and the "Yes" chances together make 101/125:
Answer: 101/125
But here is something interesting ... if we follow the "No" path we can skip all the other
calculations and make our life easier:
1 - (24/125) = 101/125
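The answer can be confirmed by brute force, listing every way the four friends could choose. A minimal sketch, assuming each friend picks independently and uniformly from 1 to 5:

```python
from fractions import Fraction
from itertools import product

choices = list(product(range(1, 6), repeat=4))   # 5^4 = 625 ways to choose

# "No match" means all four numbers are different
no_match = [c for c in choices if len(set(c)) == 4]

p_no_match = Fraction(len(no_match), len(choices))
print(p_no_match)      # 24/125
print(1 - p_no_match)  # 101/125
```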
That was a simple example using independent events (each toss of a coin is independent of the
previous toss), but tree diagrams are really wonderful for figuring out dependent events (where
an event depends on what happens in the previous event) like this example:
Example: Soccer Game
You are off to soccer, and love being the Goalkeeper, but that depends who is the Coach today:
Sam is Coach more often ... about 6 out of every 10 games (a probability of 0.6).
Let's build the tree diagram. First we show the two possible coaches: Sam or Alex:
The probability of getting Sam is 0.6, so the probability of Alex must be 0.4 (together the
probability is 1)
Now, if you get Sam, there is 0.5 probability of being Goalie (and 0.5 of not being Goalie):
If you get Alex, there is 0.3 probability of being Goalie (and 0.7 not):
The tree diagram is complete, now let's calculate the overall probabilities. This is done by
multiplying each probability along the "branches" of the tree.
Check
One final step: complete the calculations and make sure they add to 1:
Conclusion
So there you go, when in doubt draw a tree diagram, multiply along the branches and add the
columns. Make sure all probabilities add to 1 and you are good to go.
Examples:
Turning left and turning right are Mutually Exclusive (you can't do both at the same time)
Tossing a coin: Heads and Tails are Mutually Exclusive
Cards: Kings and Aces are Mutually Exclusive
What is not Mutually Exclusive:
Turning left and scratching your head can happen at the same time
Kings and Hearts, because we can have a King of Hearts!
Like here:
Probability
Let's look at the probabilities of Mutually Exclusive events. But first, a definition:
Mutually Exclusive
When two events (call them "A" and "B") are Mutually Exclusive it is impossible for them to
happen together:
P(A and B) = 0
In a Deck of 52 Cards:
So, we have:
Special Notation
Instead of "and" you will often see the symbol ∩ (which is the "Intersection" symbol used
in Venn Diagrams)
Instead of "or" you will often see the symbol ∪ (the "Union" symbol)
P(King ∩ Queen) = 0
P(King ∪ Queen) = (1/13) + (1/13) = 2/13
Then:
Which is written:
P(A ∩ B) = 0
P(A ∪ B) = 20% + 15% = 35%
Remembering
To help you remember, think:
A Final Example
16 people study French, 21 study Spanish and there are 30 altogether. Work out the
probabilities!
This is definitely a case of not Mutually Exclusive (you can study French AND Spanish).
And we get:
(16 - b) + b + (21 - b) = 30
37 - b = 30
b = 7
P(French) = 16/30
P(Spanish) = 21/30
P(French Only) = 9/30
P(Spanish Only) = 14/30
P(French or Spanish) = 30/30 = 1
P(French and Spanish) = 7/30
Yes, it works!
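The same check can be written out in code. The sketch uses the numbers from the example and the rule P(A or B) = P(A) + P(B) - P(A and B); the variable names are illustrative.

```python
from fractions import Fraction

french, spanish, total = 16, 21, 30   # numbers from the example

# (16 - b) + b + (21 - b) = 30  rearranges to  b = 16 + 21 - 30
both = french + spanish - total

p_french = Fraction(french, total)
p_spanish = Fraction(spanish, total)
p_both = Fraction(both, total)
p_either = p_french + p_spanish - p_both   # P(A or B) = P(A) + P(B) - P(A and B)

print(both)      # 7
print(p_either)  # 1
```

Without subtracting the overlap, the 7 people who study both languages would be counted twice.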
Summary:
Mutually Exclusive
A or B is the sum of A and B minus A and B: P(A or B) = P(A) + P(B) - P(A and B)
Symbols
"My fruit salad is a combination of apples, grapes and bananas" We don't care
what order the fruits are in, they could also be "bananas, grapes and apples" or "grapes,
apples and bananas", it's the same fruit salad.
"The combination to the safe is 472". Now we do care about the order. "724" won't
work, nor will "247". It has to be exactly 4-7-2.
In other words:
Permutations
There are basically two types of permutation:
Repetition is Allowed: such as the lock above. It could be "333".
No Repetition: for example the first three people in a running race. You can't be
first and second.
When a thing has n different types ... we have n choices each time!
n × n × n
(n multiplied 3 times)
More generally: choosing r of something that has n different types, the permutations are:
n × n × ... (r times)
(In other words, there are n possibilities for the first choice, THEN there are n possibilities for the
second choice, and so on, multiplying each time.)
n × n × ... (r times) = n^r
Example: in the lock above, there are 10 numbers to choose from (0,1,2,3,4,5,6,7,8,9) and we
choose 3 of them:
10 × 10 × 10 = 10^3 = 1,000 permutations, which is n^r with n = 10 and r = 3
In this case, we have to reduce the number of available choices each time.
So, our first choice has 16 possibilities, and our next choice has 15 possibilities, then 14, 13, etc.
And the total permutations are:
16 × 15 × 14 × 13 × ... = 20,922,789,888,000
But maybe we don't want to choose them all, just 3 of them, so that is only:
16 × 15 × 14 = 3,360
In other words, there are 3,360 different ways that 3 pool balls could be arranged out of 16
balls.
But how do we write that mathematically? Answer: we use the " factorial function "
4! = 4 × 3 × 2 × 1 = 24
7! = 7 × 6 × 5 × 4 × 3 × 2 × 1 = 5,040
1! = 1
Note: it is generally agreed that 0! = 1. It may seem funny that multiplying no numbers together
gets us 1, but it helps simplify a lot of equations.
So, when we want to select all of the billiard balls the permutations are:
16! = 20,922,789,888,000
But when we want to select just 3 we don't want to multiply after 14. How do we do that? There
is a neat trick: we divide by 13!
(16 × 15 × 14 × 13 × 12 × ...) / (13 × 12 × ...) = 16 × 15 × 14 = 3,360
n! / (n - r)!
where n is the number of things to choose from,
and we choose r of them
(No repetition, order matters)
Example: How many ways can first and second place be awarded to 10 people?
10! / (10 - 2)! = 10! / 8! = 3,628,800 / 40,320 = 90
Notation
Instead of writing the whole formula, people use different notations such as these:
Example: P(10,2) = 90
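Both kinds of permutation can be checked quickly. The helper name `permutations_count` is an illustrative choice; `math.perm` is the standard-library equivalent in Python 3.8 and later.

```python
import math

# Repetition allowed: n choices every time, so n ** r
lock_codes = 10 ** 3

# No repetition: n! / (n - r)!
def permutations_count(n, r):
    return math.factorial(n) // math.factorial(n - r)

print(lock_codes)                  # 1000
print(permutations_count(16, 3))   # 3360
print(permutations_count(10, 2))   # 90
print(math.perm(10, 2))            # 90
```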
Combinations
There are also two types of combinations (remember the order does not matter now):
Actually, these are the hardest to explain, so we will come back to this later.
This is how lotteries work. The numbers are drawn one at a time, and if we have the lucky
numbers (no matter what order) we win!
Going back to our pool ball example, let's say we just want to know which 3 pool balls are
chosen, not the order.
But many of those are the same to us now, because we don't care what order!
For example, let us say balls 1, 2 and 3 are chosen. These are the possibilities:
Order does matter: 1 2 3, 1 3 2, 2 1 3, 2 3 1, 3 1 2, 3 2 1
Order doesn't matter: 1 2 3
3! = 3 × 2 × 1 = 6
So we adjust our permutations formula to reduce it by how many ways the objects could be in
order (because we aren't interested in their order any more):
That formula is so important it is often just written in big parentheses like this:
Notation
n! / (r!(n - r)!)
Example
So, our pool ball example (now without order) is:
(16 × 15 × 14) / (3 × 2 × 1) = 3360 / 6 = 560
In other words choosing 3 balls out of 16, or choosing 13 balls out of 16 have the same number
of combinations.
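The symmetry is easy to confirm. The helper name `combinations_count` is illustrative; `math.comb` is the standard-library equivalent in Python 3.8 and later.

```python
import math

def combinations_count(n, r):
    """n! / (r! (n - r)!): choose r of n, no repetition, order ignored."""
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

print(combinations_count(16, 3))    # 560
print(combinations_count(16, 13))   # 560, the same by symmetry
print(math.comb(16, 3))             # 560
```

Choosing which 3 balls to take is the same decision as choosing which 13 to leave behind.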
Pascal's Triangle
We can also use Pascal's Triangle to find the values. Go down to row "n" (the top row is 0), and
then along "r" places and the value there is our answer. Here is an extract showing row 16:
1 14 91 364 ...
Let us say there are five flavors of ice cream: banana, chocolate, lemon, strawberry and
vanilla.
We can have three scoops. How many variations will there be?
Let's use letters for the flavors: {b, c, l, s, v}. Example selections include
(And just to be clear: There are n=5 things to choose from, and we choose r=3 of them.
Order does not matter, and we can repeat!)
Now, I can't describe directly to you how to calculate this, but I can show you a special
technique that lets you work it out.
Think about the ice cream being in boxes, we could say "move past the first box, then
take 3 scoops, then move along 3 more boxes to the end" and we will have 3 scoops
of chocolate!
So it is like we are ordering a robot to get our ice cream, but it doesn't change anything, we still
get what we want.
We can write this down as (arrow means move, circle means scoop).
OK, so instead of worrying about different flavors, we have a simpler question: "how many
different ways can we arrange arrows and circles?"
Notice that there are always 3 circles (3 scoops of ice cream) and 4 arrows (we need to move 4
times to go from the 1st to 5th container).
So (being general here) there are r + (n - 1) positions, and we want to choose r of them to have
circles.
This is like saying "we have r + (n - 1) pool balls and want to choose r of them". In other words
it is now like the pool balls question, but with slightly changed numbers. And we can write it like
this:
(r + n - 1)! / (r!(n - 1)!)
Interestingly, we can look at the arrows instead of the circles, and say "we have r +
(n - 1) positions and want to choose (n - 1) of them to have arrows", and the answer is the
same:
(3 + 5 - 1)! / (3!(5 - 1)!) = 7! / (3! × 4!) = 5040 / (6 × 24) = 35
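The count of 35 can be verified by listing every selection. A minimal sketch, using the five flavor letters from the example and the standard-library `itertools.combinations_with_replacement`:

```python
import math
from itertools import combinations_with_replacement

flavors = "bclsv"   # banana, chocolate, lemon, strawberry, vanilla
scoops = 3

# List every selection where order is ignored and repeats are allowed
selections = list(combinations_with_replacement(flavors, scoops))

n, r = len(flavors), scoops
by_formula = math.comb(r + n - 1, r)   # choose r of the r + (n - 1) positions

print(len(selections))   # 35
print(by_formula)        # 35
```

The list includes repeats such as three scoops of chocolate, ("c", "c", "c"), which the no-repetition formula would miss.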
But knowing how these formulas work is only half the battle. Figuring out how to interpret a real
world situation can be quite hard.
But at least now you know how to calculate all 4 variations of "Order does/does not matter" and
"Repeats are/are not allowed".
Wrong?
They each have a special name: "False Positive" and "False Negative":
Quality Control: a "false positive" is when a good quality item gets rejected, and a
"false negative" is when a poor quality item gets accepted. (A "positive" result
means there IS a defect.)
Medical screening: low-cost tests given to a large group can give many false
positives (saying you have a disease when you don't), and then ask you to get more
accurate tests.
But many people don't understand the true numbers behind "Yes" or "No", like in this example:
For people that really do have the allergy, the test says "Yes" 80% of the
time
For people that do not have the allergy, the test says "Yes" 10% of the time
("false positive")
Here it is in a table:
Question: If 1% of the population have the allergy, and Hunter's test says "Yes",
what are the chances that Hunter really has the allergy?
There are three good ways to solve this: "Imagine a 1000", "Tree Diagrams" or "Bayes'
Theorem". Use whichever you prefer:
When trying to understand questions like this, just imagine a large group (say 1000) and play
with the numbers:
Of 1000 people, only 10 really have the allergy (1% of 1000 is 10)
The test is 80% right for people who have the allergy, so it will get 8 of those 10
right.
But 990 do not have the allergy, and the test will say "Yes" to 10% of them,
which is 99 people it says "Yes" to wrongly (false positive)
So out of 1000 people the test says "Yes" to (8+99) = 107 people
As a table:
                People   Test says "Yes"   Test says "No"
Have allergy        10                 8                2
Don't have it      990                99              891
Total             1000               107              893
So 107 people get a "Yes" but only 8 of those really have the allergy:
8 / 107 = about 7%
So, even though Hunter's test said "Yes", it is still only 7% likely that Hunter has a Cat Allergy.
Why so small? Well, the allergy is so rare that those who actually have it are
greatly outnumbered by those with a false positive.
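The "Imagine a 1000" method translates directly into code. The sketch uses the rates from the example; the variable names are illustrative.

```python
population = 1000
have_allergy = round(population * 0.01)          # 10 people
no_allergy = population - have_allergy            # 990 people

true_positives = round(have_allergy * 0.80)       # 8 correct "Yes" results
false_positives = round(no_allergy * 0.10)        # 99 wrong "Yes" results

all_yes = true_positives + false_positives
print(all_yes)                              # 107
print(round(true_positives / all_yes, 3))   # 0.075
```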
As A Tree
First of all, let's check that all the percentages add up:
And the two "Yes" answers add up to 0.8% + 9.9% = 10.7%, but only 0.8% are correct.
Bayes' Theorem
P(A|B) = P(A)P(B|A) / [P(A)P(B|A) + P(not A)P(B|not A)]
A computer virus spreads around the world, all reporting to a master computer.
The good guys capture the master computer and find that a million computers have been
infected (but don't know which ones).
No one can use the internet until their computer passes the "virus-free" test. The test is 99%
accurate (pretty good, right?) But 1% of the time it says you have the virus when you don't (a
"false positive").
Of 1 million with the virus 99% of them get correctly banned = about 1 million
But false positives are 999 million x 1% = about 10 million
So a total of 11 million get banned, but only 1 out of those 11 actually have the virus.
So if you get banned there is only a 9% chance you actually have the virus!
Conclusion
When dealing with false positives and false negatives (or other tricky probability questions) we
can use these methods:
Bayes' Theorem
Bayes can do magic!
An internet search for "movie automatic shoe laces" brings up "Back to the Future"
Has the search engine watched the movie? No, but it knows from lots of other searches what
people are probably looking for.
Bayes' Theorem is a way of finding a probability when we know certain other probabilities.
P(A|B) = P(A) P(B|A) / P(B)
It tells us how often A happens given that B happens, written P(A|B), when we know how often
B happens given that A happens, written P(B|A) , and how likely A and B are on their own.
When P(Fire) means how often there is fire, and P(Smoke) means how often we see smoke,
then:
So the formula kind of tells us "forwards" when we know "backwards" (or vice versa)
Example: If dangerous fires are rare (1%) but smoke is fairly common (10%)
due to factories, and 90% of dangerous fires make smoke then:
P(Fire|Smoke) = P(Fire) P(Smoke|Fire) / P(Smoke) = (1% × 90%) / 10% = 9%
We will use Rain to mean rain during the day, and Cloud to mean cloudy morning.
P(Rain|Cloud) = P(Rain) P(Cloud|Rain) / P(Cloud)
Remembering
First think "AB AB AB" then remember to group it like: "AB = A BA / B"
P(A|B) = P(A) P(B|A) / P(B)
For those we have two possible cases for "A", such as Pass/Fail (or Yes/No etc)
Hunter says she is itchy. There is a test for Allergy to Cats, but this test is not always right:
For people that really do have the allergy, the test says "Yes" 80% of the time
For people that do not have the allergy, the test says "Yes" 10% of the time ("false positive")
If 1% of the population have the allergy, and Hunter's test says "Yes", what are the
chances that Hunter really has the allergy?
We want to know the chance of having the allergy when test says "Yes", written P(Allergy|Yes)
Oh no! We don't know what the general chance of the test saying "Yes" is ...
... but we can calculate it by adding up those with, and those without the allergy:
1% have the allergy, and the test says "Yes" to 80% of them
99% do not have the allergy and the test says "Yes" to 10% of them
Which means that about 10.7% of the population will get a "Yes" result.
P(Allergy|Yes) = (1% × 80%) / 10.7% = 7.48%
P(Allergy|Yes) = about 7%
This is the same result we got on False Positives and False Negatives .
In fact we can write a special version of the Bayes' formula just for things like this:
P(A|B) = P(A)P(B|A) / [P(A)P(B|A) + P(not A)P(B|not A)]
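The two-case version of the formula can be wrapped in a small function. This is a sketch with an illustrative function name, tried on the numbers from Hunter's allergy test.

```python
def bayes_two_cases(p_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) when A has just two cases (A and not A)."""
    p_b = p_a * p_b_given_a + (1 - p_a) * p_b_given_not_a   # total chance of B
    return p_a * p_b_given_a / p_b

# Hunter's test: 1% have the allergy, "Yes" 80% of the time with it, 10% without
p_allergy_given_yes = bayes_two_cases(0.01, 0.80, 0.10)
print(round(p_allergy_given_yes, 4))   # 0.0748
```

The bottom line of the formula is just P(B) built from the two cases, so P(B) never has to be known in advance.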
When "A" has 3 or more cases we include them all in the bottom line:
P(A1|B) = P(A1)P(B|A1) / [P(A1)P(B|A1) + P(A2)P(B|A2) + P(A3)P(B|A3) + ...etc]
Example: The Art Competition has entries from three painters: Pam, Pia and
Pablo
P(Pam|First) = P(Pam)P(First|Pam) / [P(Pam)P(First|Pam) + P(Pia)P(First|Pia) + P(Pablo)P(First|Pablo)]
P(Pam|First) = (15/30) × 4% / [(15/30) × 4% + (5/30) × 6% + (10/30) × 3%]
P(Pam|First) = 15 × 4% / [15 × 4% + 5 × 6% + 10 × 3%] = 0.6 / (0.6 + 0.3 + 0.3) = 50%
A good chance!
Pam isn't the most successful artist, but she did put in lots of entries.
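The same calculation works for any number of cases. In this sketch the shares of entries and the winning chances are read off the worked formula above; the dictionary layout is an illustrative choice.

```python
from fractions import Fraction

# (share of entries, chance an entry wins First Prize), read off the worked formula
painters = {
    "Pam":   (Fraction(15, 30), Fraction(4, 100)),
    "Pia":   (Fraction(5, 30),  Fraction(6, 100)),
    "Pablo": (Fraction(10, 30), Fraction(3, 100)),
}

# Bottom line of the formula: add up P(painter) x P(First|painter) over every case
p_first = sum(share * win for share, win in painters.values())

posterior = {name: share * win / p_first
             for name, (share, win) in painters.items()}

print(posterior["Pam"])   # 1/2
```

Pia and Pablo each come out at 1/4, so the three results add to 1 as they should.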
So now you know how search engines can guess what you want: they simply keep track of what
lots of people type in and what websites they eventually click on.
Then using Bayes they figure which ones are probably the best to show first.
Shared Birthdays
This is a great puzzle, and you get to learn a lot about probability along the way ...
There are 30 people in a room ... what is the chance that any two of them celebrate
their birthday on the same day? Assume 365 days in a year.
Some people think "there are 30 people, and 365 days, so 30/365 sounds about right, and
30/365 = 0.08..."
But no!
The probability is much higher. It is actually likely there are people who share a birthday in that
room.
As a tree diagram:
But there are now two cases to consider (called "Conditional Probability"):
If Alex and Billy did match, then Chris has only one number to compare to.
But if Alex and Billy did not match then Chris has two numbers to compare to.
For the top line (Alex and Billy did match) we already have a match (a chance of 1/5).
But for the "Alex and Billy did not match" there is a 2/5 chance of Chris matching (against both
Alex and Billy).
And we can work out the combined chance by multiplying the chances it took to get there:
Following the "No, Yes" path ... there is a 4/5 chance of No, followed by a 2/5 chance of
Yes:
Following the "No, No" path ... there is a 4/5 chance of No, followed by a 3/5 chance of
No:
OK, that is all 4 friends, and the "Yes" chances together make 101/125:
Answer: 101/125
But here is something interesting ... if we follow the "No" path we can skip all the other
calculations and make our life easier:
1 - (24/125) = 101/125
Example: what are the chances that with 6 people any of them celebrate their
Birthday in the same month? (Assume equal months)
The "no match" case for:
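2 people is 11/12
3 people is (11/12) × (10/12)
4 people is (11/12) × (10/12) × (9/12)
5 people is (11/12) × (10/12) × (9/12) × (8/12)
6 people is (11/12) × (10/12) × (9/12) × (8/12) × (7/12)
So the chance of no match for 6 people is (11/12) × (10/12) × (9/12) × (8/12) × (7/12) = 0.22...
Flip that around to get the chance of a match:
1 - 0.22... = 0.78...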
So, there is a 78% chance of any of them celebrating their Birthday in the same month
And now we can try calculating the "Shared Birthday" question we started with:
There are 30 people in a room ... what is the chance that any two of them celebrate
their birthday on the same day? Assume 365 days in a year.
It is just like the previous example! But bigger and more numbers:
(I did that calculation in a spreadsheet, but there are also mathematical shortcuts)
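Instead of a spreadsheet, the "no match" chain can be sketched in a few lines of Python. The function name `chance_of_shared` is made up for illustration, and it assumes every day (or month) is equally likely, just as the examples do.

```python
def chance_of_shared(people, slots):
    """Chance that at least two of `people` land on the same one of `slots`."""
    no_match = 1.0
    for already_taken in range(people):
        # Each new person must avoid every slot already taken
        no_match *= (slots - already_taken) / slots
    return 1 - no_match

print(f"{chance_of_shared(6, 12):.2f}")    # 6 people, 12 months: 0.78
print(f"{chance_of_shared(30, 365):.3f}")  # 30 people, 365 days: about 0.706
```

So with 30 people in the room there is roughly a 70% chance that some two of them share a birthday.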
Footnote: In real life birthdays are not evenly spread out ... more babies are born in Spring. Also
Hospitals prefer to work on weekdays, not weekends, so there are more births early in the week.
And then there are leap years. But you get the idea.
Confidence Intervals
A Confidence Interval is a range of values we are fairly sure our true value lies in.
The 95% Confidence Interval (we show how to calculate it later) is:
175cm ± 6.2cm
This says the true mean of ALL men (if we could measure their heights) is likely to be between
168.8cm and 181.2cm.
The "95%" says that 95% of experiments like we just did will include the true mean, but 5%
won't.
So there is a 1-in-20 chance (5%) that our Confidence Interval does NOT include the true mean.
Number of samples: n = 40
Mean: X = 175
Standard Deviation: s = 20
Step 2: decide what Confidence Interval we want. 90%, 95% and 99% are common choices.
Then find the "Z" value for that Confidence Interval here:
80% 1.282
85% 1.440
90% 1.645
95% 1.960
99% 2.576
99.5% 2.807
99.9% 3.291
X ± Z × s/√n
Where:
X is the mean
Z is the chosen Z-value from the table above
s is the standard deviation
n is the number of samples
And we have:
175 ± 1.960 × 20/√40
Which is:
175cm ± 6.20cm
Another Example
There are hundreds of apples on the trees, so you randomly choose just 30 and get these
results:
Mean: 86
Standard Deviation: 5
Let's calculate:
X ± Z × s/√n
We know:
X is the mean = 86
Z is the Z-value = 1.960 (from the table above for 95%)
s is the standard deviation = 5
n is the number of samples = 30
86 ± 1.960 × 5/√30 = 86 ± 1.79
So the true mean (of all the hundreds of apples) is likely to be between 84.21 and 87.79
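Both examples can be checked with a short Python sketch of the formula. The function name `confidence_interval` and the `Z_VALUES` lookup are made up for illustration, with the Z values copied from the table above.

```python
import math

# Z values copied from the table above (confidence level in percent)
Z_VALUES = {80: 1.282, 85: 1.440, 90: 1.645, 95: 1.960, 99: 2.576}

def confidence_interval(mean, sd, n, level=95):
    """Return (low, high, margin) for mean ± Z × s/√n."""
    margin = Z_VALUES[level] * sd / math.sqrt(n)
    return mean - margin, mean + margin, margin

low, high, margin = confidence_interval(86, 5, 30)  # the apples
print(f"86 ± {margin:.2f}, so {low:.2f} to {high:.2f}")  # 86 ± 1.79, so 84.21 to 87.79
```

Calling `confidence_interval(175, 20, 40)` gives the 6.20 margin from the heights example.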
True Mean
Now imagine we get to pick ALL the apples straight away, and get them ALL measured by the
packing machine (this is a luxury not normally found in statistics!)
Let's lay all the apples on the ground from smallest to largest:
Our result was not exact ... it is random after all ... but the true mean is inside our confidence
interval of 86 ± 1.79 (in other words 84.21 to 87.79)
The true mean will not always be inside the confidence interval, but 95% of the time it will!
95% of all "95% Confidence Intervals" will include the true mean.
Maybe we had this sample, with a mean of 83.5 and a Standard Deviation of 3.5:
Each apple is a green dot,
our samples are marked purple
That does not include the true mean. Expect that to happen 5% of the time for a 95%
confidence interval.
So how do we know if the sample we took is one of the "lucky" 95% or the unlucky 5%? Unless
we get to measure the whole population like above we simply don't know.
Example in Research
Here is a Confidence Interval used in research on extra exercise for older people:
In other words the true benefit (for the wider population of men) has a 95% chance of being
between 0.88 and 0.97
* Note for the curious: "HR" is used in research and means "Hazard Ratio" where lower is better,
so an HR of 0.92 means the subjects were better off, and 1.03 means slightly worse off.
Standard Normal Distribution
It is all based on the idea of the Standard Normal Distribution, where the Z value is the "Z-score"
For example the Z for 95% is 1.960, and here we see the range from -1.96 to +1.96 includes
95% of all values:
Conclusion
The Confidence Interval formula is
X ± Z × s/√n
Where:
X is the mean
Z is the Z-value from the table below
s is the standard deviation
n is the number of samples
80% 1.282
85% 1.440
90% 1.645
95% 1.960
99% 2.576
99.5% 2.807
99.9% 3.291
Chi-Square Test
Groups and Numbers
You research two groups and put them in categories single, married or divorced:
By doing some special calculations (explained later), we come up with a "p" value:
p value is 0.132
Now, p < 0.05 is the usual test for dependence. In this case p is greater than 0.05, so we
believe the variables are independent (ie not linked together).
In other words Men and Women probably do not have a different preference for Beach Holidays
or Cruises.
Imagine that the previous example was in fact two random samples of Men each time:
Men: Men:
Beach 209, Cruise 280 Beach 225, Cruise 248
Is it likely you would get such different results surveying Men each time?
Well the "p" value of 0.132 says that it really could happen every so often.
Surveys are random after all. We expect slightly different results each time, right?
So most people want to see a p value less than 0.05 before they are happy to say the results
show the groups have a different response.
p value is 0.043
In this case p < 0.05, so this result is thought of as being "significant" meaning we think the
variables are not independent.
In other words, because 0.043 < 0.05 we think that Gender is linked to Pet Preference (Men
and Women have different preferences for Cats and Dogs).
Just out of interest, notice that the numbers in our two examples are similar, but the resulting p-
values are very different: 0.132 and 0.043. This shows how sensitive the test is!
Why p<0.05 ?
It is just a choice! Using p<0.05 is common, but we could have chosen p<0.01 to be even
more sure that the groups behave differently, or any value really.
Calculating P-Value
So how do we calculate this p-value? We use the Chi-Square Test!
Chi-Square Test
Note: Chi Sounds like "Hi" but with a K, so say Chi-Square like "Ki square"
And Chi is the Greek letter χ, so we can also write it χ²
This test only works for categorical data (data in categories), such as Gender {Men, Women} or
color {Red, Yellow, Green, Blue} etc, but not numerical data such as height or weight.
The numbers must be large enough. Each entry must be 5 or more. In our example we have
values such as 209, 282, etc, so we are good to go.
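(Tables of observed Cat and Dog counts, without and with totals, not reproduced here)
Multiply each row total by each column total and divide by the overall total:
(Table of expected Cat and Dog counts not reproduced here)
Which is:
(Table of calculated expected values not reproduced here)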
Chi-Square is 4.102
From Chi-Square to p
To get from Chi-Square to p-value is a difficult calculation, so either look it up in a table, or use
the Chi-Square Calculator .
But first you will need a "Degree of Freedom" (DF)
Example: DF = (2 - 1)(2 - 1) = 1 × 1 = 1
Result
p = 0.04283
Done!
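Here is a rough Python sketch of the whole calculation for a 2 by 2 table, tried on the Beach and Cruise numbers from earlier. The function name `chi_square_2x2` is made up for illustration. In place of a table or calculator it uses a shortcut that only works when DF = 1: the p-value is then erfc(√(χ²/2)). For any other DF a proper statistics library would be needed.

```python
import math

def chi_square_2x2(table):
    """Return (chi_square, p_value) for a 2x2 table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected = row total × column total / overall total
            expected = row_totals[i] * col_totals[j] / total
            chi_square += (observed - expected) ** 2 / expected
    # Shortcut valid only for 1 degree of freedom (an assumption of this sketch)
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value

# Beach and Cruise counts for the two groups
chi_square, p = chi_square_2x2([[209, 280], [225, 248]])
print(f"p = {p:.3f}")  # p = 0.132
```

Feeding the Chi-Square of 4.102 into the same shortcut, `math.erfc(math.sqrt(4.102 / 2))`, gives about 0.0428, matching the p = 0.04283 result above.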
Chi-Square Formula
This is the formula for Chi-Square:
χ² = Σ (O - E)² / E
Where O is each Observed value and E is each Expected value.
You can try to place the line "by eye": aim to have a similar number of points above and below
the line and try to get the distance from each point to the line as small as possible.
But for better accuracy let's see how to calculate the line using Least Squares Regression.
The Line
Our aim is to calculate the values m (slope) and b (y-intercept) in the equation of a line :
y = mx + b
Where:
y = how far up
x = how far along
m = Slope or Gradient (how steep the line is)
b = the Y Intercept (where the line crosses the Y axis)
Steps
To find the line of best fit for a group of (x,y) points:
b = (Σy - m Σx) / N
y = mx + b
Example
Let's have an example to see how to do it!
Example: Sam found how many hours of sunshine vs how many ice
creams were sold at the shop from Monday to Friday:
"x" "y"
Hours of Ice Creams
Sunshine Sold
2 4
3 5
5 7
7 10
9 15
Let us find the best m (slope) and b (y-intercept) that suits that data
y = mx + b
x y x² xy
2 4 4 8
3 5 9 15
5 7 25 35
7 10 49 70
9 15 81 135
Σx: 26 Σy: 41 Σx²: 168 Σxy: 263
b = (Σy - m Σx) / N
= (41 - 1.5183 × 26) / 5
= 0.3049...
y = mx + b
y = 1.518x + 0.305
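The whole calculation can be sketched in Python. The slope step is not shown above, so this sketch assumes the standard least squares formula m = (N Σxy - Σx Σy) / (N Σx² - (Σx)²). The function name `least_squares_line` is made up for illustration.

```python
def least_squares_line(points):
    """Return slope m and intercept b of the least squares line."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    sum_xy = sum(x * y for x, y in points)
    # Standard slope formula (assumed here), then b = (Σy - m Σx) / N
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b

# (hours of sunshine, ice creams sold) from Monday to Friday
sunshine_vs_sales = [(2, 4), (3, 5), (5, 7), (7, 10), (9, 15)]
m, b = least_squares_line(sunshine_vs_sales)
print(f"y = {m:.3f}x + {b:.3f}")  # y = 1.518x + 0.305
```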
x y y = 1.518x + 0.305 error
2 4 3.34 -0.66
3 5 4.86 -0.14
5 7 7.89 0.89
7 10 10.93 0.93
9 15 13.97 -1.03
Here are the (x,y) points and the line y = 1.518x + 0.305 on a graph:
Nice fit!
Sam hears the weather forecast which says "we expect 8 hours of sun tomorrow", so he uses the
above equation to estimate that he will sell y = 1.518 × 8 + 0.305 = 12.45 ice creams.
Sam makes fresh waffle cone mixture for 14 ice creams just in case. Yum.
So, when we square each of those errors and add them all up, the total is as small as possible.
You can imagine each data point connected to a straight bar by springs:
Boing!
Outliers
Be careful! Least squares is sensitive to outliers . A strange value will pull the line towards it.
But the formulas (and the steps taken) will be very different!
Random Variables
A Random Variable is a set of possible values from a random experiment.
Let's give them the values Heads=0 and Tails=1 and we have a Random Variable "X":
In short:
X = {0, 1}
Note: We could choose Heads=100 and Tails=150 or other values if we want! It is our choice.
So:
Example: x + 2 = 6
Example: X = {0, 1, 2, 3}
X could be 0, 1, 2, or 3 randomly.
Capital Letters
We use a capital letter, like X or Y, to avoid confusion with the Algebra type of variable.
Sample Space
A Random Variable's set of values is the Sample Space.
Example: Throw a die once
X could be 1, 2, 3, 4, 5 or 6
Probability
We can show the probability of any one value using this style:
X = {1, 2, 3, 4, 5, 6}
In this case they are all equally likely, so the probability of any one is 1/6
P(X = 1) = 1/6
P(X = 2) = 1/6
P(X = 3) = 1/6
P(X = 4) = 1/6
P(X = 5) = 1/6
P(X = 6) = 1/6
But this time the outcomes are NOT all equally likely.
X = "number of Heads"
HHH 3
HHT 2
HTH 2
HTT 1
THH 2
THT 1
TTH 1
TTT 0
Looking at the table we see just 1 case of Three Heads, but 3 cases of Two Heads, 3 cases of
One Head, and 1 case of Zero Heads. So:
P(X = 3) = 1/8
P(X = 2) = 3/8
P(X = 1) = 3/8
P(X = 0) = 1/8
The Random Variable is X = "The sum of the scores on the two dice".
              1st Die
          1   2   3   4   5   6
      1   2   3   4   5   6   7
      2   3   4   5   6   7   8
2nd   3   4   5   6   7   8   9
Die   4   5   6   7   8   9  10
      5   6   7   8   9  10  11
      6   7   8   9  10  11  12
There are 6 × 6 = 36 of them, and the Sample Space = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
Let's count how often each value occurs, and work out the probabilities:
A Range of Values
We could also calculate the probability that a Random Variable takes on a range of values.
Example (continued) What is the probability that the sum of the scores is 5, 6, 7
or 8?
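The counting can be sketched in Python by listing all 36 outcomes and tallying the sums. The variable names are made up for illustration, and exact fractions keep the answers tidy.

```python
from collections import Counter
from fractions import Fraction

# Tally the sum for every (1st die, 2nd die) pair
counts = Counter(first + second for first in range(1, 7) for second in range(1, 7))
probabilities = {total: Fraction(count, 36) for total, count in sorted(counts.items())}

for total, probability in probabilities.items():
    print(f"P(X = {total}) = {probability}")

# A range of values: just add the separate probabilities
p_5_to_8 = sum(probabilities[total] for total in (5, 6, 7, 8))
print(f"P(5 <= X <= 8) = {p_5_to_8}")  # 5/9
```

So the chance of a sum of 5, 6, 7 or 8 is 20/36, which simplifies to 5/9.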
Solving
We can also solve a Random Variable equation.
X is the Random Variable "The sum of the scores on the two dice".
x is a value that X can take.
Continuous
Random Variables can be either Discrete or Continuous :
Summary
A Random Variable is a set of possible values from a random experiment.
In short:
X = {0, 1}
Note: We could choose Heads=100 and Tails=150 or other values if we want! It is our choice.
Continuous
Random Variables can be either Discrete or Continuous :
In our Introduction to Random Variables (please read that first!) we look at many examples of
Discrete Random Variables.
But here we look at the more advanced topic of Continuous Random Variables.
The Uniform Distribution has equal probability for all values of the Random variable
between a and b:
The probability of any value between a and b is p
We also know that p = 1/(b-a), because the total of all probabilities must be 1, so
p × (b-a) = 1
p = 1/(b-a)
We can write:
Example: Old Faithful erupts every 91 minutes. You arrive there at random and
wait for 20 minutes ... what is the probability you will see it erupt?
If you waited the full 91 minutes you would be sure (p=1) to have seen it erupt.
But remember this is a random thing! It might erupt the moment you arrive, or any time in the
91 minutes.
Example (continued):
Let's use the "CDF" of the previous Uniform Distribution to work out the probability:
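As a small sketch, here is that CDF in Python for a Uniform Distribution between a and b. The function name `uniform_cdf` is made up for illustration, and the eruption is assumed to be equally likely at any moment in the 91 minutes after you arrive.

```python
def uniform_cdf(x, a, b):
    """P(X <= x) for a Uniform Distribution between a and b."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

# Chance the eruption falls in your 20 minute wait
p_see_eruption = uniform_cdf(20, 0, 91) - uniform_cdf(0, 0, 91)
print(f"{p_see_eruption:.2f}")  # 0.22
```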
Other Distributions
The general name for any of these is probability density function or "pdf"
Start at the row for 0.4, and read along until 0.45: there is the value
0.1736
Summary
A Random Variable is a variable whose possible values are numerical outcomes of a random
experiment.
Let's give them the values Heads=0 and Tails=1 and we have a Random Variable "X":
So:
1 2 3 4 5 6
When we know the probability p of every value x we can calculate the Expected Value (Mean) of
X:
μ = Σxp
Example continued:
x 1 2 3 4 5 6
μ = Σxp = 0.1+0.2+0.3+0.4+0.5+3 = 4.5
Note: this is a weighted mean : values with higher probability have higher contribution to the
mean.
Variance: Var(X)
Example continued:
x 1 2 3 4 5 6
Standard Deviation:
σ = √Var(X)
Example continued:
x 1 2 3 4 5 6
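Here is a rough Python sketch of the three calculations for this die. The probabilities are an assumption read off the sum 0.1+0.2+0.3+0.4+0.5+3 above (0.1 for each of 1 to 5, and 0.5 for 6), and the variance uses Var(X) = Σx²p - μ². The function name `mean_variance_sd` is made up for illustration.

```python
import math

def mean_variance_sd(values, probabilities):
    """Return (mean, variance, standard deviation) of a discrete Random Variable."""
    mean = sum(x * p for x, p in zip(values, probabilities))
    variance = sum(x * x * p for x, p in zip(values, probabilities)) - mean ** 2
    return mean, variance, math.sqrt(variance)

faces = [1, 2, 3, 4, 5, 6]
chances = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]  # assumed from the sum shown above
mean, variance, sd = mean_variance_sd(faces, chances)
print(f"mean = {mean:.2f}, variance = {variance:.2f}, sd = {sd:.3f}")
# mean = 4.50, variance = 3.25, sd = 1.803
```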
(Note that we run the table downwards instead of along this time.)
You plan to open a new McDougals Fried Chicken, and found these stats for
similar restaurants:
Percent Year's Earnings
30% $0
Using that as probabilities for your new restaurant's profit, what is the Expected Value and
Standard Deviation?
0.3 0 0 0
0.4 50 20 1000
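Σp = 1 Σxp = 25 Σx²p = 3750
μ = Σxp = 25
Var(X) = Σx²p - μ² = 3750 - 25² = 3125
σ = √3125 = 56 (to the nearest whole number)
In thousands of dollars:
μ = $25,000
σ = $56,000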
So you might expect to make $25,000, but with a very wide deviation possible.
Let's try that again, but with a much higher probability for $50,000:
Example (continued):
Now with different probabilities (the $50,000 value has a high probability of 0.7 now):
0.1 0 0 0
0.7 50 35 1750
μ = Σxp = 45
In thousands of dollars:
μ = $45,000
σ = $47,000
And the standard deviation is a little smaller (showing that the values are more central.)
Continuous
Random Variables can be either Discrete or Continuous :
Here we looked only at discrete data, as finding the Mean, Variance and Standard Deviation of
continuous data needs Integration .
Summary
A Random Variable is a variable whose possible values are numerical outcomes of a random
experiment.
Tossing a Coin:
Throwing a Die:
We say the probability of a four is 1/6 (one of the six faces is a four).
And the probability of not four is 5/6 (five of the six faces are not a four)
HHH
HHT
HTH
HTT
THH
THT
TTH
TTT
"Two Heads" could be in any order: "HHT", "THH" and "HTH" all have two Heads (and one Tail).
Each outcome is equally likely, and there are 8 of them. So each has a probability of 1/8
Number of outcomes we want × Probability of each outcome
3 × 1/8 = 3/8
We can write this in terms of a Random Variable , X, = "The number of Heads from 3 tosses of a
coin":
P(X = 3) = 1/8
P(X = 2) = 3/8
P(X = 1) = 3/8
P(X = 0) = 1/8
Making a Formula
Now ... what are the chances of 5 heads in 9 tosses ... to list all 512 outcomes will take a long
time!
n = total number
k = number we want
It is often called "n choose k" and you can read more
about it at Combinations and Permutations .
Note: the "!" means "factorial", for example 4! = 4 × 3 × 2 × 1 = 24
n! / (k!(n-k)!) = 3! / (2!(3-2)!) = (3 × 2 × 1) / (2 × 1 × 1) = 3
(We knew that already, but now we have a formula for it.)
n! / (k!(n-k)!) = 9! / (5!(9-5)!) = (9 × 8 × 7 × 6 × 5 × 4 × 3 × 2 × 1) / (5 × 4 × 3 × 2 × 1 × 4 × 3 × 2 × 1) = 126
And for 9 tosses there are 2^9 = 512 total outcomes, so we get the probability:
Number of outcomes we want × Probability of each outcome
126 × 1/512 = 126/512
P(X=5) = 126/512 = 63/256 = 0.24609375
Bias!
So far the chances of success or failure have been equally likely.
But what if the coins are biased (land more on one side than another) or choices are not 50/50?
Example: You sell sandwiches. 70% of people choose chicken, the rest choose
pork.
This is just like the heads and tails example, but with 70/30 instead of 50/50.
Notice that the probabilities for "two chickens" all work out to be 0.147 , because we are
multiplying two 0.7s and one 0.3 in each case.
Can we get the 0.147 from a formula? What we want is "two 0.7s and one 0.3"
And
p^k (1-p)^(n-k)
Example: (continued)
So we get:
n! / (k!(n-k)!) = 3! / (2!(3-2)!) = (3 × 2 × 1) / (2 × 1 × 1) = 3
And we get:
Number of outcomes we want × Probability of each outcome
3 × 0.147 = 0.441
OK. That was a lot of work for something we knew already, but now we can answer harder
questions.
Example: You say "70% choose chicken, so 7 of the next 10 customers should
choose chicken" ... what are the chances you are right?
p = 0.7
n = 10
k=7
So we get:
p^k (1-p)^(n-k) = 0.7^7 × (1-0.7)^(10-7) = 0.7^7 × 0.3^3 = 0.0022235661
n! / (k!(n-k)!) = 10! / (7!(10-7)!)
= (10 × 9 × 8 × 7 × 6 × 5 × 4 × 3 × 2 × 1) / (7 × 6 × 5 × 4 × 3 × 2 × 1 × 3 × 2 × 1)
= (10 × 9 × 8) / (3 × 2 × 1) = 120
And we get:
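Number of outcomes we want × Probability of each outcome
120 × 0.0022235661 = 0.2668279320
So the probability of 7 out of 10 choosing chicken is only about 27%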
Moral of the story: even though the long-run average is 70%, don't expect 7 out of the next 10.
Putting it Together
Now we know how to calculate how many:
n! / (k!(n-k)!)
And the probability of each:
p^k (1-p)^(n-k)
When multiplied together we get:
P(k out of n) = n! / (k!(n-k)!) × p^k (1-p)^(n-k)
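The formula fits in a couple of lines of Python. This sketch uses `math.comb` (available from Python 3.8) for "n choose k". The function name `binomial_probability` is made up for illustration.

```python
from math import comb

def binomial_probability(k, n, p):
    """P(k out of n) = n!/(k!(n-k)!) × p^k × (1-p)^(n-k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

print(comb(9, 5))                                 # 126 ways to get 5 heads in 9 tosses
print(binomial_probability(5, 9, 0.5))            # 0.24609375
print(f"{binomial_probability(7, 10, 0.7):.4f}")  # 0.2668 (7 of 10 choose chicken)
```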
Important Notes:
Quincunx
Have a play with the Quincunx (then read Quincunx Explained ) to see the Binomial
Distribution in action.
Throw the Die
0 Twos
1 Two
2 Twos
3 Twos
4 Twos
P(k out of n) = n! / (k!(n-k)!) × p^k (1-p)^(n-k)
Summary: "for the 4 throws, there is a 48% chance of no twos, 39% chance of 1 two, 12%
chance of 2 twos, 1.5% chance of 3 twos, and a tiny 0.08% chance of all throws being a two
(but it still could happen!)"
Sports Bikes
Your company makes sports bikes. 90% pass final inspection (and 10% fail and need to be
fixed).
n = 4,
p = P(Pass) = 0.9
Like this:
Summary: "for the 4 next bikes, there is a tiny 0.01% chance of no passes, 0.36% chance of 1
pass, 5% chance of 2 passes, 29% chance of 3 passes, and a whopping 66% chance they all
pass the inspection."
There are (relatively) simple formulas for them. They are a little hard to prove, but they do
work!
Mean: μ = np
μ = 4 × 0.9 = 3.6
Variance: σ² = np(1-p)
σ² = 4 × 0.9 × 0.1 = 0.36
Standard Deviation: σ = √(np(1-p))
σ = √(0.36) = 0.6
Note: we could also calculate them manually, by making a table like this:
0 0.0001 0 0
μ = 3.6
σ = √(0.36) = 0.6
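That manual table can be sketched in Python for the sports bikes (n = 4, p = 0.9). Each row holds the number of passes and its probability, and the mean and variance are then added up from the rows. The variable names are made up for illustration.

```python
from math import comb, sqrt

n, p = 4, 0.9  # 4 bikes, each with a 0.9 chance of passing

# One row per possible number of passes: (k, probability of k passes)
rows = [(k, comb(n, k) * p ** k * (1 - p) ** (n - k)) for k in range(n + 1)]

mean = sum(k * prob for k, prob in rows)                    # Σxp
variance = sum(k * k * prob for k, prob in rows) - mean ** 2  # Σx²p - μ²

for k, prob in rows:
    print(f"{k} passes: {prob:.4f}")
print(f"mean = {mean:.1f}, sd = {sqrt(variance):.1f}")  # mean = 3.6, sd = 0.6
```

The table gives the same answers as the shortcut formulas μ = np and σ = √(np(1-p)).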
Summary
The General Binomial Probability Formula
P(k out of n) = n! / (k!(n-k)!) × p^k (1-p)^(n-k)
Mean value of X: μ = np
Variance of X: σ² = np(1-p)
Standard Deviation of X: σ = √(np(1-p))
Normal Distribution
Data can be "distributed" (spread out) in different ways.
But there are many cases where the data tends to be around a central value with no bias left or
right, and it gets close to a "Normal Distribution" like this:
A Normal Distribution
heights of people
size of things produced by machines
errors in measurements
blood pressure
marks on a test
Quincunx
You can see a normal distribution being created by random chance!
Standard Deviations
The Standard Deviation is a measure of how spread out numbers are (read that page for details
on how to calculate it).
Example: 95% of students at school are between 1.1m and 1.7m tall.
Assuming this data is normally distributed can you calculate the mean and standard deviation?
95% is 2 standard deviations either side of the mean (a total of 4 standard deviations) so:
1 standard deviation = (1.7m - 1.1m) / 4
= 0.6m / 4
= 0.15m
It is good to know the standard deviation, because we can say that any value is:
Standard Scores
The number of standard deviations from the mean is also called the "Standard Score",
"sigma" or "z-score". Get used to those words!
It is also possible to calculate how many standard deviations 1.85 is from the mean
How many standard deviations is that? The standard deviation is 0.15m, so:
We can take any Normal Distribution and convert it to The Standard Normal Distribution.
26, 33, 65, 28, 34, 55, 25, 44, 50, 36, 26, 37, 43, 62, 35, 38, 45, 32, 28, 34
The Mean is 38.8 minutes, and the Standard Deviation is 11.4 minutes (you can copy and
paste the values into the Standard Deviation Calculator if you want).
Standard Score
Original Value Calculation
(z-score)
20, 15, 26, 32, 18, 28, 35, 14, 26, 22, 17
Most students didn't even get 30 out of 60, and most will fail.
The test must have been really hard, so the Prof decides to Standardize all the scores and only
fail people 1 standard deviation below the mean.
The Mean is 23, and the Standard Deviation is 6.6, and these are the Standard Scores:
-0.45, -1.21, 0.45, 1.36, -0.76, 0.76, 1.82, -1.36, 0.45, -0.15, -0.91
Only 2 students will fail (the ones who scored 15 and 14 on the test)
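The Prof's standardizing can be sketched in Python with the `statistics` module. This sketch assumes the Population Standard Deviation (`pstdev`), since that is what gives the 6.6 quoted above. The variable names are made up for illustration.

```python
from statistics import mean, pstdev

scores = [20, 15, 26, 32, 18, 28, 35, 14, 26, 22, 17]

average = mean(scores)
sd = pstdev(scores)  # population standard deviation (assumed)

# Standard Score = (value - mean) / standard deviation
z_scores = [(score - average) / sd for score in scores]
failed = [score for score, z in zip(scores, z_scores) if z < -1]

print(f"mean = {average:.0f}, sd = {sd:.1f}")  # mean = 23, sd = 6.6
print([round(z, 2) for z in z_scores])
print(failed)  # [15, 14]
```

Some of the printed Standard Scores may differ from the list above in the second decimal place, because the list above was worked out with the Standard Deviation already rounded to 6.6.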
It also makes life easier because we only need one table (the Standard Normal Distribution
Table ), rather than doing calculations individually for each value of mean and standard
deviation.
In More Detail
Here is the Standard Normal Distribution with percentages for every half of a standard
deviation, and cumulative percentages:
Example: Your score in a recent test was 0.5 standard deviations above the average, how
many people scored lower than you did?
Between 0 and 0.5 is 19.1%
Less than 0 is 50% (left half of the curve)
In theory 69.1% scored less than you did (but with real data the percentage may be different)
Some values are less than 1000g ... can you fix that?
It is a random thing, so we can't stop bags having less than 1000g, but we can try to reduce
it a lot.
at 3 standard deviations:
From the big bell curve above we see that 0.1% are less. But maybe that is too small.
at 2.5 standard deviations:
Below 3 is 0.1% and between 3 and 2.5 standard deviations is 0.5%, together that is 0.1% +
0.5% = 0.6% (a good choice I think)
So let us adjust the machine to have 1000g at 2.5 standard deviations from the mean.
increase the amount of sugar in each bag (which changes the mean), or
make it more accurate (which reduces the standard deviation)
Or we can keep the same mean (of 1010g), but then we need 2.5 standard deviations to be
equal to 10g:
10g / 2.5 = 4g
Or perhaps we could have some combination of better accuracy and slightly larger average size,
I will leave that up to you!
The Table
You can also use the table below. The table shows the area from 0 to Z.
Instead of one LONG table, we have put the "0.1"s running down, then the "0.01"s running
along. (Example of how to use is below)
Z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
3.0 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
Start at the row for 0.4, and read along until 0.45: there is the value 0.1736
So 17.36% of the population are between 0 and 0.45 Standard Deviations from the
Mean.
Because the curve is symmetrical, the same table can be used for values going either direction,
so a negative 0.45 also has an area of 0.1736
Example: Percent of Population Z Between -1 and 2
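From -1 to 0 is the same as from 0 to +1:
At the row for 1.0, first column 1.00, there is the value 0.3413
From 0 to +2 is:
At the row for 2.0, first column 2.00, there is the value 0.4772
Add the two to get the total between -1 and 2:
0.3413 + 0.4772 = 0.8185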
So 81.85% of the population are between -1 and +2 Standard Deviations from the
Mean.
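The table values can also be worked out directly. Here is a minimal Python sketch that uses the error function `math.erf`, relying on the fact that the area from 0 to Z under the Standard Normal curve equals 0.5 × erf(Z/√2). The function name `area_zero_to_z` is made up for illustration.

```python
import math

def area_zero_to_z(z):
    """Area under the Standard Normal curve between 0 and z."""
    # The curve is symmetrical, so a negative z has the same area as a positive one
    return 0.5 * math.erf(abs(z) / math.sqrt(2))

print(f"{area_zero_to_z(0.45):.4f}")  # 0.1736

# Between -1 and +2: add the two areas either side of 0
between = area_zero_to_z(-1) + area_zero_to_z(2)
print(f"{between:.4f}")
```

The second result can differ from 0.8185 in the last digit, because the table entries are already rounded to four decimal places before being added.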
Skewed Data
Data can be "skewed", meaning it tends to have a long tail on one side or the other:
People sometimes say it is "skewed to the left" (the long tail is on the left hand side)
It is perfectly symmetrical.
Positive Skew
And positive skew is when the long tail is on the positive side of the peak, and some
people say it is "skewed to the right".
As you can see it is positively skewed ... in fact the tail continues way past $100,000
Calculating Skewness
"Skewness" (the amount of skew) can be calculated, for example you could use the SKEW()
function in Excel or OpenOffice Calc.