
1. What is a mean average?
a. Add up all the numbers and divide by the total number of numbers.
2. What is a median average?
a. The midpoint in the series of numbers.
3. Why is the median often more informative than the mean?
a. The mean is often overly affected by outliers, such as the top 1% of US earners, so the median is often more representative.
4. What is a mode average?
a. The value that occurs most often in the data.
5. What do the words population, sample, data, and statistic mean?
a. Population = the entire group we want to understand
b. Sample = a subgroup of that population
c. Data = the information we learn from our sample
d. Statistic = a numerical indication of something about the population.
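The three averages can be checked with Python's standard library; the income figures below are invented to show how a single outlier pulls the mean away from the median:

```python
# Mean, median, and mode of a small made-up income list (in $1,000s).
# The single top earner drags the mean well above the median.
from statistics import mean, median, mode

incomes = [30, 35, 35, 40, 45, 50, 1000]  # one outlier at the top

print(mean(incomes))    # sum / count, pulled up by the outlier
print(median(incomes))  # midpoint of the sorted series: 40
print(mode(incomes))    # most frequent value: 35
```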
6. What does statistical significance mean?
a. It describes a statistic, or set of statistics, as having a 5% or smaller probability of being caused by random variation.
7. Without distorting mathematics, or the meaning of statistical significance, I can prove that there exists a vicious form of discrimination against people whose last names start with a letter in the second half of the alphabet. This is called alphabetism! How can I prove that this form of discrimination exists?
a. With statistical significance, we have a 19 in 20 chance that the results aren't the result of chance and a 1 in 20 chance that the results are just the result of random variation. If I take stats on 20 factors, it's likely that 1 of the 20 will meet the standard for statistical significance, and I can make my argument for the discrimination.
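The alphabetism trap can be simulated: test 20 factors on which two groups genuinely do not differ, and see how often at least one still crosses the 0.05 line purely by chance. A minimal sketch (the 10,000-trial simulation is for illustration):

```python
# Under the null hypothesis, each factor has a 5% chance of a false
# positive. With 20 factors tested, the chance that at least one
# "succeeds" is 1 - 0.95**20, roughly 64%.
import random

random.seed(1)
trials = 10_000
at_least_one = 0
for _ in range(trials):
    hits = sum(1 for _ in range(20) if random.random() < 0.05)
    if hits >= 1:
        at_least_one += 1

print(at_least_one / trials)  # close to 1 - 0.95**20 ≈ 0.64
```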
8. When calculating the standard deviation, why do we take the distances of each data point from the mean and go through what looks like a circular process of first squaring them and second taking the square root?
a. Half of the distances from the mean are negative numbers and need to be expressed as positive. Squaring makes every distance positive, and taking the square root at the end brings the result back into the original units of the data.
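The square-then-root process looks like this step by step (the data values are invented for illustration):

```python
# Squaring makes every distance from the mean non-negative; the final
# square root returns the answer to the original units of the data.
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]
m = sum(data) / len(data)               # mean = 5
squared = [(x - m) ** 2 for x in data]  # all non-negative now
variance = sum(squared) / len(data)     # population variance = 4
sd = math.sqrt(variance)                # back to original units

print(sd)  # 2.0
```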
9. Where does the 0.05, or 5%, standard of statistical significance originate?
a. A classic bell curve has some important properties:
i. 68% of the area under the bell curve will be within 1 standard deviation of the mean.
ii. 95% will be within two standard deviations.
iii. So, only 5% is more than two standard deviations above or below the mean.
b. That 1-in-20 chance is so unusual that it has become the standard for statistical significance.
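The 68/95 property can be verified by sampling from a bell curve; this sketch draws from a normal distribution with mean 0 and standard deviation 1:

```python
# Empirical check of the 68% / 95% rule for the bell curve.
import random

random.seed(0)
n = 100_000
samples = [random.gauss(0, 1) for _ in range(n)]

within_1 = sum(1 for x in samples if abs(x) <= 1) / n
within_2 = sum(1 for x in samples if abs(x) <= 2) / n

print(within_1)  # close to 0.68
print(within_2)  # close to 0.95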

10. How does the usual 0.05 standard of statistical significance cause problems when we try to find the genetic causes of disorders? You can use an example, but explain the basic problem.
a. Since humans have so many genes, when we check for differences between people with and without a certain condition, like autism, the standards of statistical significance aren't quite narrow enough. Random variation, in the long run, will give us a statistically significant association between the presence or absence of a certain gene and autism. If we have our computers check 10,000 genes, about 500 of them will show an association with autism at the 0.05 level of significance by chance alone.
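The expected number of chance "discoveries" follows directly from the arithmetic; this sketch also computes how much that count typically varies (treating the 10,000 tests as independent coin flips with a 5% success rate):

```python
# Expected false positives among 10,000 truly irrelevant genes tested
# at the 0.05 level, plus the binomial standard deviation of that count.
import math

genes = 10_000
alpha = 0.05

expected = genes * alpha                         # 500 false positives
spread = math.sqrt(genes * alpha * (1 - alpha))  # about 21.8

print(expected)
print(round(spread, 1))
```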
11. What is the difference between statistical significance and meaningful results? Use an example to show that you understand this distinction.
a. When researchers use the word "significant," they are referring to statistical significance, which indicates a 1 in 20 or smaller chance of being caused by random variation. However, this word in normal vocabulary means "important," which is not what a researcher means. With the two halves of the alphabet, we might discover from a study of 5,000 people that there was a statistically significant difference of 1/16 of an inch between the two groups. However, that's not necessarily important.
12. Why is it important to have a random sample from the population we are studying?
a. If a sample is not random, it is likely to be biased, that is, unrepresentative of the population as a whole.
13. What are the two conditions which a random sample must meet?
a. Each member of the population must have an equal chance of being selected.
b. Any two groups of equal size must have an equal chance of being selected.
14. What is a convenience sample?
a. A convenience sample is a sample of members of a population who are easily available. This sample is bad because it is often made up of those like us, and lacks the degree of variation found in the group as a whole.
15. What is a cluster sample? (This has nothing to do with random clusters on graphs.)
a. This is a randomly selected group, such as all those living on a randomly chosen block. This sample is bad because it often violates the second condition of a random sample.
16. What is stratified sampling?
a. A technique designed to ensure that representatives of each significant group in the population end up in the final sample. Sometimes a sample ends up over- or underrepresenting a group, and mathematical techniques are needed to ensure each subgroup is properly weighted in the final sample.
17. What are often the two most difficult steps of statistical research?
a. Getting access to the entire population
b. Selecting a true random sample

18. What was the source of the error in the 1936 Literary Digest poll to predict the winner of the presidential election?
a. The survey was done fairly early in the election process, so the people who responded were those with an intense interest. This creates a response bias, as most of those passionate people disliked President Roosevelt. Thus, the prediction was that Roosevelt's opponent would win by a landslide.
19. What is the base rate and why is it important when someone announces a startling piece of information?
a. It's the percentage of the entire population with a certain characteristic, like alcoholism. When someone announces a startling statistic, such as that 7% of Gonzaga students are alcoholic, it may be surprising. However, when we consider that the base rate for Americans is also 7%, we realize that we are more or less average.
20. What are the reasons we need a large sample?
a. Smaller samples will display random data clusters, which will trigger the representativeness heuristic and result in false beliefs.
b. The clustering effect will eventually be eliminated with large amounts of data, but it can persist for a long time.
c. You need a large sample before the nature of the data and the tails of the bell curve are clear, because initially there will be clustering and gaps there. It takes a long time to get a full curve in the tails, since finding data in the tails is a rare event.
d. We also need data over a longer period of time, especially if we are trying to predict rare events.
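The clustering effect in small samples can be seen directly by estimating something as simple as the heads rate of a fair coin. A sketch comparing small and large samples (sample sizes chosen for illustration):

```python
# Small samples produce scattered estimates that look like real
# patterns; large samples settle down near the true value of 0.5.
import random

random.seed(2)

def estimate(n):
    # proportion of heads in n fair-coin flips
    return sum(random.random() < 0.5 for _ in range(n)) / n

small = [estimate(20) for _ in range(5)]
large = [estimate(20_000) for _ in range(5)]

print(small)  # scattered: apparent "clusters" away from 0.5
print(large)  # all very close to 0.5
```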
21. Dr. Wagner announced the deep principle: rare events are rare. What practical implications does this have for estimating the probability of severe disasters like floods and earthquakes?
a. With data from only thirty years of records, our initial predictions are essentially guesses. We can only have confidence in predictions of how often rare events will occur if those predictions are based on very large sets of data. However, we often trust too much in these predictions and, as a result, are convinced, for example, that "the big one" will hit California soon, even though that prediction has been made many times and has failed each time.
22. On the financial disaster of 2008. You do not need to go into details on mortgage-backed securities and credit default swaps. What was the major mistake made by intelligent people working in major financial institutions? And how could those people make that major mistake, i.e., how could they not anticipate the possibility of a decline in the housing market?
a. The cause was the creation of a bubble in real estate prices which collapsed, triggering the recession. People in financial institutions were relying on statistics that only extended from 1980 to 2005, and this was not a large enough data set. People were led to believe that prices would keep going up and mortgages would be a secure investment, when in reality the opposite was true for both.
23. Why do large numbers of data sets (extending from height, to IQ, to parking patterns, to popcorn popping) fit into bell curves?
a. The bell curve is common because if we are measuring something which has many causes operating independently of one another, we get a bell curve. This tends to happen whenever there is a high number of independent causes, regardless of what we're studying.
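The many-independent-causes idea can be demonstrated by construction: build each "measurement" as the sum of 50 small independent effects and check that the totals behave like a bell curve (the 50 effects and sample size are invented for illustration):

```python
# Each total is the sum of 50 independent small effects. Sums of many
# independent causes pile up in the middle, so about 68% of the totals
# should land within one standard deviation of the mean.
import random

random.seed(3)
n = 50_000
totals = [sum(random.uniform(-1, 1) for _ in range(50)) for _ in range(n)]

m = sum(totals) / n
sd = (sum((t - m) ** 2 for t in totals) / n) ** 0.5
within_1 = sum(1 for t in totals if abs(t - m) <= sd) / n

print(within_1)  # close to 0.68, as a bell curve predicts
```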
24. What numerical coefficients of correlation are considered to indicate weak, moderate, and strong correlation?
a. Weak = between -0.3 and +0.3
b. Moderate = between ±0.3 and ±0.5
c. Strong = greater than ±0.5
d. We square the numbers to emphasize the differences.
25. What are the two forms of the gambler's fallacy, and what basic principle do those forms of the gambler's fallacy ignore?
a. One form = the belief that a coin which has come up heads 10 times in a row has a better than even chance of coming up tails the next time, because it has to follow the laws of chance and produce even results.
b. Other form = the belief that a coin that has come up heads 10 times in a row has some special property which gives it a better than even chance of coming up heads the next time around (e.g., luck).
c. The gambler's fallacy ignores the basic principle that these objects (coins, dice) don't have memories, and the results of one trial are not influenced by the previous trials.
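The no-memory principle can be tested directly: flip a fair coin a few million times, find every run of 10 straight heads, and record what came next. A sketch (the 2,000,000-flip count is just large enough to produce plenty of 10-head runs):

```python
# After a run of 10 heads, does the next flip favor tails? No: the
# next-flip heads rate stays near 0.5, because the coin has no memory.
import random

random.seed(4)
flips = [random.random() < 0.5 for _ in range(2_000_000)]  # True = heads

next_after_run = []
streak = 0
for i, heads in enumerate(flips[:-1]):
    streak = streak + 1 if heads else 0
    if streak >= 10:
        next_after_run.append(flips[i + 1])

rate = sum(next_after_run) / len(next_after_run)
print(rate)  # close to 0.5
```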
26. What fallacy is named by the saying "correlation is not causation"?
a. Post hoc ergo propter hoc.
27. At a critical point in calculating the coefficient of correlation, we figure out the standard deviations of every piece of data in terms of the factors we are testing for correlation. Explain why we must do this.
a. Since we are dealing with two different kinds of data and need to compare them, converting both to standard deviation units puts them on a common scale, so we can compare the variations of one to the variations of the other.
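That standardization step can be sketched in code: convert each variable to standard deviation units (distances from its own mean, divided by its own standard deviation), then average the products. The data pairs are invented for illustration:

```python
# Correlation via standardization: r is the mean of the products of
# the two variables' standard-deviation units.
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # e.g. hours studied (invented)
ys = [2.0, 4.0, 5.0, 4.0, 6.0]  # e.g. quiz scores (invented)

def sd_units(values):
    m = sum(values) / len(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / len(values))
    return [(v - m) / sd for v in values]

zx, zy = sd_units(xs), sd_units(ys)
r = sum(a * b for a, b in zip(zx, zy)) / len(xs)

print(round(r, 2))  # 0.85, a strong positive correlation
```

Because both variables are on the same unitless scale, the product of their standardized values is large and positive when the two move together, which is exactly what r measures.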
