Chapter 1
Data and Statistics
Statistics
Data
Data Sources
Descriptive Statistics
Statistical Inference
Computers and Statistical Analysis
Data Mining
Statistics
The term statistics can refer to numerical facts such as
averages, medians, percents, and index numbers that
help us understand a variety of business and economic
situations.
Statistics can also refer to the art and science of
collecting, analyzing, presenting, and interpreting
data.
Applications in
Business and Economics
Accounting
Public accounting firms use statistical sampling
procedures when conducting audits for their clients.
Economics
Economists use statistical information in making
forecasts about the future of the economy or some
aspect of it.
Finance
Applications in
Business and Economics
Marketing
All the data collected in a particular study are referred
to as the data set for the study.
Data Set
Company        Stock Exchange   Annual Sales ($M)   Earn/Share ($)
Dataram        NQ               73.10               0.86
EnergySouth    N                74.00               1.67
Keystone       N                365.70              0.86
LandCare       NQ               111.40              0.33
Psychemedics   N                17.60               0.13
Scales of Measurement
Scales of measurement include:
Nominal
Interval
Ordinal
Ratio
Scales of Measurement
Nominal
Data are labels or names used to identify an
attribute of the element.
Scales of Measurement
Nominal
Example:
Students of a university are classified by the
school in which they are enrolled using a
nonnumeric label such as Business, Humanities,
Education, and so on.
Alternatively, a numeric code could be used for
the school variable (e.g., 1 denotes Business,
2 denotes Humanities, 3 denotes Education, and
so on).
Scales of Measurement
Ordinal
A nonnumeric label or numeric code may be used.
Scales of Measurement
Ordinal
Example:
Students of a university are classified by their
class standing using a nonnumeric label such as
Undergraduate, Post Graduate, or Doctoral.
Alternatively, a numeric code could be used for
the class standing variable (e.g. 1 denotes
Undergraduate, 2 denotes Post Graduate, and so on).
Scales of Measurement
Interval
The data have the properties of ordinal data, and
the interval between observations is expressed in
terms of a fixed unit of measure.
Interval data are always numeric.
Scales of Measurement
Interval
Example:
Aditi has an SAT score of 1205, while Arun
has an SAT score of 1090. Aditi scored 115
points more than Arun.
Scales of Measurement
Ratio
The data have all the properties of interval data,
and the ratio of two values is meaningful.
Variables such as distance, height, weight, and time
use the ratio scale.
This scale must contain a zero value that indicates
that nothing exists for the variable at the zero point.
Scales of Measurement
Ratio
Example:
Ganesh's college record shows 36 credit hours
earned, while Govind's record shows 72 credit
hours earned. Govind has twice as many credit
hours earned as Ganesh.
Categorical Data
Labels or names used to identify an attribute of
each element
Often referred to as qualitative data
Use either the nominal or ordinal scale of
measurement
Can be either numeric or nonnumeric
Appropriate statistical analyses are rather limited
Quantitative Data
Quantitative data indicate how many or how much:
discrete, if measuring how many
continuous, if measuring how much
Quantitative data are always numeric.
Ordinary arithmetic operations are meaningful for
quantitative data.
Scales of Measurement
[Diagram: Data are either categorical or quantitative. Categorical data may be numeric or nonnumeric and use the nominal or ordinal scale. Quantitative data are numeric and use the interval or ratio scale.]
Cross-Sectional Data
Cross-sectional data are collected at the same or
approximately the same point in time.
Example: data detailing the number of building
permits issued in February 2010 in each of the
wards of Coimbatore
Data Sources
Existing Sources
Employee records: name, address, social security number
Production records
Inventory records
Sales records
Credit records
Customer profile
Cost of Acquisition
Data Errors
Descriptive Statistics
Most of the statistical information in newspapers,
magazines, company reports, and other publications
consists of data that are summarized and presented
in a form that is easy to understand.
78   69   74   97   82
93   72   62   88   98
57   89   68   68   101
75   66   97   83   79
52   75   105  68   105
99   79   77   71   79
80   75   65   69   69
97   72   80   67   62
62   76   109  74   73
Tabular Summary:
Frequency and Percent Frequency
Parts Cost ($)   Frequency   Percent Frequency
50-59            2           4     (2/50)100
60-69            13          26
70-79            16          32
80-89            7           14
90-99            7           14
100-109          5           10
Total            50          100
[Histogram: frequency (vertical axis, 2 to 16) by parts cost ($) class (horizontal axis)]
Statistical Inference
Population: the set of all elements of interest in a
particular study
Sample: a subset of the population
Statistical inference: the process of using data obtained
from a sample to make estimates and test hypotheses
about the characteristics of a population
Census: collecting data for the entire population
Sample survey: collecting data for a sample
2. A sample of 50 engine tune-ups is examined.
Data Warehousing
Organizations obtain large amounts of data on a
daily basis by means of magnetic card readers, bar
code scanners, point of sale terminals, and touch
screen monitors.
Wal-Mart captures data on 20-30 million transactions
per day.
Visa processes 6,800 payment transactions per second.
Capturing, storing, and maintaining the data, referred
to as data warehousing, is a significant undertaking.
Data Mining
Analysis of the data in the warehouse might aid in
decisions that will lead to new strategies and higher
profits for the organization.
Using a combination of procedures from statistics,
mathematics, and computer science, analysts mine
the data to convert it into useful information.
The most effective data mining systems use automated
procedures to discover relationships in the data and
predict future outcomes, prompted by only general,
even vague, queries by the user.
A significant investment in time and money is
required as well.
Descriptive Statistics:
Tabular and Graphical Presentations
Frequency Distribution
Relative Frequency Distribution
Percent Frequency Distribution
Bar Chart
Pie Chart
23
Frequency Distribution
A frequency distribution is a tabular summary of
data showing the frequency (or number) of items
in each of several non-overlapping classes.
The objective is to provide insights about the data
that cannot be quickly obtained by looking only at
the original data.
Frequency Distribution
Example: Marada Inn
Guests staying at Marada Inn were asked to rate the
quality of their accommodations as being excellent,
above average, average, below average, or poor. The
ratings provided by a sample of 20 guests are:
Below Average
Above Average
Above Average
Average
Above Average
Average
Above Average
Average
Above Average
Below Average
Poor
Excellent
Above Average
Average
Above Average
Above Average
Below Average
Poor
Above Average
Average
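The 20 ratings above can be tallied into frequency, relative frequency, and percent frequency distributions. The following is a minimal sketch; the function name `frequency_table` and the worst-to-best class order are choices made here for illustration, not something the slides prescribe.

```python
from collections import Counter

# Ratings from the sample of 20 Marada Inn guests, in the order listed.
ratings = [
    "Below Average", "Above Average", "Above Average", "Average",
    "Above Average", "Average", "Above Average", "Average",
    "Above Average", "Below Average", "Poor", "Excellent",
    "Above Average", "Average", "Above Average", "Above Average",
    "Below Average", "Poor", "Above Average", "Average",
]

# Class order from worst to best (an illustrative choice).
classes = ["Poor", "Below Average", "Average", "Above Average", "Excellent"]


def frequency_table(values, classes):
    """Return (class, frequency, relative frequency, percent frequency) rows."""
    counts = Counter(values)
    n = len(values)
    return [(c, counts[c], counts[c] / n, 100 * counts[c] / n) for c in classes]


table = frequency_table(ratings, classes)
for label, freq, rel, pct in table:
    print(f"{label:<14} {freq:>3} {rel:>5.2f} {pct:>4.0f}")
```

Each relative frequency is the class frequency divided by n = 20, and each percent frequency is the relative frequency multiplied by 100.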
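Frequency Distribution
Rating           Frequency   Relative Frequency   Percent Frequency
Poor             2           .10                  10     .10(100) = 10
Below Average    3           .15                  15
Average          5           .25                  25
Above Average    9           .45                  45
Excellent        1           .05                  5      1/20 = .05
Total            20          1.00                 100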
Bar Chart
A bar chart is a graphical device for depicting
qualitative data.
On one axis (usually the horizontal axis), we specify
the labels that are used for each of the classes.
A frequency, relative frequency, or percent frequency
scale can be used for the other axis (usually the
vertical axis).
Using a bar of fixed width drawn above each class
label, we extend the height appropriately.
The bars are separated to emphasize the fact that each
class is a separate category.
Bar Chart
[Bar chart: Marada Inn Quality Ratings; frequency (vertical axis, 1 to 10) by rating (horizontal axis)]
Pareto Diagram
In quality control, bar charts are used to identify the
most important causes of problems.
When the bars are arranged in descending order of
height from left to right (with the most frequently
occurring cause appearing first) the bar chart is
called a Pareto diagram.
This diagram is named for its founder, Vilfredo
Pareto, an Italian economist.
Pie Chart
The pie chart is a commonly used graphical device
for presenting relative frequency and percent
frequency distributions for categorical data.
Pie Chart
[Pie chart: Marada Inn Quality Ratings; Excellent 5%, Above Average 45%, Average 25%, Below Average 15%, Poor 10%]
Frequency Distribution
Relative Frequency and
Percent Frequency Distributions
Dot Plot
Histogram
Cumulative Distributions
Ogive
Frequency Distribution
The three steps necessary to define the classes for a
frequency distribution with quantitative data are:
1. Determine the number of non-overlapping classes.
2. Determine the width of each class.
3. Determine the class limits.
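The three steps above can be sketched for the 50 tune-up parts costs. This is an illustration under stated assumptions: six classes and a first lower class limit of 50 are taken from the frequency table in these slides, while the approximate-width rule (range divided by the number of classes, rounded up) is a common guideline assumed here rather than quoted from the slides. The helper `class_frequencies` is hypothetical.

```python
import math

# Parts cost ($) for the sample of 50 tune-ups.
parts_cost = [
    91, 71, 104, 85, 62, 78, 69, 74, 97, 82,
    93, 72, 62, 88, 98, 57, 89, 68, 68, 101,
    75, 66, 97, 83, 79, 52, 75, 105, 68, 105,
    99, 79, 77, 71, 79, 80, 75, 65, 69, 69,
    97, 72, 80, 67, 62, 62, 76, 109, 74, 73,
]


def class_frequencies(data, num_classes, lower_limit, width):
    """Count items in equal-width classes starting at lower_limit (integer data)."""
    counts = [0] * num_classes
    for x in data:
        counts[(x - lower_limit) // width] += 1
    return counts


# Step 1: number of classes (assumed: 6).
num_classes = 6
# Step 2: approximate class width = (largest - smallest) / number of classes.
approx_width = (max(parts_cost) - min(parts_cost)) / num_classes
width = math.ceil(approx_width)  # round up to a convenient width
# Step 3: class limits 50-59, 60-69, ..., 100-109 (assumed lower limit of 50).
freqs = class_frequencies(parts_cost, num_classes, lower_limit=50, width=width)

for k, f in enumerate(freqs):
    low = 50 + k * width
    print(f"{low}-{low + width - 1}: {f}")
```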
Frequency Distribution
Ultimately, the analyst uses judgment to
determine the combination of the number of
classes and class width that provides the best
frequency distribution for summarizing the data.
Dot Plot
[Dot plot: tune-up parts cost ($); horizontal axis from 50 to 110]
Histogram
Another common graphical presentation of
quantitative data is a histogram.
The variable of interest is placed on the horizontal
axis.
A rectangle is drawn above each class interval with
its height corresponding to the interval's frequency,
relative frequency, or percent frequency.
Unlike a bar graph, a histogram has no natural
separation between rectangles of adjacent classes.
Histogram
[Histogram: frequency (vertical axis, 2 to 16) by parts cost class, followed by four relative frequency histograms (vertical axis, 0 to .35)]
Cumulative Distributions
Cumulative frequency distribution shows the
number of items with values less than or equal to the
upper limit of each class.
Cumulative relative frequency distribution shows
the proportion of items with values less than or
equal to the upper limit of each class.
Cumulative percent frequency distribution shows
the percentage of items with values less than or
equal to the upper limit of each class.
39
Cumulative Distributions
The last entry in a cumulative frequency distribution
always equals the total number of observations.
The last entry in a cumulative relative frequency
distribution always equals 1.00.
The last entry in a cumulative percent frequency
distribution always equals 100.
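The three cumulative distributions can be derived from the class frequencies by running totals. A minimal sketch, using the parts cost frequencies from the frequency table in these slides; the variable names are illustrative.

```python
from itertools import accumulate

# Upper class limits and class frequencies for the parts cost data.
upper_limits = [59, 69, 79, 89, 99, 109]
frequencies = [2, 13, 16, 7, 7, 5]

n = sum(frequencies)
cum_freq = list(accumulate(frequencies))   # running total of frequencies
cum_rel = [c / n for c in cum_freq]        # proportion at or below each limit
cum_pct = [100 * c / n for c in cum_freq]  # percentage at or below each limit

for limit, cf, cr, cp in zip(upper_limits, cum_freq, cum_rel, cum_pct):
    print(f"<= {limit:>3}  {cf:>3}  {cr:.2f}  {cp:.0f}")
```

The last entries come out as n, 1.00, and 100, as the slide states.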
Cumulative Distributions
Cost ($)   Cumulative Frequency   Cumulative Relative Frequency   Cumulative Percent Frequency
<= 59      2                      .04                             4
<= 69      15  (2 + 13)           .30  (15/50)                    30  (.30(100))
<= 79      31                     .62                             62
<= 89      38                     .76                             76
<= 99      45                     .90                             90
<= 109     50                     1.00                            100
Ogive
[Ogive: cumulative percent frequency (vertical axis, 20 to 100) by parts cost ($) (horizontal axis, 50 to 110); the point (89.5, 76) is marked]
Descriptive Statistics:
Tabular and Graphical Presentations
Stem-and-Leaf Display
A stem-and-leaf display shows both the rank order
and shape of the distribution of the data.
It is similar to a histogram on its side, but it has the
advantage of showing the actual data values.
The first digits of each data item are arranged to the
left of a vertical line.
To the right of the vertical line we record the last
digit for each item in rank order.
Each line in the display is referred to as a stem.
Each digit on a stem is a leaf.
Stem-and-Leaf Display
Example: Hudson Auto Repair
Sample of Parts Cost ($) for 50 Tune-ups
91   71   104  85   62
78   69   74   97   82
93   72   62   88   98
57   89   68   68   101
75   66   97   83   79
52   75   105  68   105
99   79   77   71   79
80   75   65   69   69
97   72   80   67   62
62   76   109  74   73
Stem-and-Leaf Display
Example: Hudson Auto Repair
 5 | 2 7
 6 | 2 2 2 2 5 6 7 8 8 8 9 9 9
 7 | 1 1 2 2 3 4 4 5 5 5 6 7 8 9 9 9
 8 | 0 0 2 3 5 8 9
 9 | 1 3 7 7 7 8 9
10 | 1 4 5 5 9
(Each line, such as "5 | 2 7", is a stem; each digit to the right of the line is a leaf.)
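The display for the Hudson Auto Repair data can be reproduced with a short routine. This is a sketch under assumptions: the function name `stem_and_leaf` and its integer `leaf_unit` parameter are introduced here for illustration, and the routine handles integer data only (digits below the leaf unit are dropped, not rounded).

```python
from collections import defaultdict

# Parts cost ($) for the sample of 50 tune-ups.
parts_cost = [
    91, 71, 104, 85, 62, 78, 69, 74, 97, 82,
    93, 72, 62, 88, 98, 57, 89, 68, 68, 101,
    75, 66, 97, 83, 79, 52, 75, 105, 68, 105,
    99, 79, 77, 71, 79, 80, 75, 65, 69, 69,
    97, 72, 80, 67, 62, 62, 76, 109, 74, 73,
]


def stem_and_leaf(data, leaf_unit=1):
    """Map each stem to its leaves in rank order (integer data, integer leaf unit)."""
    stems = defaultdict(list)
    for x in sorted(data):
        scaled = x // leaf_unit  # drop digits below the leaf unit
        stems[scaled // 10].append(scaled % 10)
    return dict(stems)


display = stem_and_leaf(parts_cost)
for stem, leaves in display.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```

Sorting the data first is what puts the leaves on each stem in rank order.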
Stretched Stem-and-Leaf Display
If we believe the original stem-and-leaf display has
condensed the data too much, we can stretch the
display vertically by using two stems for each
leading digit(s).
Whenever a stem value is stated twice, the first value
corresponds to leaf values of 0-4, and the second
value corresponds to leaf values of 5-9.
Stretched Stem-and-Leaf Display
Example: Hudson Auto Repair
 5 | 2
 5 | 7
 6 | 2 2 2 2
 6 | 5 6 7 8 8 8 9 9 9
 7 | 1 1 2 2 3 4 4
 7 | 5 5 5 6 7 8 9 9 9
 8 | 0 0 2 3
 8 | 5 8 9
 9 | 1 3
 9 | 7 7 7 8 9
10 | 1 4
10 | 5 5 9
Stem-and-Leaf Display
Leaf Units
A single digit is used to define each leaf.
11.7   9.4   9.1   10.2   11.0   8.8
a stem-and-leaf display of these data will be
Leaf Unit = 0.1
 8 | 6 8
 9 | 1 4
10 | 2
11 | 0 7
1717   1974   1791   1682   1910   1838
a stem-and-leaf display of these data will be
Leaf Unit = 10
16 | 8
17 | 1 9
18 | 0 3
19 | 1 7
The 82 in 1682 is rounded down to 80 and is
represented as an 8.
Crosstabulation
A crosstabulation is a tabular summary of data for
two variables.
Crosstabulation can be used when:
one variable is qualitative and the other is
quantitative,
both variables are qualitative, or
both variables are quantitative.
The left and top margin labels define the classes for
the two variables.
Crosstabulation
Example: Finger Lakes Homes
The number of Finger Lakes homes sold for each
style and price for the past two years is shown below.
Price range is the quantitative variable; home style is the categorical variable.
                     Home Style
Price Range    Colonial   Log   Split   A-Frame   Total
< $200,000     18         6     19      12        55
> $200,000     12         14    16      3         45
Total          30         20    35      15        100
Crosstabulation
Example: Finger Lakes Homes
Insights Gained from Preceding Crosstabulation
49
Crosstabulation
Example: Finger Lakes Homes
                     Home Style
Price Range    Colonial   Log   Split   A-Frame   Total
< $200,000     18         6     19      12        55
> $200,000     12         14    16      3         45
Total          30         20    35      15        100
The Total column is the frequency distribution for the price range variable.
Colonial
C l i l
< $200,000
> $200,000
32.73
26.67
Home Style
Log
L
Split
S lit A
A--Frame
F
10.91 34.55
31.11 35.56
T t l
Total
21.82
6.67
100
100
Colonial
C l i l
< $200,000
> $200,000
60.00
40.00
Total
100
Home Style
Log
L
Split
S lit A
A--Frame
F
30.00 54.29
70.00 45.71
100
100
80.00
20.00
100
51
52
53
54
55
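The row and column percentages for the Finger Lakes Homes crosstabulation follow directly from the counts. A minimal sketch using plain dictionaries; the variable names are illustrative, and the cell counts are those of the crosstabulation in these slides.

```python
# Homes sold by price range (rows) and home style (columns).
counts = {
    "< $200,000": {"Colonial": 18, "Log": 6, "Split": 19, "A-Frame": 12},
    "> $200,000": {"Colonial": 12, "Log": 14, "Split": 16, "A-Frame": 3},
}
styles = ["Colonial", "Log", "Split", "A-Frame"]

# Marginal totals: one per price range and one per home style.
row_totals = {price: sum(row.values()) for price, row in counts.items()}
col_totals = {s: sum(counts[p][s] for p in counts) for s in styles}

# Row percentage: cell count as a percentage of its row total.
row_pct = {
    p: {s: 100 * counts[p][s] / row_totals[p] for s in styles} for p in counts
}
# Column percentage: cell count as a percentage of its column total.
col_pct = {
    p: {s: 100 * counts[p][s] / col_totals[s] for s in styles} for p in counts
}

for p in counts:
    print(p, [round(row_pct[p][s], 2) for s in styles])
```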
Scatter Diagram
A Positive Relationship
Scatter Diagram
A Negative Relationship
56
Scatter Diagram
No Apparent Relationship
Scatter Diagram
Example: Panthers Football Team
The Panthers football team is interested in
investigating the relationship, if any, between
interceptions made and points scored.
x = Number of Interceptions   y = Number of Points Scored
1                             14
3                             24
2                             18
1                             17
3                             30
Scatter Diagram
[Scatter diagram: number of points scored (y, 0 to 35) versus number of interceptions (x, 0 to 3)]
Quantitative Data
Graphical
Methods
Tabular
Methods
Bar Chart
Pie Chart
Frequency
Distribution
Rel. Freq. Dist.
% Freq. Dist.
Cum. Freq. Dist.
Cum. Rel. Freq.
Distribution
Cum. % Freq.
Distribution
Crosstabulation
Graphical
Methods
Dot Plot
Histogram
Ogive
Stem-and-Leaf Display
Scatter
Diagram
59
Measures of Location
Measures of Variability
Measures of Location
Mean
Median
Mode
Percentiles
Quartiles
60
Mean
Sample mean: $\bar{x} = \frac{\sum x_i}{n}$, where n is the number of observations in the sample.
Population mean: $\mu = \frac{\sum x_i}{N}$, where N is the number of observations in the population.
Sample Mean
Example: Apartment Rents
Seventy efficiency apartments were randomly
sampled in a small college town. The monthly rent
prices for these apartments are listed below.
445  440  465  450  600  570  510  615  440  450
470  485  515  575  430  440  525  490  580  450
490  590  525  450  472  470  445  435  435  425
450  475  490  525  600  600  445  460  475  500
535  435  460  575  435  500  549  475  445  600
445  460  480  500  550  435  440  450  465  570
500  480  430  615  450  480  465  480  510  440
Sample Mean
Example: Apartment Rents
$\bar{x} = \frac{\sum x_i}{n} = \frac{34{,}356}{70} = 490.80$
Median
The median of a data set is the value in the middle
when the data items are arranged in ascending order.
Whenever a data set has extreme values, the median
is the preferred measure of central location.
The median is the measure of location most often
reported for annual income and property value data.
A few extremely large incomes or property values
can inflate the mean.
63
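Median
For an odd number of observations:
26  18  27  12  14  27  19   (7 observations)
12  14  18  19  26  27  27   (in ascending order)
The median is the middle value: Median = 19
Median
For an even number of observations:
26  18  27  12  14  27  30  19   (8 observations)
12  14  18  19  26  27  27  30   (in ascending order)
The median is the average of the middle two values: Median = (19 + 26)/2 = 22.5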
Median
Example: Apartment Rents
Averaging the 35th and 36th data values:
Median = (475 + 475)/2 = 475
425  440  450  465  480  510  575
430  440  450  470  485  515  575
430  440  450  470  490  525  580
435  445  450  472  490  525  590
435  445  450  475  490  525  600
435  445  460  475  500  535  600
435  445  460  475  500  549  600
435  445  460  480  500  550  600
440  450  465  480  500  570  615
440  450  465  480  510  570  615
Trimmed Mean
Another measure, sometimes used when extreme
values are present, is the trimmed mean.
It is obtained by deleting a percentage of the
smallest and largest values from a data set and then
computing the mean of the remaining values.
For example, the 5% trimmed mean is obtained by
removing the smallest 5% and the largest 5% of the
data values and then computing the mean of the
remaining values.
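A trimmed mean can be sketched as follows for the apartment rents. One detail is an assumption made here: when the trimming percentage does not give a whole number of observations (5% of 70 is 3.5), the count is truncated, so 3 values are removed from each end. The slides do not say how to handle this case, and other conventions exist. The function name `trimmed_mean` is illustrative.

```python
# Monthly rents ($) for the 70 sampled apartments.
rents = [
    445, 440, 465, 450, 600, 570, 510, 615, 440, 450,
    470, 485, 515, 575, 430, 440, 525, 490, 580, 450,
    490, 590, 525, 450, 472, 470, 445, 435, 435, 425,
    450, 475, 490, 525, 600, 600, 445, 460, 475, 500,
    535, 435, 460, 575, 435, 500, 549, 475, 445, 600,
    445, 460, 480, 500, 550, 435, 440, 450, 465, 570,
    500, 480, 430, 615, 450, 480, 465, 480, 510, 440,
]


def trimmed_mean(data, percent):
    """Mean after removing `percent` percent of the values from each end."""
    values = sorted(data)
    k = int(len(values) * percent / 100)  # assumption: truncate a fractional count
    kept = values[k:len(values) - k]
    return sum(kept) / len(kept)


print(trimmed_mean(rents, 5))
```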
Mode
The mode of a data set is the value that occurs with
greatest frequency.
The greatest frequency can occur at two or more
different values.
If the data have exactly two modes, the data are
bimodal.
If the data have more than two modes, the data are
multimodal.
Caution: If the data are bimodal or multimodal,
Excel's MODE function will incorrectly identify a
single mode.
Mode
Example: Apartment Rents
450 occurred most frequently (7 times)
Mode = 450
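The mean, median, and mode of the apartment rents can be checked with Python's standard `statistics` module. This is a sketch, not the method used in the slides (which refer to Excel); `statistics.multimode` is used because it returns every mode, so bimodal or multimodal data are not reduced to a single value.

```python
import statistics

# Monthly rents ($) for the 70 sampled apartments.
rents = [
    445, 440, 465, 450, 600, 570, 510, 615, 440, 450,
    470, 485, 515, 575, 430, 440, 525, 490, 580, 450,
    490, 590, 525, 450, 472, 470, 445, 435, 435, 425,
    450, 475, 490, 525, 600, 600, 445, 460, 475, 500,
    535, 435, 460, 575, 435, 500, 549, 475, 445, 600,
    445, 460, 480, 500, 550, 435, 440, 450, 465, 570,
    500, 480, 430, 615, 450, 480, 465, 480, 510, 440,
]

mean = statistics.mean(rents)        # sum of the values divided by n
median = statistics.median(rents)    # average of the 35th and 36th sorted values
modes = statistics.multimode(rents)  # list of all most frequent values

print(mean, median, modes)
```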
Percentiles
A percentile provides information about how the
data are spread over the interval from the smallest
value to the largest value.
Admission test scores for colleges and universities
are frequently reported in terms of percentiles.
Percentiles
Arrange the data in ascending order.
Compute index i, the position of the pth percentile.
i = (p/100)n
If i is not an integer, round up. The pth percentile
is the value in the ith position.
If i is an integer, the pth percentile is the average
of the values in positions i and i + 1.
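The procedure above can be sketched directly in code. The function name `percentile` is illustrative, and the sketch assumes p is a whole number strictly between 0 and 100 so that integer arithmetic can decide whether i = (p/100)n is an integer without floating-point error. Library routines often use interpolation rules that differ from this procedure, so their results may not match.

```python
# Monthly rents ($) for the 70 sampled apartments.
rents = [
    445, 440, 465, 450, 600, 570, 510, 615, 440, 450,
    470, 485, 515, 575, 430, 440, 525, 490, 580, 450,
    490, 590, 525, 450, 472, 470, 445, 435, 435, 425,
    450, 475, 490, 525, 600, 600, 445, 460, 475, 500,
    535, 435, 460, 575, 435, 500, 549, 475, 445, 600,
    445, 460, 480, 500, 550, 435, 440, 450, 465, 570,
    500, 480, 430, 615, 450, 480, 465, 480, 510, 440,
]


def percentile(data, p):
    """pth percentile using the index rule i = (p/100)n (p an integer, 0 < p < 100)."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    values = sorted(data)  # step 1: ascending order
    whole, remainder = divmod(p * len(values), 100)  # i = whole + remainder/100
    if remainder != 0:
        # i is not an integer: round up to position whole + 1 (1-based).
        return values[whole]
    # i is an integer: average the values in positions i and i + 1 (1-based).
    return (values[whole - 1] + values[whole]) / 2


print(percentile(rents, 80), percentile(rents, 75), percentile(rents, 25))
```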
80th Percentile
Example: Apartment Rents
i = (p/100)n = (80/100)70 = 56
Averaging the 56th and 57th data values:
80th Percentile = (535 + 549)/2 = 542
80th Percentile
Example: Apartment Rents
At least 80% of the
items take on a
value of 542 or less.
56/70 = .8 or 80%
14/70 = .2 or 20%
Quartiles
Quartiles are specific percentiles.
First Quartile = 25th Percentile
Second Quartile = 50th Percentile = Median
Third Quartile = 75th Percentile
Third Quartile
Example: Apartment Rents
Third quartile = 75th percentile
i = (p/100)n = (75/100)70 = 52.5, rounded up to 53
Third quartile = 525
Measures of Variability
It is often desirable to consider measures of variability
(dispersion), as well as measures of location.
For example, in choosing supplier A or supplier B we
might consider not only the average delivery time for
each, but also the variability in delivery time for each.
Measures of Variability
Range
Interquartile Range
Variance
Standard Deviation
Coefficient of Variation
70
Range
The range of a data set is the difference between the
largest and smallest data values.
It is the simplest measure of variability.
It is very sensitive to the smallest and largest data
values.
Range
Example: Apartment Rents
Range = largest value - smallest value
Range = 615 - 425 = 190
Interquartile Range
The interquartile range of a data set is the difference
between the third quartile and the first quartile.
It is the range for the middle 50% of the data.
It overcomes the sensitivity to extreme data values.
Interquartile Range
Example: Apartment Rents
3rd Quartile (Q3) = 525
1st Quartile (Q1) = 445
Interquartile Range = Q3 - Q1 = 525 - 445 = 80
Variance
The variance is a measure of variability that utilizes
all the data.
It is based on the difference between the value of
each observation ($x_i$) and the mean ($\bar{x}$ for a sample,
$\mu$ for a population).
The variance is useful in comparing the variability
of two or more variables.
Variance
The variance is the average of the squared
differences between each data value and the mean.
The variance is computed as follows:
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ for a sample
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$ for a population
Standard Deviation
The standard deviation of a data set is the positive
square root of the variance.
It is measured in the same units as the data
data,, making
it more easily interpreted than the variance.
Standard Deviation
The standard deviation is computed as follows:
$s = \sqrt{s^2}$ for a sample
$\sigma = \sqrt{\sigma^2}$ for a population
Coefficient of Variation
The coefficient of variation indicates how large the
standard deviation is in relation to the mean.
The coefficient of variation is computed as follows:
$\left( \frac{s}{\bar{x}} \times 100 \right)\%$ for a sample
$\left( \frac{\sigma}{\mu} \times 100 \right)\%$ for a population
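Variance
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 2{,}996.16$
Standard Deviation
$s = \sqrt{s^2} = \sqrt{2996.16} = 54.74$
Coefficient of Variation
$\left( \frac{s}{\bar{x}} \times 100 \right)\% = \left( \frac{54.74}{490.80} \times 100 \right)\% = 11.15\%$
The standard deviation is about 11% of the mean.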
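The sample variance, standard deviation, and coefficient of variation for the apartment rents can be recomputed from the definitions. A minimal sketch; the variable names are illustrative, and the sample formulas (divisor n - 1) are the ones applied.

```python
import math

# Monthly rents ($) for the 70 sampled apartments.
rents = [
    445, 440, 465, 450, 600, 570, 510, 615, 440, 450,
    470, 485, 515, 575, 430, 440, 525, 490, 580, 450,
    490, 590, 525, 450, 472, 470, 445, 435, 435, 425,
    450, 475, 490, 525, 600, 600, 445, 460, 475, 500,
    535, 435, 460, 575, 435, 500, 549, 475, 445, 600,
    445, 460, 480, 500, 550, 435, 440, 450, 465, 570,
    500, 480, 430, 615, 450, 480, 465, 480, 510, 440,
]

n = len(rents)
mean = sum(rents) / n
# Sample variance: sum of squared deviations from the mean, divided by n - 1.
variance = sum((x - mean) ** 2 for x in rents) / (n - 1)
# Standard deviation: positive square root of the variance.
std_dev = math.sqrt(variance)
# Coefficient of variation: standard deviation as a percentage of the mean.
cv = std_dev / mean * 100

print(round(variance, 2), round(std_dev, 2), round(cv, 2))
```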
Distribution Shape
z-Scores
Chebyshev's Theorem
Empirical Rule
Detecting Outliers
Skewness
$\text{Skewness} = \frac{n}{(n - 1)(n - 2)} \sum \left( \frac{x_i - \bar{x}}{s} \right)^3$
[Figures: relative frequency histograms labeled Skewness = 0, Skewness = .31, Skewness = .31, and Skewness = .92]
z-Scores
The z-score is often called the standardized value.
It denotes the number of standard deviations a data
value xi is from the mean.
$z_i = \frac{x_i - \bar{x}}{s}$
z-Scores
An observation's z-score is a measure of the relative
location of the observation in a data set.
A data value less than the sample mean will have a
z-score less than zero.
A data value greater than the sample mean will have
a z-score greater than zero.
A data value equal to the sample mean will have a
z-score of zero.
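z-scores for the apartment rents can be computed with the standard library. This sketch uses `statistics.stdev` (the sample standard deviation) and flags values with |z| > 3 as possible outliers, following the criterion given later in these slides; the variable names are illustrative.

```python
import statistics

# Monthly rents ($) for the 70 sampled apartments.
rents = [
    445, 440, 465, 450, 600, 570, 510, 615, 440, 450,
    470, 485, 515, 575, 430, 440, 525, 490, 580, 450,
    490, 590, 525, 450, 472, 470, 445, 435, 435, 425,
    450, 475, 490, 525, 600, 600, 445, 460, 475, 500,
    535, 435, 460, 575, 435, 500, 549, 475, 445, 600,
    445, 460, 480, 500, 550, 435, 440, 450, 465, 570,
    500, 480, 430, 615, 450, 480, 465, 480, 510, 440,
]

mean = statistics.mean(rents)
s = statistics.stdev(rents)  # sample standard deviation (divisor n - 1)

# z-score: number of standard deviations each value lies from the mean.
z_scores = [(x - mean) / s for x in rents]

# Values more than 3 standard deviations from the mean are possible outliers.
possible_outliers = [x for x, z in zip(rents, z_scores) if abs(z) > 3]

print(round(min(z_scores), 2), round(max(z_scores), 2), possible_outliers)
```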
z-Scores
Example: Apartment Rents
z-Score of Smallest Value (425)
$z = \frac{x_i - \bar{x}}{s} = \frac{425 - 490.80}{54.74} = -1.20$
Chebyshev's Theorem
At least $(1 - 1/z^2)$ of the items in any data set will be
within z standard deviations of the mean, where z is
any value greater than 1.
Chebyshev's theorem requires z > 1, but z need not
be an integer.
Chebyshev's Theorem
At least 75% of the data values must be
within z = 2 standard deviations of the mean.
At least 89% of the data values must be
within z = 3 standard deviations of the mean.
At least 94% of the data values must be
within z = 4 standard deviations of the mean.
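The percentages above come from evaluating 1 - 1/z² at z = 2, 3, and 4. A small sketch; the function name `chebyshev_bound` is illustrative.

```python
def chebyshev_bound(z):
    """Minimum proportion of items within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z ** 2


for z in (1.5, 2, 3, 4):
    print(f"z = {z}: at least {chebyshev_bound(z):.2%}")
```

The bound holds for any data set, whatever the shape of its distribution, which is why it is weaker than the empirical rule for bell-shaped data.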
82
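Chebyshev's Theorem
Example: Apartment Rents
Let z = 1.5 with $\bar{x}$ = 490.80 and s = 54.74
At least $(1 - 1/(1.5)^2)$ = 1 - 0.44 = 0.56 or 56%
of the rent values must be between
$\bar{x} - z(s)$ = 490.80 - 1.5(54.74) = 409 and
$\bar{x} + z(s)$ = 490.80 + 1.5(54.74) = 573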
Empirical Rule
When the data are believed to approximate a
bell-shaped distribution:
The empirical rule can be used to determine the
percentage of data values that must be within a
specified number of standard deviations of the
mean.
The empirical rule is based on the normal
distribution (will be discussed later).
Empirical Rule
For data having a bell-shaped distribution:
68.26% of the values of a normal random variable
are within +/- 1 standard deviation of its mean.
95.44% of the values of a normal random variable
are within +/- 2 standard deviations of its mean.
99.72% of the values of a normal random variable
are within +/- 3 standard deviations of its mean.
Empirical Rule
[Figure: bell-shaped curve with 68.26% of values within 1 standard deviation of the mean, 95.44% within 2, and 99.72% within 3]
Detecting Outliers
An outlier is an unusually small or unusually large
value in a data set.
A data value with a z-score less than -3 or greater
than +3 might be considered an outlier.
It might be:
an incorrectly recorded data value
a data value that was incorrectly included in the
data set
a correctly recorded data value that belongs in
the data set
Detecting Outliers
Example: Apartment Rents
The most extreme z-scores are -1.20 and 2.27.
Using |z| > 3 as the criterion for an outlier, there
are no outliers in this data set.
Standardized Values for Apartment Rents
-1.20  -0.93  -0.75  -0.47  -0.20   0.35   1.54
-1.11  -0.93  -0.75  -0.38  -0.11   0.44   1.54
-1.11  -0.93  -0.75  -0.38  -0.01   0.62   1.63
-1.02  -0.84  -0.75  -0.34  -0.01   0.62   1.81
-1.02  -0.84  -0.75  -0.29  -0.01   0.62   1.99
-1.02  -0.84  -0.56  -0.29   0.17   0.81   1.99
-1.02  -0.84  -0.56  -0.29   0.17   1.06   1.99
-1.02  -0.84  -0.56  -0.20   0.17   1.08   1.99
-0.93  -0.75  -0.47  -0.20   0.17   1.45   2.27
-0.93  -0.75  -0.47  -0.20   0.35   1.45   2.27
Five-Number Summary
1. Smallest Value
2. First Quartile
3. Median
4. Third Quartile
5. Largest Value
Five-Number Summary
Example: Apartment Rents
Lowest Value = 425
First Quartile = 445
Median = 475
Third Quartile = 525
Largest Value = 615
Box Plot
A box plot is a graphical summary of data that is
based on a five-number summary.
A key to the development of a box plot is the
computation of the median and the quartiles Q1 and
Q3.
Box plots provide another way to identify outliers.
87
Box Plot
Example: Apartment Rents
A box is drawn with its ends located at the first and
third quartiles.
A vertical line is drawn in the box at the location of
the median (second quartile).
[Box plot: axis from 400 to 625, with Q1 = 445, Q2 = 475, and Q3 = 525 marked]
Box Plot
Limits are located (not drawn) using the interquartile
range (IQR).
Data outside these limits are considered outliers.
The location of each outlier is shown with the
symbol *.
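The limits can be sketched in code for the apartment rents. The multiplier of 1.5 times the IQR is the conventional choice and is an assumption here, since the formula for the limits does not appear in these slides. The quartiles Q1 = 445 and Q3 = 525 are the values computed earlier for this data set.

```python
# Monthly rents ($) for the 70 sampled apartments.
rents = [
    445, 440, 465, 450, 600, 570, 510, 615, 440, 450,
    470, 485, 515, 575, 430, 440, 525, 490, 580, 450,
    490, 590, 525, 450, 472, 470, 445, 435, 435, 425,
    450, 475, 490, 525, 600, 600, 445, 460, 475, 500,
    535, 435, 460, 575, 435, 500, 549, 475, 445, 600,
    445, 460, 480, 500, 550, 435, 440, 450, 465, 570,
    500, 480, 430, 615, 450, 480, 465, 480, 510, 440,
]

q1, q3 = 445, 525  # first and third quartiles from the slides
iqr = q3 - q1      # interquartile range

# Assumed convention: limits lie 1.5 IQRs beyond the quartiles.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

inside = [x for x in rents if lower_limit <= x <= upper_limit]
outliers = [x for x in rents if x < lower_limit or x > upper_limit]

print(lower_limit, upper_limit, min(inside), max(inside), outliers)
```

Under this assumption no rent falls outside the limits, so the whiskers would run from the smallest value (425) to the largest value (615).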
Box Plot
Example: Apartment Rents
[Box plot: axis from 400 to 625; whiskers extend to the smallest value inside the limits (425) and the largest value inside the limits (615)]
Measures of Association
Between Two Variables
Thus far we have examined numerical methods used
to summarize the data for one variable at a time.
Often a manager or decision maker is interested in
the relationship between two variables.
Two descriptive measures of the relationship
between two variables are covariance and correlation
coefficient.
Covariance
The covariance is a measure of the linear association
between two variables.
Positive values indicate a positive relationship.
Negative values indicate a negative relationship.
Covariance
The covariance is computed as follows:
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$ for samples
$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$ for populations
Correlation Coefficient
Correlation is a measure of linear association and not
necessarily causation.
Just because two variables are highly correlated, it
does not mean that one variable is the cause of the
other.
Correlation Coefficient
The correlation coefficient is computed as follows:
$r_{xy} = \frac{s_{xy}}{s_x s_y}$ for samples
$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$ for populations
Correlation Coefficient
The coefficient can take on values between -1 and +1.
Values near -1 indicate a strong negative linear
relationship.
Values near +1 indicate a strong positive linear
relationship.
The closer the correlation is to zero, the weaker the
relationship.
x        y     (x - x̄)   (y - ȳ)   (x - x̄)(y - ȳ)
277.6    69    10.65     -1.0      -10.65
259.5    71    -7.45      1.0       -7.45
269.1    70     2.15      0          0
267.0    70     0.05      0          0
255.6    71   -11.35      1.0      -11.35
272.9    69     5.95     -1.0       -5.95
Total                              -35.40
Sample Covariance
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{-35.40}{6 - 1} = -7.08$
Sample Correlation Coefficient
$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{-7.08}{(8.2192)(.8944)} = -.9631$
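The sample covariance and correlation coefficient for the six (x, y) pairs can be recomputed from the sample formulas. A minimal sketch; the variable names are illustrative, and the data are the x and y columns of the table in these slides.

```python
import math

# The six paired observations from the table.
x = [277.6, 259.5, 269.1, 267.0, 255.6, 272.9]
y = [69, 71, 70, 70, 71, 69]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample covariance: sum of cross-products of deviations, divided by n - 1.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Sample standard deviations of x and y.
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# Sample correlation coefficient.
r_xy = s_xy / (s_x * s_y)

print(round(s_xy, 2), round(s_x, 4), round(s_y, 4), round(r_xy, 4))
```

A value of r close to -1 indicates a strong negative linear relationship, which by itself says nothing about causation.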
Weighted Mean
Mean for Grouped Data
Variance for Grouped Data
Standard Deviation for Grouped Data
Weighted Mean
When the mean is computed by giving each data
value a weight that reflects its importance, it is
referred to as a weighted mean.
In the computation of a grade point average (GPA),
the weights are the number of credit hours earned for
each grade.
When data values vary in importance, the analyst
must choose the weight that best reflects the
importance of each value.
95
Weighted Mean
$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$
where:
$x_i$ = value of observation i
$w_i$ = weight for observation i
Grouped Data
The weighted mean computation can be used to
obtain approximations of the mean, variance, and
standard deviation for the grouped data.
To compute the weighted mean, we treat the
midpoint of each class as though it were the mean
of all items in the class.
We compute a weighted mean of the class midpoints
using the class frequencies as weights.
Similarly, in computing the variance and standard
deviation, the class frequencies are used as weights.
Sample Data
$\bar{x} = \frac{\sum f_i M_i}{n}$
Population Data
$\mu = \frac{\sum f_i M_i}{N}$
where:
$f_i$ = frequency of class i
$M_i$ = midpoint of class i
f_i     M_i      f_i M_i
8       429.5    3436.0
17      449.5    7641.5
12      469.5    5634.0
8       489.5    3916.0
7       509.5    3566.5
4       529.5    2118.0
2       549.5    1099.0
4       569.5    2278.0
2       589.5    1179.0
6       609.5    3657.0
70               34525.0
$\bar{x} = \frac{34{,}525}{70} = 493.21$
This approximation differs by $2.41 from
the actual sample mean of $490.80.
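Variance for Grouped Data
Sample Data
\[ s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n - 1} \]
Population Data
\[ \sigma^2 = \frac{\sum f_i (M_i - \mu)^2}{N} \]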
f_i    M_i     M_i - x̄    (M_i - x̄)^2    f_i(M_i - x̄)^2
8     429.5    -63.7       4058.96        32471.71
17    449.5    -43.7       1910.56        32479.59
12    469.5    -23.7        562.16         6745.97
8     489.5     -3.7         13.76          110.11
7     509.5     16.3        265.36         1857.55
4     529.5     36.3       1316.96         5267.86
2     549.5     56.3       3168.56         6337.13
4     569.5     76.3       5820.16        23280.66
2     589.5     96.3       9271.76        18543.53
6     609.5    116.3      13523.36        81140.18
70                                       208234.29
Sample Variance
s^2 = 208,234.29/(70 - 1) = 3,017.89
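The grouped-data mean and sample variance above can be reproduced directly from the class frequencies and midpoints. The following is a minimal sketch; grouped_mean and grouped_sample_variance are illustrative names, and the script uses the unrounded mean, so intermediate values may differ from the table in the last decimal place.

```python
# Class frequencies and midpoints from the table above.
frequencies = [8, 17, 12, 8, 7, 4, 2, 4, 2, 6]
midpoints = [429.5, 449.5, 469.5, 489.5, 509.5,
             529.5, 549.5, 569.5, 589.5, 609.5]

def grouped_mean(f, m):
    """Weighted mean of class midpoints, with frequencies as weights."""
    return sum(fi * mi for fi, mi in zip(f, m)) / sum(f)

def grouped_sample_variance(f, m):
    """Frequency-weighted squared deviations divided by n - 1."""
    n = sum(f)
    x_bar = grouped_mean(f, m)
    return sum(fi * (mi - x_bar) ** 2 for fi, mi in zip(f, m)) / (n - 1)

x_bar = grouped_mean(frequencies, midpoints)
s2 = grouped_sample_variance(frequencies, midpoints)
print(round(x_bar, 2), round(s2, 2))  # 493.21 3017.89
```

For population data, the same sums would be divided by N instead of n - 1.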
Chapter 4
Introduction to Probability
Experiments, Counting Rules,
and Assigning Probabilities
Events and Their Probability
Some Basic Relationships
of Probability
Conditional Probability
Bayes' Theorem
Uncertainties
Managers often base their decisions on an analysis
of uncertainties such as the following:
What are the chances that sales will decrease
if we increase prices?
What is the likelihood a new assembly method
will increase productivity?
What are the odds that a new investment will
be profitable?
Probability
Probability is a numerical measure of the likelihood
that an event will occur.
Probability values are always assigned on a scale
from 0 to 1.
A probability near zero indicates an event is quite
unlikely to occur.
A probability near one indicates an event is almost
certain to occur.
Probability scale:
0: The event is very unlikely to occur.
.5: The occurrence of the event is just as likely as it is unlikely.
1: The event is almost certain to occur.
Statistical Experiments
In statistics, the notion of an experiment differs
somewhat from that of an experiment in the
physical sciences.
In statistical experiments, probability determines
outcomes.
Even though the experiment is repeated in exactly
the same way, an entirely different outcome may
occur.
For this reason, statistical experiments are sometimes
called random experiments.
Experiment              Experiment Outcomes
Toss a coin             Head, tail
Inspect a part          Defective, nondefective
Conduct a sales call    Purchase, no purchase
Roll a die              1, 2, 3, 4, 5, 6
Play a football game    Win, lose, tie
n1 = 4
n2 = 2
n1n2 = (4)(2) = 8
Tree Diagram
Example: Bradley Investments
Markley Oil    Collins Mining    Experimental
(Stage 1)      (Stage 2)         Outcomes
Gain 10        Gain 8            (10, 8)      Gain $18,000
Gain 10        Lose 2            (10, -2)     Gain $8,000
Gain 5         Gain 8            (5, 8)       Gain $13,000
Gain 5         Lose 2            (5, -2)      Gain $3,000
Even           Gain 8            (0, 8)       Gain $8,000
Even           Lose 2            (0, -2)      Lose $2,000
Lose 20        Gain 8            (-20, 8)     Lose $12,000
Lose 20        Lose 2            (-20, -2)    Lose $22,000
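The two-stage experiment can also be enumerated programmatically, which makes the counting rule n1n2 = (4)(2) = 8 concrete. This is a sketch only; the variable names are illustrative, and it assumes (as the dollar figures in the example suggest) that each stage's gain or loss is stated in thousands of dollars.

```python
from itertools import product

# Stage outcomes in $1000s (assumed unit): 4 for Markley Oil, 2 for Collins Mining.
markley_oil = [10, 5, 0, -20]
collins_mining = [8, -2]

# Every experimental outcome is one (stage 1, stage 2) pair.
outcomes = list(product(markley_oil, collins_mining))

# Net gain (positive) or loss (negative) in dollars for each outcome.
net_dollars = {outcome: sum(outcome) * 1000 for outcome in outcomes}

print(len(outcomes))  # 8
for outcome, net in net_dollars.items():
    label = "Gain" if net > 0 else "Lose"
    print(outcome, label, f"${abs(net):,}")
```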
Counting Rule for Combinations
\[ C_n^N = \binom{N}{n} = \frac{N!}{n!(N - n)!} \]
Counting Rule for Permutations
\[ P_n^N = n! \binom{N}{n} = \frac{N!}{(N - n)!} \]
where:
N! = N(N - 1)(N - 2) . . . (2)(1)
n! = n(n - 1)(n - 2) . . . (2)(1)
0! = 1
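The factorial definitions above are what the standard counting rules for combinations and permutations are built from. The sketch below uses Python's math.comb and math.perm (available from Python 3.8) and cross-checks them against the factorial form; N = 5 and n = 2 are illustrative values, not taken from the slides.

```python
import math

N, n = 5, 2  # illustrative values: choose 2 objects from 5

# Number of ways to select n objects from N when order does not matter.
combinations = math.comb(N, n)

# Number of ways to select n objects from N when order matters.
permutations = math.perm(N, n)

# Cross-check against the factorial definitions.
by_factorials_comb = math.factorial(N) // (math.factorial(n) * math.factorial(N - n))
by_factorials_perm = math.factorial(N) // math.factorial(N - n)

print(combinations, permutations)  # 10 20
```

Each combination of n objects can be ordered in n! ways, which is why the permutation count is n! times the combination count.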
Assigning Probabilities
Basic Requirements for Assigning Probabilities
1. The probability assigned to each experimental
outcome must be between 0 and 1, inclusively.
0 ≤ P(Ei) ≤ 1 for all i
where:
Ei is the ith experimental outcome
and P(Ei) is its probability
Assigning Probabilities
Basic Requirements for Assigning Probabilities
2. The sum of the probabilities for all experimental
outcomes must equal 1.
P(E1) + P(E2) + . . . + P(En) = 1
where:
n is the number of experimental outcomes
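The two basic requirements can be expressed as a small validity check. This is a sketch under stated assumptions: is_valid_assignment is a hypothetical helper, and the tolerance is there only to absorb floating-point rounding when the probabilities are summed.

```python
import math

def is_valid_assignment(probabilities, tol=1e-9):
    """Check both basic requirements for assigning probabilities.

    1. Each probability lies between 0 and 1, inclusive.
    2. The probabilities sum to 1 (within a small tolerance).
    """
    in_range = all(0.0 <= p <= 1.0 for p in probabilities)
    sums_to_one = math.isclose(sum(probabilities), 1.0, abs_tol=tol)
    return in_range and sums_to_one

print(is_valid_assignment([1 / 6] * 6))   # True: a fair six-sided die
print(is_valid_assignment([0.5, 0.6]))    # False: sums to 1.1
print(is_valid_assignment([1.2, -0.2]))   # False: values outside [0, 1]
```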
Assigning Probabilities
Classical Method
Assigning probabilities based on the assumption
of equally likely outcomes
Relative Frequency Method
Assigning probabilities based on experimentation
or historical data
Subjective Method
Assigning probabilities based on judgment
Classical Method
Example: Rolling a Die
If an experiment has n possible outcomes, the
classical method would assign a probability of 1/n
to each outcome.
Experiment: Rolling a die
Sample Space: S = {1, 2, 3, 4, 5, 6}
Probabilities: Each sample point has a
1/6 chance of occurring
Relative Frequency Method
Number of Days    Probability
4                 .10 (= 4/40)
6                 .15
18                .45
10                .25
2                 .05
40                1.00
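The relative frequency calculation in the table is each count divided by the total number of days. A minimal sketch, using the day counts from the table (the variable names are illustrative):

```python
# Number of days on which each outcome was observed (from the table above).
days = [4, 6, 18, 10, 2]

total_days = sum(days)

# Relative frequency method: probability = observed count / total count.
probabilities = [count / total_days for count in days]

print(total_days)                             # 40
print([round(p, 2) for p in probabilities])   # [0.1, 0.15, 0.45, 0.25, 0.05]
```

Because every count is non-negative and the counts are divided by their own total, the resulting probabilities automatically lie between 0 and 1 and sum to 1.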
Subjective Method
When economic conditions and a company's
circumstances change rapidly, it might be
inappropriate to assign probabilities based solely on
historical data.
We can use any data available as well as our
experience and intuition, but ultimately a probability
value should express our degree of belief that the
experimental outcome will occur.
The best probability estimates often are obtained by
combining the estimates from the classical or relative
frequency approach with the subjective estimate.
Subjective Method
Example: Bradley Investments
An analyst made the following probability estimates.
Exper. Outcome    Net Gain or Loss    Probability
(10, 8)           $18,000 Gain        .20
(10, -2)          $8,000 Gain         .08
(5, 8)            $13,000 Gain        .16
(5, -2)           $3,000 Gain         .26
(0, 8)            $8,000 Gain         .10
(0, -2)           $2,000 Loss         .12
(-20, 8)          $12,000 Loss        .02
(-20, -2)         $22,000 Loss        .06
Complement of an Event
The complement of event A is defined to be the event
consisting of all sample points that are not in A.
The complement of A is denoted by A^c.
[Venn diagram: event A and its complement A^c in sample space S]
[Venn diagram: events A and B in sample space S]
[Venn diagram: intersection of A and B in sample space S]
Addition Law
The addition law provides a way to compute the
probability of event A, or B, or both A and B occurring.
The law is written as:
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
Addition Law
Example: Bradley Investments
Event M = Markley Oil Profitable
Event C = Collins Mining Profitable
M ∪ C = Markley Oil Profitable
or Collins Mining Profitable
We know: P(M) = .70, P(C) = .48, P(M ∩ C) = .36
Thus: P(M ∪ C) = P(M) + P(C) - P(M ∩ C)
= .70 + .48 - .36
= .82
(This result is the same as that obtained earlier
using the definition of the probability of an event.)
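Both routes to P(M ∪ C) can be compared in code: summing the probabilities of the sample points in the event, and applying the addition law. The sketch below uses the analyst's probability estimates for the eight Bradley Investments outcomes and assumes that "profitable" means a positive gain for that stock (first coordinate for Markley Oil, second for Collins Mining). The helper name event_probability is illustrative.

```python
# Analyst's probability estimates for each (Markley Oil, Collins Mining) outcome.
probabilities = {
    (10, 8): 0.20, (10, -2): 0.08,
    (5, 8): 0.16, (5, -2): 0.26,
    (0, 8): 0.10, (0, -2): 0.12,
    (-20, 8): 0.02, (-20, -2): 0.06,
}

def event_probability(event):
    """Probability of an event = sum of its sample point probabilities."""
    return sum(probabilities[outcome] for outcome in event)

# Assumption: "profitable" means a strictly positive gain for that stock.
M = {o for o in probabilities if o[0] > 0}   # Markley Oil profitable
C = {o for o in probabilities if o[1] > 0}   # Collins Mining profitable

p_m = event_probability(M)
p_c = event_probability(C)
p_m_and_c = event_probability(M & C)          # intersection

p_union_direct = event_probability(M | C)     # from the definition
p_union_law = p_m + p_c - p_m_and_c           # from the addition law

print(round(p_m, 2), round(p_c, 2), round(p_m_and_c, 2))  # 0.7 0.48 0.36
print(round(p_union_direct, 2), round(p_union_law, 2))    # 0.82 0.82
```

Subtracting P(M ∩ C) removes the sample points that would otherwise be counted twice, once in P(M) and once in P(C).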
[Venn diagram: events A and B in sample space S]
There is no need to include P(A ∩ B) when events A and B
have no sample points in common, because then P(A ∩ B) = 0.
Conditional Probability
The probability of an event given that another event
has occurred is called a conditional probability.
The conditional probability of A given B is denoted
by P(A|B).
A conditional probability is computed as follows:
\[ P(A|B) = \frac{P(A \cap B)}{P(B)} \]
Conditional Probability
Example: Bradley Investments
Event M = Markley Oil Profitable
Event C = Collins Mining Profitable
P(C|M) = Collins Mining Profitable
given Markley Oil Profitable
We know: P(M ∩ C) = .36, P(M) = .70
Thus:
\[ P(C|M) = \frac{P(C \cap M)}{P(M)} = \frac{.36}{.70} = .5143 \]
Multiplication Law
The multiplication law provides a way to compute the
probability of the intersection of two events.
The law is written as:
P(A ∩ B) = P(B)P(A|B)
Multiplication Law
Example: Bradley Investments
Event M = Markley Oil Profitable
Event C = Collins Mining Profitable
M ∩ C = Markley Oil Profitable
and Collins Mining Profitable
We know: P(M) = .70, P(C|M) = .5143
Thus: P(M ∩ C) = P(M)P(C|M)
= (.70)(.5143)
= .36
(This result is the same as that obtained earlier
using the definition of the probability of an event.)
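The conditional probability and the multiplication law are two readings of the same relationship, which a few lines of code make explicit. This is a minimal sketch using the Bradley Investments values P(M ∩ C) = .36 and P(M) = .70; conditional_probability is an illustrative helper name.

```python
def conditional_probability(p_joint, p_condition):
    """P(A|B) = P(A and B) / P(B), defined only when P(B) > 0."""
    if p_condition <= 0:
        raise ValueError("the conditioning event must have positive probability")
    return p_joint / p_condition

p_m = 0.70         # P(M): Markley Oil profitable
p_m_and_c = 0.36   # P(M and C): both profitable

# Conditional probability of C given M.
p_c_given_m = conditional_probability(p_m_and_c, p_m)

# Multiplication law: P(M and C) = P(M) * P(C|M) recovers the joint probability.
p_joint_recovered = p_m * p_c_given_m

print(round(p_c_given_m, 4))        # 0.5143
print(round(p_joint_recovered, 2))  # 0.36
```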
                          Collins Mining
Markley Oil               Profitable (C)    Not Profitable (C^c)    Total
Profitable (M)            .36               .34                     .70
Not Profitable (M^c)      .12               .18                     .30
Total                     .48               .52                     1.00
Joint Probabilities
(appear in the body
of the table)
Marginal Probabilities
(appear in the margins
of the table)
Independent Events
If the probability of event A is not changed by the
existence of event B, we would say that events A
and B are independent.
Two events A and B are independent if:
P(A|B) = P(A)
or
P(B|A) = P(B)
Multiplication Law
for Independent Events
The multiplication law also can be used as a test to see
if two events are independent.
The law is written as:
P(A ∩ B) = P(A)P(B)
Multiplication Law
for Independent Events
Example: Bradley Investments
Event M = Markley Oil Profitable
Event C = Collins Mining Profitable
Are events M and C independent?
Does P(M ∩ C) = P(M)P(C)?
We know: P(M ∩ C) = .36, P(M) = .70, P(C) = .48
But: P(M)P(C) = (.70)(.48) = .34, not .36
Hence: M and C are not independent.
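The independence test can be run straight from the joint probability table: sum the rows and columns to get the marginal probabilities, then compare P(M ∩ C) with P(M)P(C). The sketch below is illustrative only; the dictionary keys ("M", "Mc", "C", "Cc") and the helper are_independent are naming assumptions, with "Mc" and "Cc" standing for the complements.

```python
# Joint probabilities from the body of the table.
joint = {
    ("M", "C"): 0.36, ("M", "Cc"): 0.34,
    ("Mc", "C"): 0.12, ("Mc", "Cc"): 0.18,
}

# Marginal probabilities are row and column totals.
p_m = joint[("M", "C")] + joint[("M", "Cc")]
p_c = joint[("M", "C")] + joint[("Mc", "C")]

def are_independent(p_a, p_b, p_a_and_b, tol=1e-9):
    """Events are independent when P(A and B) equals P(A) * P(B)."""
    return abs(p_a_and_b - p_a * p_b) <= tol

product_of_marginals = p_m * p_c
independent = are_independent(p_m, p_c, joint[("M", "C")])

print(round(p_m, 2), round(p_c, 2))    # 0.7 0.48
print(round(product_of_marginals, 3))  # 0.336
print(independent)                     # False
```

The unrounded product is .336, which the slide reports as .34; either way it differs from P(M ∩ C) = .36, so the events are not independent.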
Bayes' Theorem
Often we begin probability analysis with initial or
prior probabilities.
Then, from a sample, special report, or a product
test we obtain some additional information.
Given this information, we calculate revised or
posterior probabilities.
Bayes' theorem provides the means for revising the
prior probabilities.
Prior Probabilities -> New Information -> Application of Bayes' Theorem -> Posterior Probabilities
Bayes' Theorem
Example: L. S. Clothiers
A proposed shopping center will provide strong
competition for downtown businesses like L. S.
Clothiers. If the shopping center is built, the owner
of L. S. Clothiers feels it would be best to relocate to
the shopping center.
The shopping center cannot be built unless a
zoning change is approved by the town council.
The planning board must first make a
recommendation, for or against the zoning change,
to the council.
Prior Probabilities
Example: L. S. Clothiers
Let:
A1 = town council approves the zoning change
A2 = town council disapproves the change
Using subjective judgment:
P(A1) = .7, P(A2) = .3
New Information
Example: L. S. Clothiers
The planning board has recommended against
the zoning change. Let B denote the event of a
negative recommendation by the planning board.
Given that B has occurred, should L. S. Clothiers
revise the probabilities that the town council will
approve or disapprove the zoning change?
Conditional Probabilities
Example: L. S. Clothiers
Past history with the planning board and the town
council indicates the following:
P(B|A1) = .2
P(B|A2) = .9
Hence:
P(B^c|A1) = .8
P(B^c|A2) = .1
Tree Diagram
Example: L. S. Clothiers
Town Council    Planning Board     Experimental Outcomes
P(A1) = .7      P(B|A1) = .2       P(A1 ∩ B) = .14
                P(B^c|A1) = .8     P(A1 ∩ B^c) = .56
P(A2) = .3      P(B|A2) = .9       P(A2 ∩ B) = .27
                P(B^c|A2) = .1     P(A2 ∩ B^c) = .03
Bayes' Theorem
To find the posterior probability that event Ai will
occur given that event B has occurred, we apply
Bayes' theorem.
\[ P(A_i|B) = \frac{P(A_i)P(B|A_i)}{P(A_1)P(B|A_1) + P(A_2)P(B|A_2) + \cdots + P(A_n)P(B|A_n)} \]
Posterior Probabilities
Example: L. S. Clothiers
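Given the planning board's recommendation not
to approve the zoning change, we revise the prior
probabilities as follows:
\[ P(A_1|B) = \frac{P(A_1)P(B|A_1)}{P(A_1)P(B|A_1) + P(A_2)P(B|A_2)} = \frac{(.7)(.2)}{(.7)(.2) + (.3)(.9)} = .34 \]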
Posterior Probabilities
Example: L. S. Clothiers
The planning board's recommendation is good
news for L. S. Clothiers. The posterior probability of
the town council approving the zoning change is .34
compared to a prior probability of .70.
(1)       (2)              (3)              (4)              (5)
Events    Prior            Conditional      Joint            Posterior
          Probabilities    Probabilities    Probabilities    Probabilities
Ai        P(Ai)            P(B|Ai)          P(Ai ∩ B)        P(Ai|B)
A1        .7               .2               .14              .3415
A2        .3               .9               .27              .6585
          1.0                               P(B) = .41       1.0000
Column (4) is column (2) times column (3); for example, .7 x .2 = .14.
P(B) is the sum of the joint probabilities in column (4).
Column (5) is column (4) divided by P(B); for example, .14/.41 = .3415.
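The tabular calculation translates into a few lines of code: multiply each prior by its conditional probability, sum the joint probabilities to get P(B), and divide each joint probability by P(B). The following is a minimal sketch for the L. S. Clothiers values; posterior_probabilities is an illustrative function name, and it assumes the events Ai are mutually exclusive and cover the whole sample space.

```python
def posterior_probabilities(priors, conditionals):
    """Apply Bayes' theorem to mutually exclusive, exhaustive events.

    priors[i]       = P(A_i)
    conditionals[i] = P(B | A_i)
    Returns (joint probabilities, P(B), posterior probabilities).
    """
    if len(priors) != len(conditionals):
        raise ValueError("priors and conditionals must have the same length")
    joints = [p * c for p, c in zip(priors, conditionals)]  # P(A_i and B)
    p_b = sum(joints)                                       # P(B)
    if p_b == 0:
        raise ValueError("P(B) is zero; posteriors are undefined")
    posteriors = [j / p_b for j in joints]                  # P(A_i | B)
    return joints, p_b, posteriors

priors = [0.7, 0.3]        # P(A1), P(A2)
conditionals = [0.2, 0.9]  # P(B|A1), P(B|A2)

joints, p_b, posteriors = posterior_probabilities(priors, conditionals)

print([round(j, 2) for j in joints])      # [0.14, 0.27]
print(round(p_b, 2))                      # 0.41
print([round(p, 4) for p in posteriors])  # [0.3415, 0.6585]
```

The posteriors always sum to 1 because each joint probability is divided by their common total.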