1st Set of Slides - 2

Data Analytics - 1
Chapter 1
Data and Statistics
Statistics
Applications in Business and Economics
Data
Data Sources
Descriptive Statistics
Statistical Inference
Computers and Statistical Analysis
Data Mining
Ethical Guidelines for Statistical Practice

Statistics
The term statistics can refer to numerical facts such as
averages, medians, percents, and index numbers that
help us understand a variety of business and economic
situations.
Statistics can also refer to the art and science of
collecting, analyzing, presenting, and interpreting
data.
The goal of statistics is to make us informed users of
numerical information. It helps us make better
decisions.
Applications in
Business and Economics
Accounting
Public accounting firms use statistical sampling
procedures when conducting audits for their clients.
Economics
Economists use statistical information in making
forecasts about the future of the economy or some
aspect of it.
Finance
Financial advisors use price-earnings ratios and

dividend yields to guide their investment advice.
Applications in
Business and Economics
Marketing
Electronic point-of-sale scanners at retail checkout
counters are used to collect data for a variety of
marketing research applications.
Production
A variety of statistical quality control charts are used

to monitor the output of a production process.
Data and Data Sets

Data are the facts and figures collected, analyzed,

and summarized for presentation and interpretation.
All the data collected in a particular study are referred
to as the data set for the study.
Elements, Variables, and Observations

Elements are the entities on which data are collected.
A variable is a characteristic of interest for the elements.
The set of measurements obtained for a particular
element is called an observation.
A data set with n elements contains n observations.
The total number of data values in a complete data
set is the number of elements multiplied by the
number of variables.
Data, Data Sets,

Elements, Variables, and Observations
Company         Stock Exchange   Annual Sales ($M)   Earn/Share ($)
Dataram         NQ                73.10              0.86
EnergySouth     N                 74.00              1.67
Keystone        N                365.70              0.86
LandCare        NQ               111.40              0.33
Psychemedics    N                 17.60              0.13
The company names identify the elements, the other three columns are
the variables, and the table as a whole is the data set.
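The counting rule for data values can be checked on the five-company data set. The sketch below is only an illustration: the dictionary layout and the field names (exchange, annual_sales_musd, earn_per_share) are assumptions made for this example.

```python
# Each key is an element (a company); each inner dict is one observation.
# Field names are hypothetical labels for the three variables on the slide.
data_set = {
    "Dataram":      {"exchange": "NQ", "annual_sales_musd": 73.10,  "earn_per_share": 0.86},
    "EnergySouth":  {"exchange": "N",  "annual_sales_musd": 74.00,  "earn_per_share": 1.67},
    "Keystone":     {"exchange": "N",  "annual_sales_musd": 365.70, "earn_per_share": 0.86},
    "LandCare":     {"exchange": "NQ", "annual_sales_musd": 111.40, "earn_per_share": 0.33},
    "Psychemedics": {"exchange": "N",  "annual_sales_musd": 17.60,  "earn_per_share": 0.13},
}

n_elements = len(data_set)                        # one observation per element
n_variables = len(next(iter(data_set.values())))  # variables recorded per element
n_data_values = n_elements * n_variables          # elements multiplied by variables

print(n_elements, n_variables, n_data_values)
```

With 5 elements and 3 variables, the complete data set holds 15 data values.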
Scales of Measurement
Scales of measurement include:
Nominal
Ordinal
Interval
Ratio
The scale determines the amount of information

contained in the data.
The scale indicates the data summarization and
statistical analyses that are most appropriate.
Nominal
Data are labels or names used to identify an
attribute of the element.
A nonnumeric label or numeric code may be used.
Nominal
Example:
Students of a university are classified by the
school in which they are enrolled using a
nonnumeric label such as Business, Humanities,
Education, and so on.
Alternatively, a numeric code could be used for
the school variable (e.g., 1 denotes Business,
2 denotes Humanities, 3 denotes Education, and
so on).
Ordinal
The data have the properties of nominal data and

the order or rank of the data is meaningful.
A nonnumeric label or numeric code may be used.
Ordinal
Example:
Students of a university are classified by their
class standing using a nonnumeric label such as
Undergraduate, Post Graduate, or Doctoral.
Alternatively, a numeric code could be used for
the class standing variable (e.g., 1 denotes
Undergraduate, 2 denotes Post Graduate, and so on).
Interval
The data have the properties of ordinal data, and
the interval between observations is expressed in
terms of a fixed unit of measure.
Interval data are always numeric.
One unit on this scale is the same size anywhere
along the scale, so values can be treated
mathematically (e.g., averaged), but zero on the scale
does not indicate a total absence of the variable
being measured.
Interval
Example:
Aditi has an SAT score of 1205, while Arun
has an SAT score of 1090. Aditi scored 115
points more than Arun.
Ratio
The data have all the properties of interval data
and the ratio of two values is meaningful.
Variables such as distance, height, weight, and time
use the ratio scale.
This scale must contain a zero value that indicates
that nothing exists for the variable at the zero point.
Ratio
Example:
Ganesh's college record shows 36 credit hours
earned, while Govind's record shows 72 credit
hours earned. Govind has twice as many credit
hours earned as Ganesh.
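The difference between the interval and ratio scales shows up in which arithmetic is meaningful. A minimal sketch using the numbers from the two examples (the variable names are chosen here for illustration):

```python
# Interval scale: differences are meaningful (SAT scores from the example).
sat_aditi, sat_arun = 1205, 1090
sat_difference = sat_aditi - sat_arun            # points by which Aditi leads

# Ratio scale: a true zero exists, so ratios are meaningful too (credit hours).
credits_ganesh, credits_govind = 36, 72
credit_ratio = credits_govind / credits_ganesh   # how many times as many hours

print(sat_difference, credit_ratio)
```

The subtraction reproduces the 115-point gap, and the division reproduces "twice as many credit hours".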
Categorical and Quantitative Data

Data can be further classified as being categorical
or quantitative.
The statistical analysis that is appropriate depends
on whether the data for the variable are categorical
or quantitative.
In general, there are more alternatives for statistical
analysis when the data are quantitative.
Categorical Data
Labels or names used to identify an attribute of
each element
Often referred to as qualitative data
Use either the nominal or ordinal scale of
measurement
Can be either numeric or nonnumeric
Appropriate statistical analyses are rather limited
Quantitative Data
Quantitative data indicate how many or how much:
discrete, if measuring how many
continuous, if measuring how much
Quantitative data are always numeric.
Ordinary arithmetic operations are meaningful for
quantitative data.
Data
Categorical (numeric or nonnumeric): nominal or ordinal scale
Quantitative (numeric): interval or ratio scale
Cross-Sectional Data
Cross-sectional data are collected at the same or
approximately the same point in time.
Example: data detailing the number of building
permits issued in February 2010 in each of the
wards of Coimbatore
Time Series Data

Time series data are collected over several time
periods.
Example: data detailing the number of building
permits issued in Ettimadai village in each of
the last 36 months
Data Sources
Existing Sources
Internal company records: almost any department
Business database services: India Business Insight Database
Government agencies: Ministry of Labour and Employment
Industry associations: NASSCOM
Independent organizations: Centre for Monitoring Indian Economy (CMIE)
Internet: more and more firms
Data Sources
Data Available From Internal Company Records

Record               Some of the Data Available
Employee records     name, address, social security number
Production records   part number, quantity produced, direct labor cost, material cost
Inventory records    part number, quantity in stock, reorder level, economic order quantity
Sales records        product number, sales volume, sales volume by region
Credit records       customer name, credit limit, accounts receivable balance
Customer profile     age, gender, income, household size
Data Sources
Statistical Studies - Experimental

In experimental studies the variable of interest is
first identified. Then one or more other variables
are identified and controlled so that data can be
obtained about how they influence the variable of
interest.
The largest experimental study ever conducted is
believed to be the 1954 Public Health Service
experiment for the Salk polio vaccine. Nearly two
million U.S. children (grades 1-3) were selected.
Data Sources
Statistical Studies - Observational

In observational (nonexperimental) studies no
attempt is made to control or influence the
variables of interest.
A survey is a good example.
Studies of smokers and nonsmokers are
observational studies because researchers
do not determine or control
who will smoke and who will not smoke.
Data Acquisition Considerations

Time Requirement
Searching for information can be time consuming.

Information may no longer be useful by the time
it is available.
Cost of Acquisition
Organizations often charge for information even

when it is not their primary business activity.
Data Errors
Using any data that happen to be available or were

acquired with little care can lead to misleading
information.
Descriptive Statistics
Most of the statistical information in newspapers,
magazines, company reports, and other publications
consists of data that are summarized and presented
in a form that is easy to understand.
Such summaries of data, which may be tabular,

graphical, or numerical, are referred to as descriptive
statistics.
Example: Hudson Auto Repair

The manager of Hudson Auto would like to have a
better understanding of the cost of parts used in the
engine tune-ups performed in her shop. She examines
50 customer invoices for tune-ups. The costs of parts,
rounded to the nearest dollar, are listed on the next
slide.

Sample of Parts Cost ($) for 50 Tune-ups

 91   78   93   57   75   52   99   80   97   62
 71   69   72   89   66   75   79   75   72   76
104   74   62   68   97  105   77   65   80  109
 85   97   88   68   83   68   71   69   67   74
 62   82   98  101   79  105   79   69   62   73
Tabular Summary:
Frequency and Percent Frequency
Example: Hudson Auto

Parts Cost ($)   Frequency   Percent Frequency
50-59                2              4
60-69               13             26
70-79               16             32
80-89                7             14
90-99                7             14
100-109              5             10
Total               50            100
Percent frequency for the 50-59 class: (2/50)100 = 4
Graphical Summary: Histogram

Example: Hudson Auto
[Figure: histogram of tune-up parts cost. Horizontal axis: Parts Cost ($), in classes of width 10 starting at 50. Vertical axis: Frequency, 0 to 18.]
Numerical Descriptive Statistics

The most common numerical descriptive statistic
is the average (or mean).
The average provides a measure of the central
tendency, or central location, of the data for a variable.
Hudson's average cost of parts, based on the 50
tune-ups studied, is $79 (found by summing the
50 cost values and then dividing by 50).
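The averaging step can be reproduced directly from the 50 invoice values. The sketch below simply sums the values and divides by 50; the variable names are illustrative.

```python
# Parts cost ($) for the 50 tune-ups in the Hudson Auto sample.
parts_costs = [
    91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
    71, 69, 72, 89, 66, 75, 79, 75, 72, 76,
    104, 74, 62, 68, 97, 105, 77, 65, 80, 109,
    85, 97, 88, 68, 83, 68, 71, 69, 67, 74,
    62, 82, 98, 101, 79, 105, 79, 69, 62, 73,
]

total_cost = sum(parts_costs)                 # sum of the 50 cost values
average_cost = total_cost / len(parts_costs)  # divide by the number of values

print(f"total = {total_cost}, average = {average_cost:.2f}")
```

The unrounded average is 78.98, which the slide reports as $79.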
Statistical Inference
Population: the set of all elements of interest in a
particular study
Sample: a subset of the population
Statistical inference: the process of using data obtained
from a sample to make estimates and test hypotheses
about the characteristics of a population
Census: collecting data for the entire population
Sample survey: collecting data for a sample
Process of Statistical Inference

1. Population consists of all tune-ups. Average cost of
parts is unknown.
2. A sample of 50 engine tune-ups is examined.
3. The sample data provide a sample average parts cost
of $79 per tune-up.
4. The sample average is used to estimate the
population average.
Computers and Statistical Analysis

Statisticians often use computer software to perform
the statistical computations required with large
amounts of data.
Data Warehousing
Organizations obtain large amounts of data on a
daily basis by means of magnetic card readers, bar
code scanners, point of sale terminals, and touch
screen monitors.
Wal-Mart captures data on 20-30 million transactions
per day.
Visa processes 6,800 payment transactions per second.
Capturing, storing, and maintaining the data, referred
to as data warehousing, is a significant undertaking.
Data Mining
Analysis of the data in the warehouse might aid in
decisions that will lead to new strategies and higher
profits for the organization.
Using a combination of procedures from statistics,
mathematics, and computer science, analysts mine
the data to convert it into useful information.
The most effective data mining systems use automated
procedures to discover relationships in the data and
predict future outcomes, prompted by only general,
even vague, queries by the user.
Data Mining Applications

The major applications of data mining have been
made by companies with a strong consumer focus
such as retail, financial, and communication firms.
Data mining is used to identify related products that
customers who have already purchased a specific
product are also likely to purchase (and then pop-ups
are used to draw attention to those related products).
As another example, data mining is used to identify
customers who should receive special discount offers
based on their past purchasing volumes.
Data Mining Requirements

Statistical methods such as multiple regression,
logistic regression, and correlation are heavily used.
Also needed are computer science technologies

involving artificial intelligence and machine learning.
A significant investment in time and money is
required as well.
Data Mining Model Reliability

Finding a statistical model that works well for a
particular sample of data does not necessarily mean
that it can be reliably applied to other data.
With the enormous amount of data available, the
data set can be partitioned into a training set (for
model development) and a test set (for validating
the model).
There is, however, a danger of overfitting the model
to the point that misleading associations and
conclusions appear to exist.
Careful interpretation of results and extensive testing
are important.
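Partitioning a data set into a training set and a test set can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure: the function name, the 70/30 split, the fixed seed, and the placeholder records are all assumptions made for the example.

```python
import random


def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (training, test)."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical records: 100 placeholder observations.
records = list(range(100))
training_set, test_set = train_test_split(records)

print(len(training_set), len(test_set))
```

A model is developed on training_set only; checking it against test_set, which it has never seen, is one guard against overfitting.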
Ethical Guidelines for Statistical Practice

In a statistical study, unethical behavior can take a
variety of forms including:
Improper sampling
Inappropriate analysis of the data
Development of misleading graphs
Use of inappropriate summary statistics
Biased interpretation of the statistical results
You should strive to be fair, thorough, objective, and
neutral as you collect, analyze, and present data.
As a consumer of statistics, you should also be aware
of the possibility of unethical behavior by others.
Descriptive Statistics:
Tabular and Graphical Presentations

Summarizing Categorical Data

Summarizing Quantitative Data
Categorical data use labels or names
to identify categories of like items.
Quantitative data are numerical values
that indicate how much or how many.
Summarizing Categorical Data

Frequency Distribution
Relative Frequency Distribution
Percent Frequency Distribution
Bar Chart
Pie Chart
A frequency distribution is a tabular summary of
data showing the frequency (or number) of items
in each of several non-overlapping classes.
The objective is to provide insights about the data
that cannot be quickly obtained by looking only at
the original data.
Example: Marada Inn
Guests staying at Marada Inn were asked to rate the
quality of their accommodations as being excellent,
above average, average, below average, or poor. The
ratings provided by a sample of 20 guests are:
Below Average, Above Average, Above Average, Average,
Above Average, Average, Above Average, Average,
Above Average, Below Average, Poor, Excellent,
Above Average, Average, Above Average, Above Average,
Below Average, Poor, Above Average, Average
Example: Marada Inn

Rating           Frequency
Poor                 2
Below Average        3
Average              5
Above Average        9
Excellent            1
Total               20
Relative Frequency Distribution

The relative frequency of a class is the fraction or
proportion of the total number of data items
belonging to the class.
A relative frequency distribution is a tabular
summary of a set of data showing the relative
frequency for each class.
Percent Frequency Distribution

The percent frequency of a class is the relative
frequency multiplied by 100.
A percent frequency distribution is a tabular
summary of a set of data showing the percent
frequency for each class.
Relative Frequency and

Percent Frequency Distributions
Example: Marada Inn

Rating           Relative Frequency   Percent Frequency
Poor                   .10                  10
Below Average          .15                  15
Average                .25                  25
Above Average          .45                  45
Excellent              .05                   5
Total                 1.00                 100
Relative frequency for Excellent: 1/20 = .05
Percent frequency for Poor: .10(100) = 10
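The three distributions for the Marada Inn ratings can be computed from the raw responses. The sketch below uses Python's collections.Counter; the dictionary names are illustrative.

```python
from collections import Counter

# The 20 guest ratings from the Marada Inn sample.
ratings = [
    "Below Average", "Above Average", "Above Average", "Average",
    "Above Average", "Average", "Above Average", "Average",
    "Above Average", "Below Average", "Poor", "Excellent",
    "Above Average", "Average", "Above Average", "Above Average",
    "Below Average", "Poor", "Above Average", "Average",
]

frequency = Counter(ratings)   # count of items in each class
n = len(ratings)

relative_frequency = {rating: count / n for rating, count in frequency.items()}
percent_frequency = {rating: 100 * count / n for rating, count in frequency.items()}

for rating in ["Poor", "Below Average", "Average", "Above Average", "Excellent"]:
    print(f"{rating:<14} {frequency[rating]:>2} "
          f"{relative_frequency[rating]:.2f} {percent_frequency[rating]:.0f}")
```

The printed rows match the frequency, relative frequency, and percent frequency tables for the example.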
Bar Chart
A bar chart is a graphical device for depicting
qualitative data.
On one axis (usually the horizontal axis), we specify
the labels that are used for each of the classes.
A frequency, relative frequency, or percent frequency
scale can be used for the other axis (usually the
vertical axis).
Using a bar of fixed width drawn above each class
label, we extend the height appropriately.
The bars are separated to emphasize the fact that each
class is a separate category.
Bar Chart
Marada Inn Quality Ratings
[Figure: bar chart. Horizontal axis: Rating (Poor, Below Average, Average, Above Average, Excellent). Vertical axis: Frequency, 1 to 10.]
Pareto Diagram
In quality control, bar charts are used to identify the
most important causes of problems.
When the bars are arranged in descending order of
height from left to right (with the most frequently
occurring cause appearing first) the bar chart is
called a Pareto diagram.
This diagram is named for its founder, Vilfredo
Pareto, an Italian economist.
Pie Chart
The pie chart is a commonly used graphical device
for presenting relative frequency and percent
frequency distributions for categorical data.
First draw a circle; then use the relative frequencies
to subdivide the circle into sectors that correspond to
the relative frequency for each class.
Since there are 360 degrees in a circle, a class with a
relative frequency of .25 would consume .25(360) = 90
degrees of the circle.
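The sector calculation applies to every class of the Marada Inn ratings. A minimal sketch, working from the class frequencies so that the arithmetic stays exact (the names are illustrative):

```python
# Class frequencies from the Marada Inn frequency distribution.
frequency = {
    "Poor": 2,
    "Below Average": 3,
    "Average": 5,
    "Above Average": 9,
    "Excellent": 1,
}
total = sum(frequency.values())

# Sector angle = relative frequency x 360 degrees.
sector_degrees = {rating: 360 * count / total for rating, count in frequency.items()}

for rating, degrees in sector_degrees.items():
    print(f"{rating:<14} {degrees:.0f} degrees")
```

The Average class, with relative frequency .25, takes 90 degrees, and the five sectors together fill the full 360.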
Pie Chart
Marada Inn Quality Ratings
[Figure: pie chart with sectors Poor 10%, Below Average 15%, Average 25%, Above Average 45%, Excellent 5%]
Example: Marada Inn

Insights Gained from the Preceding Pie Chart
One-half of the customers surveyed gave Marada
a quality rating of above average or excellent
(looking at the left side of the pie). This might
please the manager.
For each customer who gave an excellent rating,
there were two customers who gave a poor
rating (looking at the top of the pie). This should
displease the manager.
Summarizing Quantitative Data

Dot Plot
Histogram
Cumulative Distributions
Ogive

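Example: Hudson Auto Repair
The manager of Hudson Auto would like to gain a
better understanding of the cost of parts used in the
engine tune-ups performed in the shop. She examines
50 customer invoices for tune-ups. The costs of parts,
rounded to the nearest dollar, are the 50 values listed
earlier in the Hudson Auto Repair example.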

The three steps necessary to define the classes for a
frequency distribution with quantitative data are:
1. Determine the number of non-overlapping classes.
2. Determine the width of each class.
3. Determine the class limits.
Guidelines for Determining the Number of Classes

Use between 5 and 20 classes.
Data sets with a larger number of elements

usually require a larger number of classes.
Smaller data sets usually require fewer classes.

The goal is to use enough classes to show the
variation in the data, but not so many classes
that some contain only a few data items.
Guidelines for Determining the Width of Each Class

Use classes of equal width.
Approximate Class Width =
(Largest Data Value - Smallest Data Value) / Number of Classes
Making the classes the same width reduces the chance of
inappropriate interpretations.
Note on Number of Classes and Class Width

In practice, the number of classes and the
appropriate class width are determined by trial
and error.
Once a possible number of classes is chosen, the
appropriate class width is found.
The process can be repeated for a different
number of classes.
Ultimately, the analyst uses judgment to
determine the combination of the number of
classes and class width that provides the best
frequency distribution for summarizing the data.
Guidelines for Determining the Class Limits

Class limits must be chosen so that each data
item belongs to one and only one class.
The lower class limit identifies the smallest
possible data value assigned to the class.
The upper class limit identifies the largest

possible data value assigned to the class.
The appropriate values for the class limits

depend on the level of accuracy of the data.
An open-end class requires only a
lower class limit or an upper class limit.

If we choose six classes:
Approximate Class Width = (109 - 52)/6 = 9.5 ≈ 10
Parts Cost ($)   Frequency
50-59                2
60-69               13
70-79               16
80-89                7
90-99                7
100-109              5
Total               50
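The three steps (number of classes, class width, class limits) can be traced in code for the Hudson Auto data. This is a sketch under two assumptions taken from the slide's own choices: the width of 9.5 is rounded up to 10, and the first class starts at 50.

```python
import math

parts_costs = [
    91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
    71, 69, 72, 89, 66, 75, 79, 75, 72, 76,
    104, 74, 62, 68, 97, 105, 77, 65, 80, 109,
    85, 97, 88, 68, 83, 68, 71, 69, 67, 74,
    62, 82, 98, 101, 79, 105, 79, 69, 62, 73,
]

n_classes = 6                                                     # step 1
approx_width = (max(parts_costs) - min(parts_costs)) / n_classes  # step 2
class_width = math.ceil(approx_width)     # round 9.5 up to a convenient 10
first_lower_limit = 50                    # step 3: chosen so 52 falls in class 1

# Each value belongs to exactly one class: integer division picks its index.
frequencies = [0] * n_classes
for cost in parts_costs:
    frequencies[(cost - first_lower_limit) // class_width] += 1

for i, count in enumerate(frequencies):
    lower = first_lower_limit + i * class_width
    print(f"{lower}-{lower + class_width - 1}: {count}")
```

The loop reproduces the frequency column of the table: 2, 13, 16, 7, 7, 5.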


Parts Cost ($)   Relative Frequency   Percent Frequency
50-59                  .04                   4
60-69                  .26                  26
70-79                  .32                  32
80-89                  .14                  14
90-99                  .14                  14
100-109                .10                  10
Total                 1.00                 100
For the 50-59 class: 2/50 = .04 and .04(100) = 4
Percent frequency is the relative frequency multiplied by 100.


Insights Gained from the % Frequency Distribution:
Only 4% of the parts costs are in the $50-59 class.
30% of the parts costs are under $70.
The greatest percentage (32% or almost one-third)
of the parts costs are in the $70-79 class.
10% of the parts costs are $100 or more.
Dot Plot

One of the simplest graphical summaries of data is a

dot plot.
A horizontal axis shows the range of data values.
Then each data value is represented by a dot placed
above the axis.
Dot Plot

[Figure: dot plot of tune-up parts cost. Horizontal axis: Cost ($), 50 to 110.]
Histogram
Another common graphical presentation of
quantitative data is a histogram.
The variable of interest is placed on the horizontal
axis.
A rectangle is drawn above each class interval with
its height corresponding to the interval's frequency,
relative frequency, or percent frequency.
Unlike a bar graph, a histogram has no natural
separation between rectangles of adjacent classes.
Histogram

[Figure: histogram of tune-up parts cost. Horizontal axis: Parts Cost ($), in classes of width 10 starting at 50. Vertical axis: Frequency, 0 to 18.]
Histograms Showing Skewness

Symmetric
Left tail is the mirror image of the right tail
Examples: heights and weights of people
Moderately Skewed Left
A longer tail to the left
Example: exam scores
Moderately Skewed Right
A longer tail to the right
Example: housing prices
Highly Skewed Right
A very long tail to the right
Example: executive salaries
[Figures: one histogram per case, with Relative Frequency (0 to .35) on the vertical axis]
Cumulative frequency distribution shows the
number of items with values less than or equal to the
upper limit of each class.
Cumulative relative frequency distribution shows
the proportion of items with values less than or
equal to the upper limit of each class.
Cumulative percent frequency distribution shows
the percentage of items with values less than or
equal to the upper limit of each class.
The last entry in a cumulative frequency distribution
always equals the total number of observations.
The last entry in a cumulative relative frequency
distribution always equals 1.00.
The last entry in a cumulative percent frequency
distribution always equals 100.
Hudson Auto Repair
Cost ($)   Cumulative   Cumulative Relative   Cumulative Percent
           Frequency    Frequency             Frequency
≤ 59           2            .04                    4
≤ 69          15            .30                   30
≤ 79          31            .62                   62
≤ 89          38            .76                   76
≤ 99          45            .90                   90
≤ 109         50           1.00                  100
For the ≤ 69 row: 2 + 13 = 15, 15/50 = .30, and .30(100) = 30
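The cumulative columns are running totals of the class frequencies. A short sketch using itertools.accumulate follows; the last list pairs each cumulative percent with the halfway point above its class (59.5, 69.5, and so on), which is how the points of the ogive for this example are placed.

```python
from itertools import accumulate

upper_limits = [59, 69, 79, 89, 99, 109]   # upper class limits
frequencies = [2, 13, 16, 7, 7, 5]         # class frequencies
n = sum(frequencies)

cumulative_frequency = list(accumulate(frequencies))
cumulative_relative = [c / n for c in cumulative_frequency]
cumulative_percent = [100 * c / n for c in cumulative_frequency]

# Ogive points: halfway between adjacent class limits on the horizontal axis.
ogive_points = [(limit + 0.5, pct)
                for limit, pct in zip(upper_limits, cumulative_percent)]

for limit, cf, cp in zip(upper_limits, cumulative_frequency, cumulative_percent):
    print(f"<= {limit}: {cf} ({cp:.0f}%)")
```

The last cumulative frequency equals the number of observations (50), and the last cumulative percent equals 100.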
Ogive
An ogive is a graph of a cumulative distribution.
The data values are shown on the horizontal axis.
Shown on the vertical axis are the:

cumulative frequencies, or
cumulative relative frequencies, or
cumulative percent frequencies
The frequency (one of the above) of each class is
plotted as a point.
The plotted points are connected by straight lines.
Ogive
Hudson Auto Repair

Because the class limits for the parts-cost data are
50-59, 60-69, and so on, there appear to be one-unit
gaps from 59 to 60, 69 to 70, and so on.
These gaps are eliminated by plotting points
halfway between the class limits.
Thus, 59.5 is used for the 50-59 class, 69.5 is used
for the 60-69 class, and so on.
Ogive with Cumulative Percent Frequencies

[Figure: ogive of tune-up parts cost. Horizontal axis: Parts Cost ($), 50 to 110. Vertical axis: Cumulative Percent Frequency, 20 to 100. The point (89.5, 76) is labeled.]
Descriptive Statistics:
Tabular and Graphical Presentations
Exploratory Data Analysis: Stem-and-Leaf Display
Crosstabulation and Scatter Diagram
Exploratory Data Analysis

The techniques of exploratory data analysis consist of
simple arithmetic and easy-to-draw pictures that can
be used to summarize data quickly.
One such technique is the stem-and-leaf display.
Stem-and-Leaf Display
A stem-and-leaf display shows both the rank order
and shape of the distribution of the data.
It is similar to a histogram on its side, but it has the
advantage of showing the actual data values.
The first digits of each data item are arranged to the
left of a vertical line.
To the right of the vertical line we record the last
digit for each item in rank order.
Each line in the display is referred to as a stem.
Each digit on a stem is a leaf.

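Example: Hudson Auto Repair
The manager of Hudson Auto would like to gain a
better understanding of the cost of parts used in the
engine tune-ups performed in the shop. She examines
50 customer invoices for tune-ups. The costs of parts,
rounded to the nearest dollar, are the 50 values listed
earlier in the Hudson Auto Repair example.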
Stem-and-Leaf Display
 5 | 2 7
 6 | 2 2 2 2 5 6 7 8 8 8 9 9 9
 7 | 1 1 2 2 3 4 4 5 5 5 6 7 8 9 9 9
 8 | 0 0 2 3 5 8 9
 9 | 1 3 7 7 7 8 9
10 | 1 4 5 5 9
Each row is a stem; each digit to the right of the line is a leaf.
Stretched Stem-and-Leaf Display
If we believe the original stem-and-leaf display has
condensed the data too much, we can stretch the
display vertically by using two stems for each
leading digit(s).
Whenever a stem value is stated twice, the first value
corresponds to leaf values of 0-4, and the second
value corresponds to leaf values of 5-9.
Stretched Stem-and-Leaf Display
 5 | 2
 5 | 7
 6 | 2 2 2 2
 6 | 5 6 7 8 8 8 9 9 9
 7 | 1 1 2 2 3 4 4
 7 | 5 5 5 6 7 8 9 9 9
 8 | 0 0 2 3
 8 | 5 8 9
 9 | 1 3
 9 | 7 7 7 8 9
10 | 1 4
10 | 5 5 9
Stem-and-Leaf Display
Leaf Units
A single digit is used to define each leaf.
In the preceding example, the leaf unit was 1.
Leaf units may be 100, 10, 1, 0.1, and so on.
Where the leaf unit is not shown, it is assumed
to equal 1.
The leaf unit indicates how to multiply the
stem-and-leaf numbers in order to approximate the
original data.
Example: Leaf Unit = 0.1

If we have data with values such as
8.6   11.7   9.4   9.1   10.2   11.0   8.8
a stem-and-leaf display of these data will be
Leaf Unit = 0.1
 8 | 6 8
 9 | 1 4
10 | 2
11 | 0 7
Example: Leaf Unit = 10

If we have data with values such as
1806   1717   1974   1791   1682   1910   1838
a stem-and-leaf display of these data will be
Leaf Unit = 10
16 | 8
17 | 1 9
18 | 0 3
19 | 1 7
The 82 in 1682 is rounded down to 80 and is
represented as an 8.
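Building a stem-and-leaf display with a leaf unit can be sketched as follows. The function name and its restriction to whole-number data and whole-number leaf units are assumptions made for this example; digits below the leaf unit are dropped (rounded down), as in the 1682 case.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=1):
    """Return {stem: [leaves in rank order]} for whole-number data."""
    stems = defaultdict(list)
    for value in sorted(values):              # rank order within each stem
        scaled = value // leaf_unit           # drop digits below the leaf unit
        stem, leaf = divmod(scaled, 10)       # last remaining digit is the leaf
        stems[stem].append(leaf)
    return dict(stems)


display = stem_and_leaf([1806, 1717, 1974, 1791, 1682, 1910, 1838], leaf_unit=10)

print("Leaf Unit = 10")
for stem in sorted(display):
    print(f"{stem:>3} | {' '.join(str(leaf) for leaf in display[stem])}")
```

Multiplying a stem-and-leaf number by the leaf unit approximates the original value: stem 16 with leaf 8 gives 168 x 10 = 1680 for the original 1682.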
Crosstabulations and Scatter Diagrams

Thus far we have focused on methods that are used
to summarize the data for one variable at a time.
Often a manager is interested in tabular and
graphical methods that will help understand the
relationship between two variables.
Crosstabulation and a scatter diagram are two
methods for summarizing the data for two variables
simultaneously.
Crosstabulation
A crosstabulation is a tabular summary of data for
two variables.
Crosstabulation can be used when:
one variable is qualitative and the other is
quantitative,
both variables are qualitative, or
both variables are quantitative.
The left and top margin labels define the classes for
the two variables.
Crosstabulation
Example: Finger Lakes Homes
The number of Finger Lakes homes sold for each
style and price for the past two years is shown below.
Price range is the quantitative variable; home style is the categorical variable.
                         Home Style
Price Range    Colonial   Log   Split   A-Frame   Total
< $200,000        18       6     19       12        55
≥ $200,000        12      14     16        3        45
Total             30      20     35       15       100
Crosstabulation
Insights Gained from Preceding Crosstabulation
The greatest number of homes (19) in the sample
are a split-level style and priced at less than
$200,000.
Only three homes in the sample are an A-Frame
style and priced at $200,000 or more.
Crosstabulation
The row totals (55 and 45) form the frequency distribution for
the price range variable.
The column totals (30, 20, 35, and 15) form the frequency
distribution for the home style variable.
Crosstabulation: Row or Column Percentages

Converting the entries in the table into row
percentages or column percentages can provide
additional insight about the relationship between
the two variables.
Crosstabulation: Row Percentages

                         Home Style
Price Range    Colonial    Log     Split   A-Frame   Total
< $200,000      32.73     10.91    34.55    21.82     100
≥ $200,000      26.67     31.11    35.56     6.67     100
Note: row totals are actually 100.01 due to rounding.
(Colonial and ≥ $200K)/(All ≥ $200K) x 100 = (12/45) x 100
Crosstabulation: Column Percentages

                         Home Style
Price Range    Colonial    Log     Split   A-Frame
< $200,000      60.00     30.00    54.29    80.00
≥ $200,000      40.00     70.00    45.71    20.00
Total            100       100      100      100
(Colonial and ≥ $200K)/(All Colonial) x 100 = (12/30) x 100
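Row and column percentages both come from the same table of counts; only the divisor changes. A minimal sketch for the Finger Lakes Homes crosstabulation follows (the dictionary keys under_200k and at_least_200k are labels chosen for this example):

```python
styles = ["Colonial", "Log", "Split", "A-Frame"]

# Counts of homes sold by price range (rows) and home style (columns).
counts = {
    "under_200k":    {"Colonial": 18, "Log": 6,  "Split": 19, "A-Frame": 12},
    "at_least_200k": {"Colonial": 12, "Log": 14, "Split": 16, "A-Frame": 3},
}

row_totals = {price: sum(row.values()) for price, row in counts.items()}
column_totals = {s: sum(counts[price][s] for price in counts) for s in styles}

# Row percentage: cell count as a percent of its price-range total.
row_percent = {price: {s: round(100 * counts[price][s] / row_totals[price], 2)
                       for s in styles} for price in counts}

# Column percentage: cell count as a percent of its home-style total.
column_percent = {price: {s: round(100 * counts[price][s] / column_totals[s], 2)
                          for s in styles} for price in counts}

print(row_percent["at_least_200k"]["Colonial"])     # (12/45) x 100
print(column_percent["at_least_200k"]["Colonial"])  # (12/30) x 100
```

Row percentages describe the mix of styles within a price range; column percentages describe the mix of price ranges within a style.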
Crosstabulation: Simpson's Paradox

Data in two or more crosstabulations are often
aggregated to produce a summary crosstabulation.
We must be careful in drawing conclusions about the
relationship between the two variables in the
aggregated crosstabulation.
In some cases the conclusions based upon an
aggregated crosstabulation can be completely
reversed if we look at the unaggregated data. The
reversal of conclusions based on aggregate and
unaggregated data is called Simpson's paradox.
Illustration of Simpson's Paradox

Example: Analysis of verdicts for two judges in two
different courts.
Judges Ron Luckett and Dennis Kendall presided
over cases in Common Pleas Court and Municipal
Court during the past three years.
Some of the verdicts they rendered were appealed.
In most of these cases the appeals court upheld the
original verdicts, but in some cases those verdicts
were reversed.

For each judge a crosstabulation was developed based
upon two variables: Verdict (upheld or reversed) and
Type of Court (Common Pleas and Municipal).
Suppose that the two crosstabulations were then
combined by aggregating the type of court data.
The resulting aggregated crosstabulation contains two
variables: Verdict (upheld or reversed) and Judge
(Luckett or Kendall).
This crosstabulation shows the number of appeals in
which the verdict was upheld and the number in
which the verdict was reversed for both judges.
The crosstabulation on the next slide shows these
results along with the column percentages in
parentheses next to each value.
Who is doing a better job?

A review of the column percentages shows that 86%
of the verdicts were upheld for Judge Luckett, while
88% of the verdicts were upheld for Judge Kendall.
From this aggregated crosstabulation, we conclude
that Judge Kendall is doing the better job because a
greater percentage of Judge Kendall's verdicts are
being upheld.
The following unaggregated crosstabulations show

the cases tried by Judge Luckett and Judge Kendall in
each court; column percentages are shown in
parentheses next to each value.
Now, who is performing better?
When we unaggregate the data, we see that Judge
Luckett has a better record because a greater
percentage of Judge Luckett's verdicts are being
upheld in both courts.
This result contradicts the conclusion we reached
with the aggregated data crosstabulation that
showed Judge Kendall had the better record.
This reversal of conclusions based on aggregated and
unaggregated data illustrates Simpson's paradox.
The original crosstabulation was obtained by

aggregating the data in the separate crosstabulations
for the two courts.
Note that for both judges the percentage of appeals
that resulted in reversals was much higher
g
in
Municipal Court than in Common Pleas Court.
Because Judge Luckett tried a much higher
percentage of his cases in Municipal Court, the
aggregated data favored Judge Kendall.
When we look at the crosstabulations for the two
courts separately, however, Judge Luckett shows the
better record.
Thus, for the original crosstabulation, we see that the
type of court is a hidden variable that cannot be ignored
when evaluating the records of the two judges.
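The reversal described above can be reproduced with a short sketch. The per-court counts below are hypothetical (the crosstabulation slides themselves are not reproduced in this text); they were chosen only so that the aggregated percentages match the 86% and 88% quoted above.

```python
# Hypothetical (upheld, reversed) counts per judge and court, chosen so the
# aggregated percentages match the 86% / 88% quoted in the slides.
cases = {
    "Luckett": {"Common Pleas": (29, 3), "Municipal": (100, 18)},
    "Kendall": {"Common Pleas": (90, 10), "Municipal": (20, 5)},
}

def pct_upheld(upheld, reversed_):
    """Percentage of verdicts upheld on appeal."""
    return 100 * upheld / (upheld + reversed_)

# Unaggregated view: one percentage per judge and court.
by_court = {
    judge: {court: pct_upheld(*counts) for court, counts in courts.items()}
    for judge, courts in cases.items()
}

# Aggregated view: the type of court is ignored.
aggregated = {
    judge: pct_upheld(sum(u for u, _ in courts.values()),
                      sum(r for _, r in courts.values()))
    for judge, courts in cases.items()
}

for judge in cases:
    print(judge, by_court[judge], round(aggregated[judge]))
```

With these counts Luckett has the higher upheld percentage in each court, yet the lower percentage once the courts are combined, because most of Luckett's cases come from the court with more reversals.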
Scatter Diagram and Trendline

A scatter diagram is a graphical presentation of the
relationship between two quantitative variables.
One variable is shown on the horizontal axis and
the other variable is shown on the vertical axis.
The general pattern of the plotted points suggests
the overall relationship between the variables.
A trendline provides an approximation of the
relationship.
[Figures: scatter diagrams illustrating a positive relationship, a negative relationship, and no apparent relationship]
Scatter Diagram
Example: Panthers Football Team
The Panthers football team is interested in
investigating the relationship, if any, between
interceptions made and points scored.
x = Number of Interceptions    y = Number of Points Scored
1                              14
3                              24
2                              18
1                              17
3                              30
Scatter Diagram
[Figure: scatter diagram of the Panthers data; horizontal axis: Number of Interceptions (x), vertical axis: Number of Points Scored (y)]
Example: Panthers Football Team

Insights Gained from the Preceding Scatter Diagram
The scatter diagram indicates a positive relationship
between the number of interceptions and the
number of points scored.
Higher points scored are associated with a higher
number of interceptions.
The relationship is not perfect; not all plotted points in
the scatter diagram lie on a straight line.
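As a rough illustration of a trendline for the Panthers data, the sketch below fits a straight line by least squares. The slides do not say how the trendline is fitted, so the least squares choice is an assumption made here.

```python
# Panthers data from the example above.
x = [1, 3, 2, 1, 3]          # number of interceptions
y = [14, 24, 18, 17, 30]     # number of points scored

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and intercept (assumed fitting method).
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

print(f"trendline: y = {intercept:.2f} + {slope:.2f}x")
```

The positive slope agrees with the positive relationship seen in the scatter diagram.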
Scatter Diagram and Trendline
[Figure: scatter diagram for the Panthers with trendline; horizontal axis: Number of Interceptions, vertical axis: Number of Points Scored]
Tabular and Graphical Methods

Categorical Data
  Tabular Methods: Frequency Distribution, Rel. Freq. Dist., Percent Freq. Distribution, Crosstabulation
  Graphical Methods: Bar Chart, Pie Chart
Quantitative Data
  Tabular Methods: Frequency Distribution, Rel. Freq. Dist., % Freq. Dist., Cum. Freq. Distribution, Cum. Rel. Freq. Distribution, Cum. % Freq. Distribution, Crosstabulation
  Graphical Methods: Dot Plot, Histogram, Ogive, Stem-and-Leaf Display, Scatter Diagram
Descriptive Statistics: Numerical Measures

Measures of Location
Measures of Variability
Measures of Location

Mean
Median
Mode
Percentiles
Quartiles
If the measures are computed for data from a sample,
they are called sample statistics.
If the measures are computed for data from a population,
they are called population parameters.
A sample statistic is referred to as the point estimator of the
corresponding population parameter.
Mean

Perhaps the most important measure of location is the mean.
The mean provides a measure of central location.
The mean of a data set is the average of all the data
values.
The sample mean $\bar{x}$ is the point estimator of the
population mean $\mu$.
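Sample Mean
$$\bar{x} = \frac{\sum x_i}{n}$$
where $\sum x_i$ is the sum of the values of the n observations and n is the number of observations in the sample.
Population Mean
$$\mu = \frac{\sum x_i}{N}$$
where $\sum x_i$ is the sum of the values of the N observations and N is the number of observations in the population.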
Sample Mean
Example: Apartment Rents
Seventy efficiency apartments were randomly
sampled in a small college town. The monthly rent
prices for these apartments are listed below.
445 615 430 590 435 600 460 600 440 615
440 440 440 525 425 445 575 445 450 450
465 450 525 450 450 460 435 460 465 480
450 470 490 472 475 475 500 480 570 465
600 485 580 470 490 500 549 500 500 480
570 515 450 445 525 535 475 550 480 510
510 575 490 435 600 435 445 435 430 440
Sample Mean
$$\bar{x} = \frac{\sum x_i}{n} = \frac{34{,}356}{70} = 490.80$$
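The sample mean for the apartment rents can be checked with a few lines of Python. This is a minimal sketch; the list name `rents` is chosen here for illustration.

```python
# The 70 monthly rents from the example, in ascending order.
rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

total = sum(rents)             # sum of the values of the n observations
n = len(rents)                 # number of observations in the sample
sample_mean = total / n

print(total, n, round(sample_mean, 2))
```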
Median
The median of a data set is the value in the middle
when the data items are arranged in ascending order.
Whenever a data set has extreme values, the median
is the preferred measure of central location.
The median is the measure of location most often
reported for annual income and property value data.
A few extremely large incomes or property values
can inflate the mean.
Median
For an odd number of observations:
26 18 27 12 14 27 19    (7 observations)
12 14 18 19 26 27 27    (in ascending order)
The median is the middle value.
Median = 19
Median
For an even number of observations:
26 18 27 12 14 27 30 19    (8 observations)
12 14 18 19 26 27 27 30    (in ascending order)
The median is the average of the middle two values.
Median = (19 + 26)/2 = 22.5
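The odd and even cases above can be expressed as one small function. This is an illustrative sketch; the function name `median` is chosen here, and Python's standard `statistics.median` gives the same results.

```python
def median(values):
    """Middle value for an odd count; average of the two middle values otherwise."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([26, 18, 27, 12, 14, 27, 19]))        # 7 observations
print(median([26, 18, 27, 12, 14, 27, 30, 19]))    # 8 observations
```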
Median
Averaging the 35th and 36th data values:
Median = (475 + 475)/2 = 475
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
Note: Data are in ascending order, reading across each row.
Trimmed Mean
Another measure, sometimes used when extreme
values are present, is the trimmed mean.
It is obtained by deleting a percentage of the
smallest and largest values from a data set and then
computing the mean of the remaining values.
For example, the 5% trimmed mean is obtained by
removing the smallest 5% and the largest 5% of the
data values and then computing the mean of the
remaining values.
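A trimmed mean can be sketched as follows. The slides do not say how to handle a percentage that does not correspond to a whole number of observations, so rounding the count down is an assumption made here; the function name and the sample values are hypothetical.

```python
def trimmed_mean(values, percent):
    """Mean after dropping `percent`% of the values from each end."""
    ordered = sorted(values)
    # Assumption: the number of values trimmed from each end is rounded down.
    k = int(len(ordered) * percent / 100)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 100]
print(sum(data) / len(data))     # ordinary mean, inflated by 100
print(trimmed_mean(data, 20))    # 20% trimmed mean drops 1 and 100
```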
Mode
The mode of a data set is the value that occurs with
greatest frequency.
The greatest frequency can occur at two or more
different values.
If the data have exactly two modes, the data are
bimodal.
If the data have more than two modes, the data are
multimodal.
Caution: If the data are bimodal or multimodal,
Excel's MODE function will incorrectly identify a
single mode.
Mode
450 occurred most frequently (7 times)
Mode = 450
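As a sketch of how all modes can be reported rather than a single one, Python's standard library offers `statistics.multimode` (available from Python 3.8). The bimodal list below is hypothetical.

```python
from statistics import multimode

rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

rent_modes = multimode(rents)              # every value with the greatest frequency
bimodal_modes = multimode([1, 1, 2, 2, 3]) # hypothetical bimodal data

print(rent_modes, rents.count(450))
print(bimodal_modes)
```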
Percentiles
A percentile provides information about how the
data are spread over the interval from the smallest
value to the largest value.
Admission test scores for colleges and universities
are frequently reported in terms of percentiles.
The pth percentile of a data set is a value such that at

least p percent of the items take on this value or less
and at least (100 - p) percent of the items take on this
value or more.
Percentiles
Arrange the data in ascending order.
Compute index i, the position of the pth percentile:
$$i = (p/100)\,n$$
If i is not an integer, round up. The pth percentile
is the value in the ith position.
If i is an integer, the pth percentile is the average
of the values in positions i and i + 1.
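The procedure above can be sketched as a function. The name `percentile` is chosen here, and restricting p to whole numbers (so that the integer check on i is exact) is an assumption of this sketch. Note that library routines such as NumPy's use interpolation rules that can give slightly different answers.

```python
def percentile(values, p):
    """pth percentile by the slide's rule; p is a whole number with 0 < p < 100."""
    ordered = sorted(values)
    n = len(ordered)
    whole, remainder = divmod(p * n, 100)   # i = (p/100)n, kept in integer arithmetic
    if remainder:
        # i is not an integer: round up to position whole + 1 (0-based index `whole`).
        return ordered[whole]
    # i is an integer: average the values in positions i and i + 1.
    return (ordered[whole - 1] + ordered[whole]) / 2

rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

print(percentile(rents, 80), percentile(rents, 50))
```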
80th Percentile
i = (p/100)n = (80/100)70 = 56
Averaging the 56th and 57th data values:
80th Percentile = (535 + 549)/2 = 542
80th Percentile
At least 80% of the items take on a value of 542 or less: 56/70 = .8 or 80%
At least 20% of the items take on a value of 542 or more: 14/70 = .2 or 20%
Quartiles
Quartiles are specific percentiles.
First Quartile = 25th Percentile
Second Quartile = 50th Percentile = Median
Third Quartile = 75th Percentile
Third Quartile
Third quartile = 75th percentile
i = (p/100)n = (75/100)70 = 52.5, which rounds up to 53
Third quartile = 525 (the value in the 53rd position)
It is often desirable to consider measures of variability
(dispersion), as well as measures of location.
For example, in choosing supplier A or supplier B we
might consider not only the average delivery time for
each, but also the variability in delivery time for each.
Range
Interquartile Range
Variance
Standard Deviation
Coefficient of Variation
Range
The range of a data set is the difference between the
largest and smallest data values.
It is the simplest measure of variability.
It is very sensitive to the smallest and largest data
values.
Range
Range = largest value - smallest value
Range = 615 - 425 = 190
Interquartile Range
The interquartile range of a data set is the difference
between the third quartile and the first quartile.
It is the range for the middle 50% of the data.
It overcomes the sensitivity to extreme data values.
Interquartile Range
3rd Quartile (Q3) = 525
1st Quartile (Q1) = 445
Interquartile Range = Q3 - Q1 = 525 - 445 = 80
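The range and interquartile range for the rents can be checked with the sketch below. It relies on the fact that, for n = 70, the index i is not an integer for either quartile, so only the round-up branch of the percentile rule is needed; the variable names are chosen here.

```python
import math

rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

ordered = sorted(rents)
n = len(ordered)

data_range = ordered[-1] - ordered[0]

# i = 17.5 and 52.5 are not integers, so round up to positions 18 and 53.
q1 = ordered[math.ceil(25 * n / 100) - 1]
q3 = ordered[math.ceil(75 * n / 100) - 1]
iqr = q3 - q1

print(data_range, q1, q3, iqr)
```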
Variance
The variance is a measure of variability that utilizes
all the data.
It is based on the difference between the value of
each observation ($x_i$) and the mean ($\bar{x}$ for a sample,
$\mu$ for a population).
The variance is useful in comparing the variability
of two or more variables.
Variance
The variance is the average of the squared
differences between each data value and the mean.
The variance is computed as follows:
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
for a sample
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$
for a population
Standard Deviation
The standard deviation of a data set is the positive
square root of the variance.
It is measured in the same units as the data, making
it more easily interpreted than the variance.
Standard Deviation
The standard deviation is computed as follows:
$$s = \sqrt{s^2}$$
for a sample
$$\sigma = \sqrt{\sigma^2}$$
for a population
The coefficient of variation indicates how large the
standard deviation is in relation to the mean.
The coefficient of variation is computed as follows:
$$\left(\frac{s}{\bar{x}} \times 100\right)\%$$
for a sample
$$\left(\frac{\sigma}{\mu} \times 100\right)\%$$
for a population
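Sample Variance, Standard Deviation,
and Coefficient of Variation
Variance
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 2{,}996.16$$
Standard Deviation
$$s = \sqrt{s^2} = \sqrt{2996.16} = 54.74$$
Coefficient of Variation
$$\left(\frac{s}{\bar{x}} \times 100\right)\% = \left(\frac{54.74}{490.80} \times 100\right)\% = 11.15\%$$
The standard deviation is about 11% of the mean.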
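The three results above can be reproduced directly from the rent data. This is a minimal sketch using only the formulas on the slides; the variable names are chosen here.

```python
import math

rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

n = len(rents)
mean = sum(rents) / n

# Sample variance: squared deviations from the mean, divided by n - 1.
variance = sum((x - mean) ** 2 for x in rents) / (n - 1)
std_dev = math.sqrt(variance)
coeff_var = std_dev / mean * 100     # in percent

print(round(variance, 2), round(std_dev, 2), round(coeff_var, 2))
```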
Descriptive Statistics: Numerical Measures

Measures of Distribution Shape, Relative Location,

and Detecting Outliers
Measures of Association Between Two Variables
The Weighted Mean and

Working with Grouped Data
Measures of Distribution Shape,

Relative Location, and Detecting Outliers

Distribution Shape
z-Scores
Chebyshev's Theorem
Empirical Rule
Detecting Outliers
Distribution Shape: Skewness

An important measure of the shape of a distribution

is called skewness.
The formula for the skewness of sample data is
$$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum \left(\frac{x_i - \bar{x}}{s}\right)^3$$
Skewness can be easily computed using statistical

software.
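A sketch of the sample skewness formula follows, taking s to be the sample standard deviation. The function name and the three small data sets are hypothetical and are used only to show the sign of the result.

```python
import math

def sample_skewness(values):
    """n / ((n-1)(n-2)) times the sum of cubed standardized deviations."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)

symmetric = [1, 2, 3, 4, 5]          # hypothetical data, no skew
right_tail = [1, 2, 3, 4, 20]        # one large value pulls the tail right
left_tail = [1, 17, 18, 19, 20]      # one small value pulls the tail left

print(sample_skewness(symmetric))
print(sample_skewness(right_tail))
print(sample_skewness(left_tail))
```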

Symmetric (not skewed)
Skewness is zero.
Mean and median are equal.
[Figure: relative frequency histogram of a symmetric distribution; Skewness = 0]

Moderately Skewed Left
Skewness is negative.
Mean will usually be less than the median.
[Figure: relative frequency histogram of a moderately left-skewed distribution; Skewness = -.31]

Moderately Skewed Right
Skewness is positive.
Mean will usually be more than the median.
[Figure: relative frequency histogram of a moderately right-skewed distribution; Skewness = .31]

Highly Skewed Right
Skewness is positive (often above 1.0).
Mean will usually be more than the median.
[Figure: relative frequency histogram of a highly right-skewed distribution; Skewness = 1.25]


Example: Apartment Rents
Seventy efficiency apartments were randomly
sampled in a college town. The monthly rent prices
for the apartments are the same 70 values used in the earlier examples.

[Figure: relative frequency histogram of the apartment rents; Skewness = .92]
z-Scores
The z-score is often called the standardized value.
It denotes the number of standard deviations a data
value xi is from the mean.
$$z_i = \frac{x_i - \bar{x}}{s}$$
Excel's STANDARDIZE function can be used to
compute the z-score.
z-Scores
An observation's z-score is a measure of the relative
location of the observation in a data set.
A data value less than the sample mean will have a
z-score less than zero.
A data value greater than the sample mean will have
a z-score greater than zero.
A data value equal to the sample mean will have a
z-score of zero.
z-Scores
z-Score of Smallest Value (425)
$$z = \frac{x_i - \bar{x}}{s} = \frac{425 - 490.80}{54.74} = -1.20$$
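The same calculation can be sketched in Python for the smallest and largest rents, using the sample mean and standard deviation reported on the slides. The function name `z_score` is chosen here.

```python
X_BAR = 490.80   # sample mean of the rents (from the slides)
S = 54.74        # sample standard deviation of the rents (from the slides)

def z_score(x, mean=X_BAR, std_dev=S):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

z_smallest = z_score(425)
z_largest = z_score(615)
print(round(z_smallest, 2), round(z_largest, 2))
```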
Standardized Values for Apartment Rents
-1.20 -1.11 -1.11 -1.02 -1.02 -1.02 -1.02 -1.02 -0.93 -0.93
-0.93 -0.93 -0.93 -0.84 -0.84 -0.84 -0.84 -0.84 -0.75 -0.75
-0.75 -0.75 -0.75 -0.75 -0.75 -0.56 -0.56 -0.56 -0.47 -0.47
-0.47 -0.38 -0.38 -0.34 -0.29 -0.29 -0.29 -0.20 -0.20 -0.20
-0.20 -0.11 -0.01 -0.01 -0.01  0.17  0.17  0.17  0.17  0.35
 0.35  0.44  0.62  0.62  0.62  0.81  1.06  1.08  1.45  1.45
 1.54  1.54  1.63  1.81  1.99  1.99  1.99  1.99  2.27  2.27
Chebyshev's Theorem
At least $(1 - 1/z^2)$ of the items in any data set will be
within z standard deviations of the mean, where z is
any value greater than 1.
Chebyshev's theorem requires z > 1, but z need not
be an integer.
Chebyshev's Theorem
At least 75% of the data values must be
within z = 2 standard deviations of the mean.
Chebyshev's Theorem
Let z = 1.5 with $\bar{x}$ = 490.80 and s = 54.74
At least $(1 - 1/(1.5)^2) = 1 - 0.44 = 0.56$ or 56%
of the rent values must be between
$\bar{x} - z(s) = 490.80 - 1.5(54.74) = 409$
and
$\bar{x} + z(s) = 490.80 + 1.5(54.74) = 573$
(Actually, 86% of the rent values
are between 409 and 573.)
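The comparison between the Chebyshev bound and the actual share of rents can be sketched as follows, using the mean and standard deviation from the slides; the variable names are chosen here.

```python
rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

x_bar, s, z = 490.80, 54.74, 1.5

bound = 1 - 1 / z ** 2              # Chebyshev's guaranteed minimum share
lower = x_bar - z * s
upper = x_bar + z * s

within = sum(lower <= x <= upper for x in rents)
actual_share = within / len(rents)

print(round(bound, 2), round(lower), round(upper), within, round(actual_share, 2))
```

The actual share is well above the bound, which is expected: the theorem gives a minimum that holds for any data set.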
Empirical Rule
When the data are believed to approximate a
bell-shaped distribution, the empirical rule can be
used to determine the percentage of data values that
must be within a specified number of standard
deviations of the mean.
The empirical rule is based on the normal
distribution (which will be discussed later).
Empirical Rule
For data having a bell-shaped distribution:
68.26% of the values of a normal random variable
are within +/- 1 standard deviation of its mean.
95.44% of the values of a normal random variable
are within +/- 2 standard deviations of its mean.
99.72% of the values of a normal random variable
are within +/- 3 standard deviations of its mean.
Empirical Rule
[Figure: normal curve showing 68.26% of values within $\mu \pm 1\sigma$, 95.44% within $\mu \pm 2\sigma$, and 99.72% within $\mu \pm 3\sigma$]
Detecting Outliers
An outlier is an unusually small or unusually large
value in a data set.
A data value with a z-score less than -3 or greater
than +3 might be considered an outlier.
It might be:
an incorrectly recorded data value
a data value that was incorrectly included in the
data set
a correctly recorded data value that belongs in
the data set
Detecting Outliers
The most extreme z-scores are -1.20 and 2.27.
Using |z| > 3 as the criterion for an outlier, there
are no outliers in this data set.

Exploratory data analysis procedures enable us to use
simple arithmetic and easy-to-draw pictures to
summarize data.
We simply sort the data values into ascending order
and identify the five-number summary and then
construct a box plot.
Five-Number Summary
1. Smallest Value
2. First Quartile
3. Median
4. Third Quartile
5. Largest Value
Five-Number Summary
Lowest Value = 425
First Quartile = 445
Median = 475
Third Quartile = 525
Largest Value = 615
Box Plot
A box plot is a graphical summary of data that is
based on a five-number summary.
A key to the development of a box plot is the
computation of the median and the quartiles Q1 and
Q3.
Box plots provide another way to identify outliers.
Box Plot
A box is drawn with its ends located at the first and
third quartiles.
A vertical line is drawn in the box at the location of
the median (second quartile).
[Figure: box drawn over an axis running from 400 to 625, with Q1 = 445, Q2 = 475, and Q3 = 525 marked]
Box Plot
Limits are located (not drawn) using the interquartile
range (IQR).
Data outside these limits are considered outliers.
The location of each outlier is shown with the
symbol * .
Box Plot
The lower limit is located 1.5(IQR) below Q1.

Lower Limit: Q1 - 1.5(IQR) = 445 - 1.5(80) = 325
The upper limit is located 1.5(IQR) above Q3.

Upper Limit: Q3 + 1.5(IQR) = 525 + 1.5(80) = 645
There are no outliers (values less than 325 or

greater than 645) in the apartment rent data.
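The limit calculation above can be sketched as a small helper. The function names are chosen here, and the second list passed to `find_outliers` is hypothetical data used only to show a case where outliers are flagged.

```python
def box_plot_limits(q1, q3):
    """Lower and upper limits located 1.5(IQR) below Q1 and above Q3."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values, q1, q3):
    """Values that fall outside the box plot limits."""
    lower, upper = box_plot_limits(q1, q3)
    return [x for x in values if x < lower or x > upper]

lower, upper = box_plot_limits(445, 525)            # quartiles from the rent data
print(lower, upper)
print(find_outliers([425, 475, 615], 445, 525))     # smallest, median, largest rent
print(find_outliers([300, 450, 700], 445, 525))     # hypothetical values
```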
Box Plot
Whiskers (dashed lines) are drawn from the ends
of the box to the smallest and largest data values
inside the limits.
[Figure: box plot over an axis running from 400 to 625, with whiskers extending to the smallest value inside the limits (425) and the largest value inside the limits (615)]
Box Plot
An excellent graphical technique for making
comparisons among two or more groups.
Measures of Association
Between Two Variables
Thus far we have examined numerical methods used
to summarize the data for one variable at a time.
Often a manager or decision maker is interested in
the relationship between two variables.
Two descriptive measures of the relationship
between two variables are covariance and correlation
coefficient.
Covariance
The covariance is a measure of the linear association
between two variables.
Positive values indicate a positive relationship.
Negative values indicate a negative relationship.
Covariance
The covariance is computed as follows:
$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
for samples
$$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$$
for populations
Correlation Coefficient
Correlation is a measure of linear association and not
necessarily causation.
Just because two variables are highly correlated, it
does not mean that one variable is the cause of the
other.
The correlation coefficient is computed as follows:
$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$
for samples
$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
for populations
The coefficient can take on values between -1 and +1.
Values near -1 indicate a strong negative linear
relationship.
Values near +1 indicate a strong positive linear
relationship.
The closer the correlation is to zero, the weaker the
relationship.
Covariance and Correlation Coefficient

Example: Golfing Study
A golfer is interested in investigating the
relationship, if any, between driving distance and
18-hole score.
Average Driving Distance (yds.)    Average 18-Hole Score
277.6                              69
259.5                              71
269.1                              70
267.0                              70
255.6                              71
272.9                              69

x          y       (x - x̄)    (y - ȳ)    (x - x̄)(y - ȳ)
277.6      69       10.65      -1.0       -10.65
259.5      71       -7.45       1.0        -7.45
269.1      70        2.15       0           0
267.0      70        0.05       0           0
255.6      71      -11.35       1.0       -11.35
272.9      69        5.95      -1.0        -5.95
Average    267.0    70.0                  Total -35.40
Std. Dev.  8.2192   .8944
Sample Covariance
$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{-35.40}{6 - 1} = -7.08$$
Sample Correlation Coefficient
$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{-7.08}{(8.2192)(.8944)} = -.9631$$
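The golfing study calculations can be reproduced with a short sketch based on the sample formulas above; the variable names are chosen here.

```python
import math

distance = [277.6, 259.5, 269.1, 267.0, 255.6, 272.9]   # x: average driving distance
score = [69, 71, 70, 70, 71, 69]                        # y: average 18-hole score

n = len(distance)
x_bar = sum(distance) / n
y_bar = sum(score) / n

# Sample covariance: cross products of deviations, divided by n - 1.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(distance, score)) / (n - 1)

# Sample standard deviations of x and y.
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in distance) / (n - 1))
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in score) / (n - 1))

r_xy = s_xy / (s_x * s_y)

print(round(s_xy, 2), round(s_x, 4), round(s_y, 4), round(r_xy, 4))
```

The strongly negative correlation says that longer average drives go with lower (better) scores in this sample; it does not by itself show that one causes the other.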
The Weighted Mean and

Working with Grouped Data
Weighted Mean
Mean for Grouped Data
Variance for Grouped Data
Standard Deviation for Grouped Data
Weighted Mean
When the mean is computed by giving each data
value a weight that reflects its importance, it is
referred to as a weighted mean.
In the computation of a grade point average (GPA),
the weights are the number of credit hours earned for
each grade.
When data values vary in importance, the analyst
must choose the weight that best reflects the
importance of each value.
Weighted Mean
$$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$$
where:
$x_i$ = value of observation i
$w_i$ = weight for observation i
Grouped Data
The weighted mean computation can be used to
obtain approximations of the mean, variance, and
standard deviation for the grouped data.
To compute the weighted mean, we treat the
midpoint of each class as though it were the mean
of all items in the class.
We compute a weighted mean of the class midpoints
using the class frequencies as weights.
Similarly, in computing the variance and standard
deviation, the class frequencies are used as weights.
Mean for Grouped Data

Sample Data
$$\bar{x} = \frac{\sum f_i M_i}{n}$$
Population Data
$$\mu = \frac{\sum f_i M_i}{N}$$
where:
$f_i$ = frequency of class i
$M_i$ = midpoint of class i
Sample Mean for Grouped Data

Example:
The previously presented sample of apartment
rents is shown here as grouped data in the form of
a frequency distribution.
Rent ($)    Frequency
420-439      8
440-459     17
460-479     12
480-499      8
500-519      7
520-539      4
540-559      2
560-579      4
580-599      2
600-619      6
Sample Mean for Grouped Data

Example: Apartment Rents
Rent ($)    f_i    M_i      f_i M_i
420-439      8    429.5     3436.0
440-459     17    449.5     7641.5
460-479     12    469.5     5634.0
480-499      8    489.5     3916.0
500-519      7    509.5     3566.5
520-539      4    529.5     2118.0
540-559      2    549.5     1099.0
560-579      4    569.5     2278.0
580-599      2    589.5     1179.0
600-619      6    609.5     3657.0
Total       70             34525.0
$$\bar{x} = \frac{34{,}525}{70} = 493.21$$
This approximation differs by $2.41 from
the actual sample mean of $490.80.
Variance for Grouped Data

For sample data
$$s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n - 1}$$
For population data
$$\sigma^2 = \frac{\sum f_i (M_i - \mu)^2}{N}$$
Sample Variance for Grouped Data

Rent ($)    f_i    M_i      M_i - x̄    (M_i - x̄)^2    f_i (M_i - x̄)^2
420-439      8    429.5     -63.7       4058.96        32471.71
440-459     17    449.5     -43.7       1910.56        32479.59
460-479     12    469.5     -23.7        562.16         6745.97
480-499      8    489.5      -3.7         13.76          110.11
500-519      7    509.5      16.3        265.36         1857.55
520-539      4    529.5      36.3       1316.96         5267.86
540-559      2    549.5      56.3       3168.56         6337.13
560-579      4    569.5      76.3       5820.16        23280.66
580-599      2    589.5      96.3       9271.76        18543.53
600-619      6    609.5     116.3      13523.36        81140.18
Total       70                                        208234.29
Sample Variance for Grouped Data

Sample Variance
$$s^2 = \frac{208{,}234.29}{70 - 1} = 3{,}017.89$$
Sample Standard Deviation
$$s = \sqrt{3{,}017.89} = 54.94$$
This approximation differs by only $.20
from the actual standard deviation of $54.74.
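The grouped-data approximations can be reproduced from the frequency distribution alone. This is a minimal sketch of the formulas above; the variable names are chosen here.

```python
import math

# Class frequencies and midpoints from the apartment rent frequency distribution.
freqs = [8, 17, 12, 8, 7, 4, 2, 4, 2, 6]
midpoints = [429.5, 449.5, 469.5, 489.5, 509.5, 529.5, 549.5, 569.5, 589.5, 609.5]

n = sum(freqs)

# Weighted mean of the midpoints, with class frequencies as weights.
grouped_mean = sum(f * m for f, m in zip(freqs, midpoints)) / n

# Sample variance and standard deviation for grouped data.
grouped_var = sum(f * (m - grouped_mean) ** 2 for f, m in zip(freqs, midpoints)) / (n - 1)
grouped_sd = math.sqrt(grouped_var)

print(n, round(grouped_mean, 2), round(grouped_var, 2), round(grouped_sd, 2))
```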
Chapter 4
Introduction to Probability
Experiments, Counting Rules,
and Assigning Probabilities
Events and Their Probability
Some Basic Relationships
of Probability
Conditional Probability
Bayes' Theorem
Uncertainties
Managers often base their decisions on an analysis
of uncertainties such as the following:
What are the chances that sales will decrease
if we increase prices?
What is the likelihood a new assembly method
will increase productivity?
What are the odds that a new investment will
be profitable?
Probability
Probability is a numerical measure of the likelihood
that an event will occur.
Probability values are always assigned on a scale
from 0 to 1.
A probability near zero indicates an event is quite
unlikely to occur.
A probability near one indicates an event is almost
certain to occur.
Probability as a Numerical Measure

of the Likelihood of Occurrence
Increasing Likelihood of Occurrence
Probability 0: The event is very unlikely to occur.
Probability .5: The occurrence of the event is just as likely as it is unlikely.
Probability 1: The event is almost certain to occur.
Statistical Experiments
In statistics, the notion of an experiment differs
somewhat from that of an experiment in the
physical sciences.
In statistical experiments, probability determines
outcomes.
Even though the experiment is repeated in exactly
the same way, an entirely different outcome may
occur.
For this reason, statistical experiments are sometimes
called random experiments.
An Experiment and Its Sample Space

An experiment is any process that generates
well-defined outcomes.
The sample space for an experiment is the set of
all experimental outcomes.
An experimental outcome is also called a sample
point.

Experiment              Experiment Outcomes
Toss a coin             Head, tail
Inspect a part          Defective, non-defective
Conduct a sales call    Purchase, no purchase
Roll a die              1, 2, 3, 4, 5, 6
Play a football game    Win, lose, tie

Example: Bradley Investments
Bradley has invested in two stocks, Markley Oil
and Collins Mining. Bradley has determined that the
possible outcomes of these investments three months
from now are as follows.
Investment Gain or Loss in 3 Months (in $000)
Markley Oil    Collins Mining
10              8
5              -2
0
-20
A Counting Rule for

Multiple-Step Experiments
If an experiment consists of a sequence of k steps
in which there are $n_1$ possible results for the first step,
$n_2$ possible results for the second step, and so on,
then the total number of experimental outcomes is
given by $(n_1)(n_2) \cdots (n_k)$.
A helpful graphical representation of a multiple-step
experiment is a tree diagram.
A Counting Rule for

Multiple-Step Experiments
Bradley Investments can be viewed as a two-step
experiment. It involves two stocks, each with a set of
experimental outcomes.
Markley Oil: $n_1 = 4$
Collins Mining: $n_2 = 2$
Total Number of Experimental Outcomes: $n_1 n_2 = (4)(2) = 8$
Tree Diagram
Markley Oil (Stage 1)    Collins Mining (Stage 2)    Experimental Outcome    Net Result
Gain 10                  Gain 8                      (10, 8)                 Gain $18,000
Gain 10                  Lose 2                      (10, -2)                Gain $8,000
Gain 5                   Gain 8                      (5, 8)                  Gain $13,000
Gain 5                   Lose 2                      (5, -2)                 Gain $3,000
Even                     Gain 8                      (0, 8)                  Gain $8,000
Even                     Lose 2                      (0, -2)                 Lose $2,000
Lose 20                  Gain 8                      (-20, 8)                Lose $12,000
Lose 20                  Lose 2                      (-20, -2)               Lose $22,000
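The eight branches of the tree can be enumerated with `itertools.product`, which pairs every Stage 1 result with every Stage 2 result. This is an illustrative sketch; the variable names are chosen here.

```python
from itertools import product

markley = [10, 5, 0, -20]   # Stage 1: Markley Oil gain or loss (in $000)
collins = [8, -2]           # Stage 2: Collins Mining gain or loss (in $000)

# Every (Markley, Collins) pair is one experimental outcome.
outcomes = list(product(markley, collins))

for m, c in outcomes:
    net = m + c
    label = "Gain" if net > 0 else "Lose"
    print((m, c), f"{label} ${abs(net)},000")
```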
Counting Rule for Combinations

Number of Combinations of N Objects
Taken n at a Time
A second useful counting rule enables us to count
the number of experimental outcomes when n objects
are to be selected from a set of N objects.
$$C_n^N = \binom{N}{n} = \frac{N!}{n!\,(N - n)!}$$
where:
$N! = N(N - 1)(N - 2) \cdots (2)(1)$
$n! = n(n - 1)(n - 2) \cdots (2)(1)$
$0! = 1$
Counting Rule for Permutations

Number of Permutations of N Objects
Taken n at a Time
A third useful counting rule enables us to count
the number of experimental outcomes when n
objects are to be selected from a set of N objects,
where the order of selection is important.
$$P_n^N = n!\binom{N}{n} = \frac{N!}{(N - n)!}$$
where:
$N! = N(N - 1)(N - 2) \cdots (2)(1)$
$n! = n(n - 1)(n - 2) \cdots (2)(1)$
$0! = 1$
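Both counting rules are available in Python's standard library as `math.comb` and `math.perm` (Python 3.8 and later). The values N = 5 and n = 2 below are a hypothetical example, not taken from the slides.

```python
import math

N, n = 5, 2   # hypothetical: select 2 objects from a set of 5

combinations = math.comb(N, n)   # order of selection does not matter
permutations = math.perm(N, n)   # order of selection matters

# The same counts written out with factorials.
comb_by_formula = math.factorial(N) // (math.factorial(n) * math.factorial(N - n))
perm_by_formula = math.factorial(N) // math.factorial(N - n)

print(combinations, permutations)
```

Each combination of n objects can be ordered in n! ways, which is why the permutation count is n! times the combination count.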
Assigning Probabilities
Basic Requirements for Assigning Probabilities
1. The probability assigned to each experimental
outcome must be between 0 and 1, inclusively.
$$0 \le P(E_i) \le 1 \text{ for all } i$$
where:
$E_i$ is the ith experimental outcome
and $P(E_i)$ is its probability
Basic Requirements for Assigning Probabilities
2. The sum of the probabilities for all experimental
outcomes must equal 1.
P(E1) + P(E2) + . . . + P(En) = 1
where:
n is the number of experimental outcomes
Classical Method
Assigning probabilities based on the assumption
of equally likely outcomes
Relative Frequency Method
Assigning probabilities based on experimentation
or historical data
Subjective Method
Assigning probabilities based on judgment
Classical Method
Example: Rolling a Die
If an experiment has n possible outcomes, the
classical method would assign a probability of 1/n
to each outcome.
Experiment: Rolling a die
Sample Space: S = {1, 2, 3, 4, 5, 6}
Probabilities: Each sample point has a
1/6 chance of occurring

Example: Lucas Tool Rental
Lucas Tool Rental would like to assign probabilities
to the number of car polishers it rents each day.
Office records show the following frequencies of daily
rentals for the last 40 days.
Number of Polishers Rented    Number of Days
0                              4
1                              6
2                             18
3                             10
4                              2

Example: Lucas Tool Rental
Each probability assignment is given by dividing
the frequency (number of days) by the total frequency
(total number of days).
Number of Polishers Rented    Number of Days    Probability
0                              4                .10  (= 4/40)
1                              6                .15
2                             18                .45
3                             10                .25
4                              2                .05
Total                         40                1.00
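The relative frequency method for the Lucas Tool Rental data amounts to one division per outcome, as this sketch shows; the variable names are chosen here.

```python
# Number of days (out of the last 40) on which 0, 1, 2, 3, or 4 polishers were rented.
days_by_rentals = {0: 4, 1: 6, 2: 18, 3: 10, 4: 2}

total_days = sum(days_by_rentals.values())

# Relative frequency method: frequency divided by total frequency.
probabilities = {k: days / total_days for k, days in days_by_rentals.items()}

for rented, p in probabilities.items():
    print(rented, f"{p:.2f}")
```

The resulting values satisfy both basic requirements: each lies between 0 and 1, and together they sum to 1.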
Subjective Method
When economic conditions and a company's
circumstances change rapidly it might be
inappropriate to assign probabilities based solely on
historical data.
We can use any data available as well as our
experience and intuition, but ultimately a probability
value should express our degree of belief that the
experimental outcome will occur.
The best probability estimates often are obtained by
combining the estimates from the classical or relative
frequency approach with the subjective estimate.
Subjective Method
An analyst made the following probability estimates.
Exper. Outcome    Net Gain or Loss    Probability
(10, 8)           $18,000 Gain        .20
(10, -2)          $8,000 Gain         .08
(5, 8)            $13,000 Gain        .16
(5, -2)           $3,000 Gain         .26
(0, 8)            $8,000 Gain         .10
(0, -2)           $2,000 Loss         .12
(-20, 8)          $12,000 Loss        .02
(-20, -2)         $22,000 Loss        .06
Events and Their Probabilities

An event is a collection of sample points.
The probability of any event is equal to the sum of
the probabilities of the sample points in the event.
If we can identify all the sample points of an
experiment and assign a probability to each, we
can compute the probability of an event.

Event M = Markley Oil Profitable
M = {(10, 8), (10, -2), (5, 8), (5, -2)}
P(M) = P(10, 8) + P(10, -2) + P(5, 8) + P(5, -2)
= .20 + .08 + .16 + .26
= .70

Event C = Collins Mining Profitable
C = {(10, 8), (5, 8), (0, 8), (-20, 8)}
P(C) = P(10, 8) + P(5, 8) + P(0, 8) + P(-20, 8)
= .20 + .16 + .10 + .02
= .48
Some Basic Relationships of Probability

There are some basic probability relationships that
can be used to compute the probability of an event
without knowledge of all the sample point probabilities.
Complement of an Event
Union of Two Events
Intersection of Two Events
Mutually Exclusive Events
Complement of an Event
The complement of event A is defined to be the event
consisting of all sample points that are not in A.
The complement of A is denoted by $A^c$.
[Figure: Venn diagram of event A and its complement $A^c$ within sample space S]
Union of Two Events

The union of events A and B is the event containing
all sample points that are in A or B or both.
The union of events A and B is denoted by $A \cup B$.
[Figure: Venn diagram of events A and B within sample space S, with both regions shaded]
Union of Two Events

Event M = Markley Oil Profitable
Event C = Collins Mining Profitable
$M \cup C$ = Markley Oil Profitable
or Collins Mining Profitable (or both)
$M \cup C$ = {(10, 8), (10, -2), (5, 8), (5, -2), (0, 8), (-20, 8)}
$P(M \cup C)$ = P(10, 8) + P(10, -2) + P(5, 8) + P(5, -2)
+ P(0, 8) + P(-20, 8)
= .20 + .08 + .16 + .26 + .10 + .02
= .82

The intersection of events A and B is the set of all
sample points that are in both A and B.
The intersection of events A and B is denoted by $A \cap B$.
[Figure: Venn diagram of events A and B within sample space S; the overlapping region is the intersection of A and B]

Event M = Markley Oil Profitable
Event C = Collins Mining Profitable
$M \cap C$ = Markley Oil Profitable
and Collins Mining Profitable
$M \cap C$ = {(10, 8), (5, 8)}
$P(M \cap C)$ = P(10, 8) + P(5, 8)
= .20 + .16
= .36
Addition Law
The addition law provides a way to compute the
probability of event A, or B, or both A and B occurring.
The law is written as:
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
Addition Law
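Event M = Markley Oil Profitable
Event C = Collins Mining Profitable
$M \cup C$ = Markley Oil Profitable
or Collins Mining Profitable
We know: P(M) = .70, P(C) = .48, $P(M \cap C)$ = .36
Thus: $P(M \cup C) = P(M) + P(C) - P(M \cap C)$
= .70 + .48 - .36
= .82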
(This result is the same as that obtained earlier
using the definition of the probability of an event.)
115
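As a rough check of the addition law, the sketch below builds the union and intersection with Python set operations and compares both sides of the law. It assumes losses are written as negative outcomes and uses only the sample point probabilities quoted in this section; `p` and `prob` are illustrative names.

```python
# Sample point probabilities quoted in this section (losses assumed negative).
p = {
    (10, 8): 0.20, (10, -2): 0.08,
    (5, 8): 0.16, (5, -2): 0.26,
    (0, 8): 0.10, (-20, 8): 0.02,
}

def prob(event):
    """Probability of an event as the sum over its sample points."""
    return sum(p[point] for point in event)

M = {(10, 8), (10, -2), (5, 8), (5, -2)}   # Markley Oil profitable
C = {(10, 8), (5, 8), (0, 8), (-20, 8)}    # Collins Mining profitable

union = M | C          # sample points in M or C or both
intersection = M & C   # sample points in both M and C

direct = prob(union)                                # definition of P(M ∪ C)
by_law = prob(M) + prob(C) - prob(intersection)     # addition law

print(sorted(intersection))                 # [(5, 8), (10, 8)]
print(round(direct, 2), round(by_law, 2))   # 0.82 0.82
```

Subtracting P(M ∩ C) removes the sample points (10, 8) and (5, 8) that would otherwise be counted twice.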

Mutually Exclusive Events

Two events are said to be mutually exclusive if the
events have no sample points in common.
Two events are mutually exclusive if, when one event
occurs, the other cannot occur.
[Venn diagram: non-overlapping events A and B within sample space S]

If events A and B are mutually exclusive, P(A ∩ B) = 0.
The addition law for mutually exclusive events is:
P(A ∪ B) = P(A) + P(B)
(There is no need to include "- P(A ∩ B)".)
Conditional Probability
The probability of an event given that another event
has occurred is called a conditional probability.
The conditional probability of A given B is denoted
by P(A|B).
A conditional probability is computed as follows:
P(A|B) = P(A ∩ B) / P(B)
Conditional Probability
Event M = Markley Oil Profitable
Event C = Collins Mining Profitable
P(C|M) = Collins Mining Profitable
given Markley Oil Profitable
We know: P(M ∩ C) = .36, P(M) = .70
Thus: P(C|M) = P(M ∩ C) / P(M) = .36/.70 = .5143
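The same calculation as a minimal Python sketch; the function name `conditional_probability` is illustrative, and the guard against a zero-probability condition is an added assumption rather than something stated on the slide.

```python
def conditional_probability(p_joint, p_condition):
    """Return P(A|B) = P(A ∩ B) / P(B); undefined when P(B) is zero."""
    if p_condition <= 0:
        raise ValueError("the conditioning event must have positive probability")
    return p_joint / p_condition

# P(C|M): Collins Mining profitable given Markley Oil profitable.
p_c_given_m = conditional_probability(0.36, 0.70)
print(round(p_c_given_m, 4))  # 0.5143
```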
Multiplication Law
The multiplication law provides a way to compute the
probability of the intersection of two events.
P(A ∩ B) = P(B)P(A|B)
Multiplication Law
Event M = Markley Oil Profitable
Event C = Collins Mining Profitable
M ∩ C = Markley Oil Profitable
and Collins Mining Profitable
We know: P(M) = .70, P(C|M) = .5143
Thus: P(M ∩ C) = P(M)P(C|M)
= (.70)(.5143)
= .36
(This result is the same as that obtained earlier
using the definition of the probability of an event.)
Joint Probability Table

                         Collins Mining
Markley Oil              Profitable (C)   Not Profitable (Cc)   Total
Profitable (M)                .36                .34              .70
Not Profitable (Mc)           .12                .18              .30
Total                         .48                .52             1.00

Joint probabilities appear in the body of the table.
Marginal probabilities appear in the margins of the table.
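One way to see how the margins follow from the body of the table is to store the four joint probabilities and sum across rows or columns. This is a sketch only; the dictionary layout and the helper name `marginal` are assumptions made for illustration.

```python
# Joint probabilities from the body of the table,
# keyed as (Markley Oil event, Collins Mining event).
joint = {
    ("M", "C"): 0.36, ("M", "Cc"): 0.34,
    ("Mc", "C"): 0.12, ("Mc", "Cc"): 0.18,
}

def marginal(joint_table, label, axis):
    """Sum the joint probabilities whose key matches `label` at position `axis`."""
    return sum(pr for key, pr in joint_table.items() if key[axis] == label)

p_m = marginal(joint, "M", 0)   # row total for Markley Oil profitable
p_c = marginal(joint, "C", 1)   # column total for Collins Mining profitable

print(round(p_m, 2), round(p_c, 2))    # 0.7 0.48
print(round(sum(joint.values()), 2))   # 1.0
```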
Independent Events
If the probability of event A is not changed by the
existence of event B, we would say that events A
and B are independent.
Two events A and B are independent if:
P(A|B) = P(A)
or
P(B|A) = P(B)
Multiplication Law
for Independent Events
The multiplication law also can be used as a test to see
if two events are independent.
P(A ∩ B) = P(A)P(B)
Multiplication Law
for Independent Events
Event M = Markley Oil Profitable
Event C = Collins Mining Profitable
Are events M and C independent?
Does P(M ∩ C) = P(M)P(C)?
We know: P(M ∩ C) = .36, P(M) = .70, P(C) = .48
But: P(M)P(C) = (.70)(.48) = .34, not .36
Hence: M and C are not independent.
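The independence test can be sketched in Python as a comparison of P(A ∩ B) with P(A)P(B). The function name `are_independent` and the tolerance are assumptions; a tolerance is used because floating-point products are rarely exactly equal.

```python
import math

def are_independent(p_a, p_b, p_a_and_b, tol=1e-9):
    """Check whether P(A ∩ B) equals P(A)P(B) within a small tolerance."""
    return math.isclose(p_a_and_b, p_a * p_b, abs_tol=tol)

print(round(0.70 * 0.48, 3))               # 0.336 (shown as .34 on the slide)
print(are_independent(0.70, 0.48, 0.36))   # False
```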
Mutual Exclusiveness and Independence

Do not confuse the notion of mutually exclusive
events with that of independent events.
Two events with nonzero probabilities cannot be
both mutually exclusive and independent.
If one mutually exclusive event is known to occur,
the other cannot occur; thus, the probability of the
other event occurring is reduced to zero (and they
are therefore dependent).
Two events that are not mutually exclusive might
or might not be independent.
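A small illustration of the distinction, using a hypothetical example that is not part of the slides: one roll of a fair six-sided die, with A = {1, 2} and B = {3, 4}. The events share no sample points, so P(A ∩ B) = 0, while P(A)P(B) is positive, so they cannot be independent.

```python
from fractions import Fraction

# Hypothetical example: one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {1, 2}
B = {3, 4}

print(prob(A & B))         # 0    (mutually exclusive)
print(prob(A) * prob(B))   # 1/9  (not equal to P(A ∩ B), so dependent)
```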
Bayes' Theorem
Often we begin probability analysis with initial or
prior probabilities.
Then, from a sample, special report, or a product
test, we obtain some additional information.
Given this information, we calculate revised or
posterior probabilities.
Bayes' theorem provides the means for revising the
prior probabilities.
Prior Probabilities → New Information → Application of Bayes' Theorem → Posterior Probabilities
Bayes' Theorem
Example: L. S. Clothiers
A proposed shopping center will provide strong
competition for downtown businesses like L. S.
Clothiers. If the shopping center is built, the owner
of L. S. Clothiers feels it would be best to relocate to
the shopping center.
The shopping center cannot be built unless a
zoning change is approved by the town council.
The planning board must first make a
recommendation, for or against the zoning change,
to the council.
Prior Probabilities
Let:
A1 = town council approves the zoning change
A2 = town council disapproves the change
Using subjective judgment:
P(A1) = .7, P(A2) = .3
New Information
The planning board has recommended against
the zoning change. Let B denote the event of a
negative recommendation by the planning board.
Given that B has occurred, should L. S. Clothiers
revise the probabilities that the town council will
approve or disapprove the zoning change?
Conditional Probabilities
Past history with the planning board and the town
council indicates the following:
P(B|A1) = .2
P(B|A2) = .9
Hence:
P(Bc|A1) = .8
P(Bc|A2) = .1
Tree Diagram
Town Council     Planning Board     Experimental Outcomes
P(A1) = .7       P(B|A1) = .2       P(A1 ∩ B) = .14
                 P(Bc|A1) = .8      P(A1 ∩ Bc) = .56
P(A2) = .3       P(B|A2) = .9       P(A2 ∩ B) = .27
                 P(Bc|A2) = .1      P(A2 ∩ Bc) = .03
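Each experimental outcome in the tree is the product of the probabilities along its branch. A minimal sketch, assuming the priors and conditional probabilities given for the L. S. Clothiers example; `tree_outcomes` is an illustrative name.

```python
def tree_outcomes(priors, p_b_given):
    """Multiply along each branch: P(Ai ∩ B) and P(Ai ∩ Bc) for every Ai."""
    outcomes = {}
    for event, prior in priors.items():
        outcomes[(event, "B")] = prior * p_b_given[event]
        outcomes[(event, "Bc")] = prior * (1 - p_b_given[event])
    return outcomes

outcomes = tree_outcomes({"A1": 0.7, "A2": 0.3}, {"A1": 0.2, "A2": 0.9})
for branch, probability in outcomes.items():
    print(branch, round(probability, 2))
# ('A1', 'B') 0.14
# ('A1', 'Bc') 0.56
# ('A2', 'B') 0.27
# ('A2', 'Bc') 0.03
```

The four outcome probabilities sum to 1, as they must for a complete tree.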
Bayes' Theorem
To find the posterior probability that event Ai will
occur given that event B has occurred, we apply
Bayes' theorem:
P(Ai|B) = P(Ai)P(B|Ai) / [P(A1)P(B|A1) + P(A2)P(B|A2) + ... + P(An)P(B|An)]
Bayes' theorem is applicable when the events for
which we want to compute posterior probabilities
are mutually exclusive and their union is the entire
sample space.
Posterior Probabilities
Given the planning board's recommendation not
to approve the zoning change, we revise the prior
probabilities as follows:
P(A1|B) = P(A1)P(B|A1) / [P(A1)P(B|A1) + P(A2)P(B|A2)]
= (.7)(.2) / [(.7)(.2) + (.3)(.9)]
= .34
Posterior Probabilities
The planning board's recommendation is good
news for L. S. Clothiers. The posterior probability of
the town council approving the zoning change is .34
compared to a prior probability of .70.
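The revision of the prior can be sketched as a small Python function. It assumes the events are mutually exclusive and together cover the sample space, as Bayes' theorem requires; `bayes_posterior` and the dictionary names are illustrative.

```python
def bayes_posterior(target, priors, likelihoods):
    """Return P(target|B) from priors P(Ai) and likelihoods P(B|Ai)."""
    # Denominator: P(B) as the sum of P(Ai)P(B|Ai) over all events.
    p_b = sum(priors[event] * likelihoods[event] for event in priors)
    return priors[target] * likelihoods[target] / p_b

priors = {"A1": 0.7, "A2": 0.3}        # council approves / disapproves
likelihoods = {"A1": 0.2, "A2": 0.9}   # P(B|Ai), negative recommendation

print(round(bayes_posterior("A1", priors, likelihoods), 2))  # 0.34
print(round(bayes_posterior("A2", priors, likelihoods), 2))  # 0.66
```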
Bayes' Theorem: Tabular Approach

Step 1
Prepare the following three columns:
Column 1 The mutually exclusive events for
which posterior probabilities are desired.
Column 2 The prior probabilities for the events.
Column 3 The conditional probabilities of the
new information given each event.

Step 1
(1)        (2)              (3)              (4)              (5)
           Prior            Conditional
Events     Probabilities    Probabilities
Ai         P(Ai)            P(B|Ai)
A1         .7               .2
A2         .3               .9
           1.0

Step 2
Prepare the fourth column:
Column 4
Compute the joint probabilities for each event and
the new information B by using the multiplication
law.
Multiply the prior probabilities in column 2 by
the corresponding conditional probabilities in
column 3. That is, P(Ai ∩ B) = P(Ai) P(B|Ai).

Step 2
(1)        (2)              (3)              (4)              (5)
           Prior            Conditional      Joint
Events     Probabilities    Probabilities    Probabilities
Ai         P(Ai)            P(B|Ai)          P(Ai ∩ B)
A1         .7               .2               .14  (= .7 x .2)
A2         .3               .9               .27
           1.0

Step 2 (continued)
We see that there is a .14 probability of the town
council approving the zoning change and a
negative recommendation by the planning board.
There is a .27 probability of the town council
disapproving the zoning change and a negative
recommendation by the planning board.

Step 3
Sum the joint probabilities in Column 4. The
sum is the probability of the new information,
P(B). The sum .14 + .27 shows an overall
probability of .41 of a negative recommendation
by the planning board.

Step 3
(1)        (2)              (3)              (4)              (5)
           Prior            Conditional      Joint
Events     Probabilities    Probabilities    Probabilities
Ai         P(Ai)            P(B|Ai)          P(Ai ∩ B)
A1         .7               .2               .14
A2         .3               .9               .27
           1.0                               P(B) = .41

Step 4
Prepare the fifth column:
Column 5
Compute the posterior probabilities using the
basic relationship of conditional probability:
P(Ai|B) = P(Ai ∩ B) / P(B)
The joint probabilities P(Ai ∩ B) are in column 4
and the probability P(B) is the sum of column 4.

Step 4
(1)        (2)              (3)              (4)              (5)
           Prior            Conditional      Joint            Posterior
Events     Probabilities    Probabilities    Probabilities    Probabilities
Ai         P(Ai)            P(B|Ai)          P(Ai ∩ B)        P(Ai|B)
A1         .7               .2               .14              .3415  (= .14/.41)
A2         .3               .9               .27              .6585
           1.0                               P(B) = .41       1.0000
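The four steps of the tabular approach map directly onto a few lines of Python. This is a sketch under the example's numbers; `bayes_table` and its argument names are illustrative, and the printed layout is only one possible rendering of the five columns.

```python
def bayes_table(events, priors, conditionals):
    """Build the five-column table and return (rows, P(B))."""
    # Step 2: column 4 = column 2 x column 3 (multiplication law).
    joints = [prior * cond for prior, cond in zip(priors, conditionals)]
    # Step 3: P(B) is the sum of column 4.
    p_b = sum(joints)
    # Step 4: column 5 = column 4 / P(B).
    rows = [
        (event, prior, cond, joint, joint / p_b)
        for event, prior, cond, joint in zip(events, priors, conditionals, joints)
    ]
    return rows, p_b

# Step 1: events, prior probabilities, conditional probabilities.
rows, p_b = bayes_table(["A1", "A2"], [0.7, 0.3], [0.2, 0.9])

for event, prior, cond, joint, posterior in rows:
    print(f"{event}  {prior:.1f}  {cond:.1f}  {joint:.2f}  {posterior:.4f}")
print(f"P(B) = {p_b:.2f}")
# A1  0.7  0.2  0.14  0.3415
# A2  0.3  0.9  0.27  0.6585
# P(B) = 0.41
```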
