Instructor:
Email: yedeshi@zju.edu.cn
Treatment of data
Graphs: Pareto Diagram, Dot Diagram, Box-plot
Frequency distribution, Stem-and-leaf Displays
Course information
What is it for?
This course provides an elementary
introduction to probability with applications.
Topics include:
axioms of probability;
basic probability concepts and models (counting
methods, conditional probability, Bayes'
theorem, etc.);
random variables; independence;
discrete and continuous probability distributions;
mathematical expectation and variance;
limit theory
Course Goals
At the end of the course, students should
be able to do the following:
1) Understand the concepts and methods
of probability theory
References:
1) A First Course in Probability (6th Ed.), Sheldon Ross.
China Statistics Press.
2) Probability & Statistics for Engineers & Scientists
(7th Ed.), R.E. Walpole, R.H. Myers, S.L. Myers, K. Ye.
Tsinghua or Pearson Education Press.
Grading
Grades for the course will be based
on the following weighting:
1) Class attendance and Homework
assignment: 36%
2) Unit quiz: 24% (12%, 12%)
3) Final exam: 40%
Homework
1) You may collaborate on homework, but
you must write your submitted work in your
own words. All steps are required, including
calculations, derivations, and proofs.
Solutions will be posted on the
class web site.
http://www.cs.zju.edu.cn/people/yedeshi/software/MiniTAB14.iso
Probability in CS
Randomized algorithms
Queueing Theory
Software testing
Stable: 217.5
Good quality:
[215, 220]
Ch2: Treatment of data
Outline
Pareto diagrams, dot diagrams
Histograms (Frequency distributions)
Stem-and-leaf display
Box-plot (Quartiles and Percentiles)
Calculation of the mean x̄ and standard
deviation s
What it is –
Descriptive statistics
Descriptive statistics include the numbers, tables, charts,
and graphs used to describe, organize, summarize, and
present raw data.
central tendency (location) of data, i.e. where data tend to
fall, as measured by the mean, median, and mode.
dispersion (variability) of data, i.e. how spread out data are,
as measured by the variance and its square root, the standard
deviation.
skew (symmetry) of data, i.e. how concentrated data are at the
low or high end of the scale, as measured by the skew index.
kurtosis (peakedness) of data, i.e. how concentrated data are
around a single value, as measured by the kurtosis index.
Pareto Diagram
A Pareto diagram orders each type of
failure or defect according to its frequency.
[Figure: Pareto chart of Soft-defect with Count and Percent axes; the largest category, Requirements, accounts for 56%.]

Soft-defect   Requirement   design   other    Code
Count                0.56     0.27    0.10    0.07
Percent              56.0     27.0    10.0     7.0
Cum %                56.0     83.0    93.0   100.0
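The ordering and the cumulative percentages in a Pareto diagram can be reproduced with a few lines of Python. This is only a sketch: it assumes the four categories pair with the four percentages in the order they appear in the chart, and the function name `pareto_table` is made up for illustration.

```python
# Defect categories and their percentages, paired in the order shown in the
# Soft-defect chart (the pairing is an assumption based on that order).
defects = {"Requirement": 56.0, "design": 27.0, "other": 10.0, "Code": 7.0}

def pareto_table(freqs):
    """Return (category, value, cumulative value) rows, largest value first."""
    rows = []
    running = 0.0
    for name, value in sorted(freqs.items(), key=lambda kv: kv[1], reverse=True):
        running += value
        rows.append((name, value, running))
    return rows

table = pareto_table(defects)
for name, pct, cum in table:
    print(f"{name:<12}{pct:6.1f}{cum:7.1f}")
```

Sorting by frequency first is what distinguishes a Pareto diagram from an ordinary bar chart: the few categories that account for most defects end up on the left.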
Dot diagram
As a second step in improving the quality of the lathe,
data were collected on the deviations of the cutting
speed from the target value set by the controller.
Ex. Cutting speed – target speed:
3 6 –2 4 7 4
Dot diagram: A number line in which one dot is placed
above a value on the number line for each occurrence
of that value. That is, one dot means the value
occurred once, three dots mean the value occurred
three times, etc.
In Minitab: stat->dotplots->simple
Dot diagram
This diagram visually summarizes the
information that the lathe is generally
running fast.
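Outside Minitab, a dot diagram for the six deviations can be sketched as plain text. The snippet below is a minimal illustration, not the Minitab procedure; the names `dot_diagram` and `fast_share` are hypothetical.

```python
from collections import Counter

# Deviations (cutting speed minus target speed) from the example.
deviations = [3, 6, -2, 4, 7, 4]

def dot_diagram(values):
    """Return one text row per integer on the number line, one dot per occurrence."""
    counts = Counter(values)
    lines = []
    for v in range(min(values), max(values) + 1):
        lines.append(f"{v:>3} | " + "." * counts[v])
    return lines

for line in dot_diagram(deviations):
    print(line)

# Share of positive deviations: values above 0 mean the lathe ran faster than target.
fast_share = sum(1 for d in deviations if d > 0) / len(deviations)
print(fast_share)
```

Five of the six deviations are positive, which is the pattern the diagram makes visible at a glance.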
Multiple samples
C1: 0.27 0.35 0.37
C2: 0.23 0.15 0.25 0.24 0.30 0.33 0.26
[Figure: Dotplot of C1, C2; horizontal axis "Data" from 0.15 to 0.36, one row of dots per variable.]
Frequency distributions
A frequency distribution is a
tabular arrangement of data in which
the data are grouped into
intervals and the number of
observations that belong to each
interval is counted.
Data presented in this manner
are known as grouped data.
Data001.
80 measurements of the emission (in tons) of
sulfur oxides from an industrial plant
15.8 26.4 17.3 11.2 23.9 24.8 18.7 13.9  9.0 13.2
22.7  9.8  6.2 14.7 17.5 26.1 12.8 28.6 17.6 23.7
26.8 22.7 18.0 20.5 11.0 20.9 15.5 19.4 16.7 10.7
19.1 15.2 22.9 26.6 20.4 21.4 19.2 21.6 16.9 19.0
18.5 23.0 24.6 20.1 16.2 18.0  7.7 13.5 23.5 14.5
14.4 29.6 19.4 17.0 20.8 24.3 22.5 24.6 18.4 18.1
 8.3 21.9 12.3 22.3 13.3 11.8 19.3 20.0 25.7 31.8
25.9 10.5 15.9 27.5 18.1 17.9  9.4 24.1 20.1 28.5
Class limits & frequency
Class limits     Frequency
 5.0 –  8.9          3
 9.0 – 12.9         10
13.0 – 16.9         14
17.0 – 20.9         25
21.0 – 24.9         17
25.0 – 28.9          9
29.0 – 32.9          2
Total               80
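The frequency table can be checked by counting the 80 emission values class by class. The sketch below assumes each class starts at its lower class limit and runs up to the next lower limit; `class_frequencies` is an illustrative helper, not part of any library.

```python
from bisect import bisect_right

# The 80 emission measurements (Data001), in tons.
emissions = [
    15.8, 26.4, 17.3, 11.2, 23.9, 24.8, 18.7, 13.9, 9.0, 13.2,
    22.7, 9.8, 6.2, 14.7, 17.5, 26.1, 12.8, 28.6, 17.6, 23.7,
    26.8, 22.7, 18.0, 20.5, 11.0, 20.9, 15.5, 19.4, 16.7, 10.7,
    19.1, 15.2, 22.9, 26.6, 20.4, 21.4, 19.2, 21.6, 16.9, 19.0,
    18.5, 23.0, 24.6, 20.1, 16.2, 18.0, 7.7, 13.5, 23.5, 14.5,
    14.4, 29.6, 19.4, 17.0, 20.8, 24.3, 22.5, 24.6, 18.4, 18.1,
    8.3, 21.9, 12.3, 22.3, 13.3, 11.8, 19.3, 20.0, 25.7, 31.8,
    25.9, 10.5, 15.9, 27.5, 18.1, 17.9, 9.4, 24.1, 20.1, 28.5,
]

# Lower class limits from the table; every class is 4.0 wide.
lower_limits = [5.0, 9.0, 13.0, 17.0, 21.0, 25.0, 29.0]

def class_frequencies(data, lowers):
    """Count how many observations fall in each class, given sorted lower limits."""
    counts = [0] * len(lowers)
    for x in data:
        idx = bisect_right(lowers, x) - 1  # last class whose lower limit is <= x
        if idx < 0:
            raise ValueError(f"{x} is below the first class")
        counts[idx] += 1  # note: the last class is treated as open above
    return counts

frequencies = class_frequencies(emissions, lower_limits)
print(frequencies)
```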
Class limit and width
lower class limit: The smallest value that
can belong to a given interval
1. Graph->histogram->simple
2. Graph variables: c4 (all data in one column)
3. Edit bars: click the bars in the output figure; in
Binning, select midpoint for Interval type and
midpoint/cutpoint for Interval definition, and then enter
7 11 15 19 23 27 31 as the midpoints.
Density histogram
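When a histogram is constructed from a
frequency table having classes of unequal
lengths, the height of each rectangle must
be changed to
$$\text{height} = \frac{\text{relative frequency}}{\text{class width}}$$
so that the area of each rectangle equals the relative frequency of its class.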
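As a rough illustration of the usual convention, height = relative frequency / class width, the sketch below rescales a table with one wider class. The unequal table is hypothetical: it reuses the emission frequencies but merges the last two classes (25.0–28.9 and 29.0–32.9) into one class of width 8, which the original slides do not do.

```python
# (lower limit, class width, frequency); the final merged class is hypothetical.
classes = [
    (5.0, 4.0, 3),
    (9.0, 4.0, 10),
    (13.0, 4.0, 14),
    (17.0, 4.0, 25),
    (21.0, 4.0, 17),
    (25.0, 8.0, 11),  # 9 + 2 observations in a class twice as wide
]

def density_heights(rows):
    """Rectangle heights so that each area equals the class's relative frequency."""
    total = sum(freq for _, _, freq in rows)
    return [(freq / total) / width for _, width, freq in rows]

heights = density_heights(classes)

# With density heights, the rectangle areas add up to 1.
total_area = sum(h * width for h, (_, width, _) in zip(heights, classes))
print(heights, total_area)
```

Without the division by width, the merged class would be drawn with a bar of height 11 and would look more important than the narrower class holding 9 observations next to it.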
Example: 11 numbers:
12, 13, 21, 27, 33, 34, 35, 37, 40, 40, 41
Frequency table
Class limits     Frequency
10 – 19              2
20 – 29              2
30 – 39              4
40 – 49              3
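Stem-and-leaf
Minitab stem-and-leaf display of the emission data (leaf unit = 1.0).
The columns are: cumulative count (depth), stem, leaves.
   2   0  67
   6   0  8999
  11   1  00111
  17   1  223333
  24   1  4445555
  32   1  66677777
 (13)  1  8888888999999
  35   2  0000000111
  25   2  222223333
  16   2  4444455
   9   2  66667
   4   2  889
   1   3  1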
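For the 11 numbers in the earlier example, a basic stem-and-leaf display (tens digit as stem, units digit as leaf) can be built as follows. This is a simplified sketch without the depth column that Minitab prints; `stem_and_leaf` is an illustrative name.

```python
from collections import defaultdict

# The 11 numbers from the frequency-table example.
numbers = [12, 13, 21, 27, 33, 34, 35, 37, 40, 40, 41]

def stem_and_leaf(values):
    """Group sorted two-digit values by tens digit (stem); leaves are the units digits."""
    display = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        display[stem].append(leaf)
    return dict(display)

display = stem_and_leaf(numbers)
for stem, leaves in display.items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
```

The number of leaves on each stem matches the frequency table (2, 2, 4, 3), but unlike the table the display keeps every original value.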
Ch2.5: Descriptive measures
Mean: the sum of the observations divided
by the sample size:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Median: the center, or location, of a set of
data. If the observations are arranged in an
ascending or descending order:
If the number of observations is odd, the
median is the middle value.
If the number of observations is even, the
median is the average of the two middle values.
Example
15 14 2 27 13
Mean:
$$\bar{x} = \frac{15 + 14 + 2 + 27 + 13}{5} = 14.2$$
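The mean and the odd/even rule for the median translate directly into code. The following is a minimal sketch using the five observations of the example; the four-value sample used for the even case is made up for illustration.

```python
def mean(values):
    """Sum of the observations divided by the sample size."""
    return sum(values) / len(values)

def median(values):
    """Middle value if n is odd, average of the two middle values if n is even."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2

sample = [15, 14, 2, 27, 13]          # observations from the example
sample_mean = mean(sample)
sample_median = median(sample)        # sorted: 2, 13, 14, 15, 27

even_sample = [15, 14, 2, 27]         # hypothetical even-sized sample
even_median = median(even_sample)     # sorted: 2, 14, 15, 27

print(sample_mean, sample_median, even_median)
```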
Variance s²:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}, \qquad s^2 = \frac{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}{n(n-1)}$$
Standard deviation s:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$
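To see that the definition of the sample variance and its computational shortcut agree, both can be evaluated on the five observations 15, 14, 2, 27, 13 from the mean example. This is a sketch with illustrative function names; both versions divide by n - 1.

```python
import math

def variance_definition(values):
    """s^2 = sum((x_i - mean)^2) / (n - 1)."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def variance_shortcut(values):
    """s^2 = (n * sum(x_i^2) - (sum(x_i))^2) / (n * (n - 1))."""
    n = len(values)
    total = sum(values)
    total_sq = sum(x * x for x in values)
    return (n * total_sq - total ** 2) / (n * (n - 1))

sample = [15, 14, 2, 27, 13]
s2_def = variance_definition(sample)     # 314.8 / 4
s2_short = variance_shortcut(sample)     # (5 * 1323 - 71^2) / 20
s = math.sqrt(s2_def)                    # standard deviation

print(s2_def, s2_short, s)
```

Both routes give s² = 78.7 for this sample. The shortcut needs only the running sums of x and x², which is convenient for hand calculation.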
Quartiles and Percentiles
Quartiles: values in a given set of
observations that divide the data into 4 equal
parts.
The first quartile, Q1, is a value that has one
fourth, or 25%, of the observations below its
value.
The sample 100p-th percentile is a value
such that at least 100p% of the observations
are at or below this value, and at least
100(1-p)% are at or above this value.
Example
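Example in P34 (the 80 emission observations, sorted):
If n/4 is an integer, take the average of the observations at positions n/4 and n/4 + 1; otherwise, round the position up.
$$Q_1 = \frac{14.7 + 15.2}{2} = 14.95$$
$$Q_2 = \frac{19.0 + 19.1}{2} = 19.05$$
$$Q_3 = \frac{22.9 + 23.0}{2} = 22.95$$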
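The quartiles of the emission data can be recomputed with the textbook rule: for the 100p-th percentile, if np is an integer, average the observations at sorted positions np and np + 1; otherwise round np up. The sketch below is one way to code that rule; `sample_percentile` is an illustrative name and only 0 < p < 1 is handled.

```python
import math

# The 80 emission measurements (Data001), in tons.
emissions = [
    15.8, 26.4, 17.3, 11.2, 23.9, 24.8, 18.7, 13.9, 9.0, 13.2,
    22.7, 9.8, 6.2, 14.7, 17.5, 26.1, 12.8, 28.6, 17.6, 23.7,
    26.8, 22.7, 18.0, 20.5, 11.0, 20.9, 15.5, 19.4, 16.7, 10.7,
    19.1, 15.2, 22.9, 26.6, 20.4, 21.4, 19.2, 21.6, 16.9, 19.0,
    18.5, 23.0, 24.6, 20.1, 16.2, 18.0, 7.7, 13.5, 23.5, 14.5,
    14.4, 29.6, 19.4, 17.0, 20.8, 24.3, 22.5, 24.6, 18.4, 18.1,
    8.3, 21.9, 12.3, 22.3, 13.3, 11.8, 19.3, 20.0, 25.7, 31.8,
    25.9, 10.5, 15.9, 27.5, 18.1, 17.9, 9.4, 24.1, 20.1, 28.5,
]

def sample_percentile(data, p):
    """100p-th sample percentile: average two values if n*p is an integer, else round up."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    xs = sorted(data)
    k = len(xs) * p
    if abs(k - round(k)) < 1e-9:          # n*p is (numerically) an integer
        k = int(round(k))
        return (xs[k - 1] + xs[k]) / 2    # positions k and k+1, 1-based
    return xs[math.ceil(k) - 1]           # position ceil(n*p), 1-based

q1 = sample_percentile(emissions, 0.25)
q2 = sample_percentile(emissions, 0.50)
q3 = sample_percentile(emissions, 0.75)
print(q1, q2, q3)
```

With n = 80, the positions n/4 = 20, n/2 = 40 and 3n/4 = 60 are all integers, so each quartile is an average of two neighbouring sorted observations.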
Boxplots
A boxplot is a way of summarizing the
information contained in the quartiles
(or on an interval)
Box length = interquartile range = Q3 - Q1
Quartile calculation in Minitab
The first quartile (Q1) is the observation at position
(n+1) / 4, and the third quartile (Q3) is the
observation at position 3(n+1) / 4, where n is the
number of observations. If the position is not an
integer, interpolation is used.
For example, suppose n=10. Then (10 + 1)/4 = 2.75,
and Q1 is between the second and third observations
(call them x2 and x3), three-fourths of the way up.
Thus, Q1 = x2 + 0.75(x3 - x2). Since 3(10 + 1)/4 =
8.25, Q3 = x8 + 0.25(x9 - x8), where x8 and x9 are
the eighth and ninth observations.
Choosing “Hinges” in BoxEndpoints gives the
quartiles as in the textbook.
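The position-and-interpolation rule described above for Q1 and Q3 can be sketched in Python as follows. This follows the description on the slide rather than Minitab's actual source; the function name and the sample 1, 2, ..., 10 are assumptions made for illustration, and positions that fall outside the data are clamped to the smallest or largest observation.

```python
def interpolated_quartile(data, which):
    """Q1 (which=1) or Q3 (which=3) at position which*(n+1)/4, with linear interpolation."""
    if which not in (1, 3):
        raise ValueError("which must be 1 or 3")
    xs = sorted(data)
    n = len(xs)
    position = which * (n + 1) / 4        # 1-based, possibly fractional
    lower = int(position)                 # whole part of the position
    fraction = position - lower           # how far to move toward the next observation
    if lower < 1:                         # position before the first observation
        return xs[0]
    if lower >= n:                        # position at or past the last observation
        return xs[-1]
    return xs[lower - 1] + fraction * (xs[lower] - xs[lower - 1])

observations = list(range(1, 11))         # hypothetical sample with n = 10
q1 = interpolated_quartile(observations, 1)   # position 2.75: x2 + 0.75 * (x3 - x2)
q3 = interpolated_quartile(observations, 3)   # position 8.25: x8 + 0.25 * (x9 - x8)
print(q1, q3)
```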
Modified boxplot
Modified boxplot: identify
outliers and reduce their
effect on the shape of the
boxplot.
Upper limit = Q3 + 1.5 (Q3 - Q1)
Lower limit = Q1 - 1.5 (Q3 - Q1)
[Figure: modified boxplot. A point beyond the upper limit is marked as an outlier (too far from the third quartile); the upper whisker ends at the largest observation within 1.5 (interquartile range) of the third quartile.]
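The two limits can be computed directly from the quartiles. The sketch below uses the emission-data quartiles Q1 = 14.95 and Q3 = 22.95 from the earlier example; the value 40.0 is a hypothetical observation added only to show an outlier being flagged, and the function names are illustrative.

```python
def boxplot_limits(q1, q3):
    """Lower and upper limits: Q1 - 1.5*(Q3 - Q1) and Q3 + 1.5*(Q3 - Q1)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(data, q1, q3):
    """Observations that fall outside the two limits."""
    lower, upper = boxplot_limits(q1, q3)
    return [x for x in data if x < lower or x > upper]

q1, q3 = 14.95, 22.95                     # quartiles of the emission data
lower, upper = boxplot_limits(q1, q3)

# Smallest and largest emission values, plus one hypothetical extreme value.
candidates = [6.2, 31.8, 40.0]
outliers = find_outliers(candidates, q1, q3)
print(lower, upper, outliers)
```

Since the smallest (6.2) and largest (31.8) emission values both lie inside the limits, the emission data themselves contain no outliers under this rule.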
Homework 1
Problems in Textbook (2.62,
2.67, 2.71, 2.72, 2.75), 4 points
Thanks!