*Probability & Statistics


*Probability Theory
*A branch of mathematics concerned with analysis
of random phenomena.
*Statistics
*A branch of mathematics dealing with the
collection, analysis, interpretation, and
presentation of masses of numerical data.
Source: Merriam-Webster Dictionary
*Why use Prob & Stats (P&S)?
*Uncertainties in real world problems are unavoidable.

*Quantitative risk-benefit trade-offs based on P&S can be used
to quantify uncertainty and evaluate its effect on the
performance and design of a system.

*Its application ranges from describing basic information to the
development of bases for design and decision making.

*Uncertainty
*Represents the natural randomness of an underlying phenomenon.
*Successive observations of a system or phenomenon do not produce
exactly the same result.
*Number of credit hours P&S students are registered for.
*Two types:
*Aleatory - Inherent to the phenomenon being studied.
* We cannot control it.
*Epistemic - Associated with inaccuracies in predictions or estimates.
* Can be reduced with improvements in estimation methods.
*Design under Uncertainty
*Mastery of probability allows us to assess risks and make
better decisions.

*Often involves trade-offs
*Too conservative → Excessively costly
*Too lax → Sacrifices safety



*Population vs. Sample
* Sample - a collection of observations.
* Incomplete knowledge
* Estimators: $\bar{x}$, $s$
* Population - the collection of all individuals or individual items of a particular
type.
* Complete knowledge
* Parameters: $\mu$, $\sigma$
* Typically, we collect samples and use these samples to make inferences
about a population.
* Generalizing info in sample often leads to errors.


*Discrete vs. Continuous
*Continuous - measured on a continuum.
*Associated with a finite or infinite interval of real numbers.
* | 0
* | 5
*Its precision depends on the measuring instrument.
* i.e., number of decimal digits recorded
*Examples:
* Process yield
* Water absorbency


*Discrete vs. Continuous
*Discrete - associated with a finite (e.g., {0, 1, 2, 3, 4, 5}) or countably infinite
(e.g., {0, 1, 2, ...}) range
*Nominal - order does not matter; often used to represent binary data
* Examples:
* Tumor: Cancerous, Non-Cancerous
* Eye Color: Blue, Green, Brown, Black
*Ordinal - order matters; often used to represent count data
* Example:
* Patient's age group: 0 - 9, 10 - 19, 20 - 29, ..., 91+
*Data Type Examples
*Discrete:
*Number of scratches on a surface.
*Number of transmitted bits received in error.
*Number of common stock shares traded per day.

*Continuous:
*Electrical current and voltage.
*Physical measurements
* e.g., length, weight, time, temperature, pressure.
*Discrete or continuous?
1. Number of scratches in your brand new car.
2. Steps between your apartment's exit and II-204.
3. Workout time at the gym.
4. Number of blue M&Ms in a snack size bag.
5. Duration of Pitbull's latest single.



*Descriptive Statistics
*A set of single-number statistics that provide information about a
sample regarding the:
* Center and/or other location of the data
* Variability
* Shape
*Often, accompanied by graphs.
* Example: Exam grades

*Measures of Location
*Provide the analyst with quantitative values of where the
center, or some other location, of data is located.
*Central Tendency = location of the center
*Central Tendency
* Given a sample of n observations $x_1, x_2, \ldots, x_n$

* Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
* Sensitive to outliers
* Most widely used

* Weighted sample mean: $\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
* Considers non-equally important criteria

* Trimmed Mean
* Removes a certain number of the largest and smallest values in a sample.
* The percentage in a p% trimmed mean indicates how much to shed from each tail

* Median ($p_{50}$)
* Observations must be arranged in increasing order of magnitude
* More robust measure of center than the mean
* Uninfluenced by extreme values or outliers
* Observation in the middle:
* n odd: $p_{50} = x_{((n+1)/2)}$
* n even: $p_{50} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}$

*Weighted Average Example # 1
*Suppose at the end of your first semester, you want to calculate your
GPA. The following table lists the number of credits you were
enrolled in and the grades you obtained:

Course   Credits   Grade
CHEM     4         B
MATH     4         A
PSYCH    3         A
ENGL     3         C
SPAN     3         B

*What is your GPA?

$\bar{x}_w = \frac{4(3) + 4(4) + 3(4) + 3(2) + 3(3)}{4 + 4 + 3 + 3 + 3} = \frac{55}{17} \approx 3.24$
Absolute weights (credits), with grade points A = 4, B = 3, C = 2
*Weighted Average Example # 2
*Suppose at the end of your first semester, you want to
calculate your attendance grade under the following
conditions:
*Attendance was recorded 28 times.
*You have one excused and five unexcused absences.
*You did not come prepared to class 3 times.

*What is your attendance grade?
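$\frac{1(20) + 0(5) + 0.5(3)}{28} \times 0.05 \approx 0.0384 = 3.84\%$
(20 sessions with full credit, 5 unexcused absences with no credit, 3 unprepared sessions with half credit)
Absolute weights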
*Weighted Average Example # 3

*Suppose at the end of your first semester, you want to calculate your
final course grade. The following table lists the score you obtained on each
item, its maximum value, and its share of the final grade:

Item               Score   Value   Grade %
Exam 1             65      100     20%
Exam 2             75      100     20%
Exam 3             89      100     20%
Final Exam         91      100     20%
Quizzes/Projects   135     150     15%
Attendance         30      30      5%

*What is your final grade?

$\frac{65}{100}(0.20) + \frac{75}{100}(0.20) + \frac{89}{100}(0.20) + \frac{91}{100}(0.20) + \frac{135}{150}(0.15) + \frac{30}{30}(0.05) = 0.825$
Relative weights
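The three weighted-average examples can be checked with a short script. This is a minimal sketch, not part of the original slides: the helper name `weighted_mean` is hypothetical, and the grade points (A = 4, B = 3, C = 2) are read off the arithmetic in Example 1.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Example 1 (absolute weights): grade points weighted by credit hours.
grade_points = [3, 4, 4, 2, 3]  # CHEM, MATH, PSYCH, ENGL, SPAN
credits = [4, 4, 3, 3, 3]
gpa = weighted_mean(grade_points, credits)

# Example 2 (absolute weights): credit per session weighted by session counts,
# then scaled by the 0.05 factor used on the slide.
attendance = 0.05 * weighted_mean([1, 0, 0.5], [20, 5, 3])

# Example 3 (relative weights): score fractions weighted by grade shares.
fractions = [65 / 100, 75 / 100, 89 / 100, 91 / 100, 135 / 150, 30 / 30]
shares = [0.20, 0.20, 0.20, 0.20, 0.15, 0.05]
final_grade = weighted_mean(fractions, shares)

print(f"GPA = {gpa:.2f}")
print(f"Attendance = {attendance:.2%}")
print(f"Final grade = {final_grade:.3f}")
```

With absolute weights the division by the sum of the weights does the normalizing; with relative weights the shares already sum to 1, so the same function applies unchanged.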
*Another Measure of Central Tendency
*Mode
*Most frequent value in a data set or probability distribution
*Example: 4 4 2 5 4 11 12 1 5 6 6 7 8 3 10 9 3
* SORTED: 1, 2, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7, 8, 9, 10, 11, 12





Value Freq Value Freq
1 1 7 1
2 1 8 1
3 2 9 1
4 3 10 1
5 2 11 1
6 2 12 1
[Figure: histogram of the example data; horizontal axis: value, vertical axis: Frequency]
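The frequency table for the 17 example values can be rebuilt in a few lines. This is only an illustrative sketch using Python's standard `collections.Counter`; the variable names are not from the slides.

```python
from collections import Counter

data = [4, 4, 2, 5, 4, 11, 12, 1, 5, 6, 6, 7, 8, 3, 10, 9, 3]

# Count how often each value occurs.
freq = Counter(data)

# Keep every value that reaches the highest count, so ties are reported too.
highest = max(freq.values())
modes = sorted(value for value, count in freq.items() if count == highest)

for value in sorted(freq):
    print(value, freq[value])
print("mode(s):", modes)
```

Returning a list rather than a single number is a design choice: it also covers bimodal and multimodal samples.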
*Example: Central Tendency
STEP 1: Order observations from smallest to largest
*Ordered observations: 27, 75, 76, 78, 80, 83, 84, 86, 87, 110

*Mean: $\bar{x} = \frac{27 + 75 + \cdots + 110}{10} = 78.6$
*10% Trimmed Mean: $\bar{x}_{tr(10)} = \frac{75 + \cdots + 87}{8} = 81.125$
* Less sensitive to outliers
* Not as insensitive as the median
*Median:
* Index: $(n + 1)/2 = 11/2 = 5.5$
* $p_{50} = \frac{80 + 83}{2} = 81.5$
* Everything but the middle is trimmed
*Weighted Average:
* Weights: $2(w/2) + 8w = 1$, so $w = 1/9$
* $\bar{x}_w = \frac{1}{18}(27) + \frac{1}{9}(75) + \frac{1}{9}(76) + \cdots + \frac{1}{9}(87) + \frac{1}{18}(110) = 79.72$
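The four measures in this example can be reproduced as follows. This is a sketch under stated assumptions: `trimmed_mean` is a hypothetical helper that rounds the number of trimmed observations down, which may differ from the convention a given software package uses.

```python
import statistics


def trimmed_mean(values, percent):
    """Mean after dropping `percent` % of the observations from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * percent / 100)  # assumption: round down
    if 2 * k >= len(ordered):
        raise ValueError("trimming would remove every observation")
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


data = [83, 27, 80, 84, 75, 76, 110, 78, 86, 87]
ordered = sorted(data)

mean = statistics.mean(data)
trimmed = trimmed_mean(data, 10)
median = statistics.median(data)

# Smallest and largest observations get half the weight of the others.
weights = [0.5] + [1.0] * (len(ordered) - 2) + [0.5]
weighted = sum(w * x for w, x in zip(weights, ordered)) / sum(weights)

print(mean, trimmed, median, round(weighted, 2))
```

The weights are left unnormalized (0.5 and 1) because dividing by their sum gives the same result as using 1/18 and 1/9.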
*Other Measures of Location
*Maximum or Minimum
*Percentiles or quantiles ($x_p$)

*Given a sample of n observations, a value $x_p$ for which a specified
fraction $p$ of the data values is less than or equal to $x_p$

* The percentile rank $100p$ can be any number between 1 and 100

*Quartiles
* Q1: p = 0.25
* Q2: p = 0.50
* Q3: p = 0.75



*Quartiles
*Q1 is the median of the first half of the observations
*Q3 is the median of the second half of the observations


*Example:
*Odd number of observations: 1, 3, 4, 5, 7, 8, 10
Q1 = 3, Q3 = 8 (the median, 5, belongs to neither half)

*Even number of observations: 1, 3, 4, 5, 7, 8, 10, 12
Q1 = 3.5, Q3 = 9

HINT: We always need to sort our observations
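The median-of-halves rule can be sketched in code. The function name `quartiles_by_halves` is hypothetical, and leaving the median out of both halves when n is odd is inferred from the example's answers (Q1 = 3, Q3 = 8); other textbooks and software include it.

```python
import statistics


def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median of each half of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]
    # When n is odd the middle observation is left out of both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))


odd = quartiles_by_halves([1, 3, 4, 5, 7, 8, 10])
even = quartiles_by_halves([1, 3, 4, 5, 7, 8, 10, 12])
print(odd)
print(even)
```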
*Percentiles
*Different software packages use different equations.
*One potential formula:
* $i = np + 0.5$
*where:
* $i$ = index
* First observation: $i = 1$
* $n$ = total number of observations
* $p$ = fraction
*For the following set of data: 12, 14, 15, 18, 20, 21, 23, 24
*Estimate the 62.5th percentile
$i = 0.625(8) + 0.5 = 5.5$

$x_{0.625} = \frac{x_{(5)} + x_{(6)}}{2} = \frac{20 + 21}{2} = 20.5$
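The index formula i = np + 0.5 can be turned into a small function. This is a sketch, not the slides' own method: the slide only shows the case where the index ends in .5, so the linear interpolation for other fractional indices and the clamping of the index to the range 1..n are assumptions.

```python
import math


def percentile(values, p):
    """Estimate the quantile x_p using the index i = n * p + 0.5."""
    ordered = sorted(values)
    n = len(ordered)
    i = n * p + 0.5           # 1-based, possibly fractional index
    i = min(max(i, 1), n)     # assumption: clamp to the available observations
    lower = math.floor(i)
    upper = math.ceil(i)
    fraction = i - lower
    # Assumption: interpolate linearly between the two neighboring observations.
    return ordered[lower - 1] + fraction * (ordered[upper - 1] - ordered[lower - 1])


data = [12, 14, 15, 18, 20, 21, 23, 24]
x_625 = percentile(data, 0.625)
x_50 = percentile(data, 0.50)
print(x_625, x_50)
```

For p = 0.5 the formula lands halfway between the 4th and 5th observations, so it agrees with the usual median for this sample.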
*Measures of Variability
*Variability reduction can be a challenging task.
*Cannot make decisions focusing only on central tendency.
*Sometimes referred to as measures of spread.
*Variability is always our ENEMY!
* It is impossible to get rid of all variability.


4


Process 1: 85 86 85 87 85.75
Process 2: 95 85 80 84 87.5
*Variability
* Given a sample of n observations $x_1, x_2, \ldots, x_n$

*Sample variance: $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}{n - 1}$
* Sensitive to outliers
* Most widely used
* Degrees of freedom (df): n - 1
* Lost one df estimating the sample mean
* Since $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$, only n - 1 deviations from the mean are freely determined.

*Sample standard deviation: $s = \sqrt{s^2}$

*Range / Interquartile Range (IQR): $r = \max(x_i) - \min(x_i)$; $IQR = Q_3 - Q_1$
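As an illustration, the sketch below computes the sample variance, standard deviation, and range for the Process 1 and Process 2 observations listed under Measures of Variability. The function names are hypothetical; the two variance functions implement the definitional form and the computational shortcut, which should agree up to rounding.

```python
import math
import statistics


def sample_variance(values):
    """Definitional form: squared deviations from the mean over n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Computational form: (sum of x^2 - (sum of x)^2 / n) over n - 1."""
    n = len(values)
    return (sum(x * x for x in values) - sum(values) ** 2 / n) / (n - 1)


process_1 = [85, 86, 85, 87]
process_2 = [95, 85, 80, 84]

for name, data in (("Process 1", process_1), ("Process 2", process_2)):
    s2 = sample_variance(data)
    spread = max(data) - min(data)
    print(name, round(s2, 3), round(math.sqrt(s2), 3), spread)

# The deviations from the mean always sum to zero, which is why one
# degree of freedom is lost.
mean_1 = sum(process_1) / len(process_1)
deviation_sum = sum(x - mean_1 for x in process_1)
```

The standard library's `statistics.variance` also divides by n - 1, so it can serve as a cross-check.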
*Team Exercise # 1
*Given the following set of observations:
*83, 27, 80, 84, 75, 76, 110, 78, 86, 87

1. Calculate all measures of central tendency
*Mean
*10% Trimmed Mean
*Median
*Weighted average - smallest and largest observations get half
the weight of the remaining observations






*Why use graphical summaries?
*Can easily convey information about:
1. Central tendency
2. Spread / Variability
3. Shape




Source: oiip.uprm.edu/
*Simple graphical displays
*Scatter Plot
*Dot Diagram
*Histograms
*Stem and leaf
*Box plots
*Time series
*Shape
*Skewness
* A symmetric distribution can be folded along a vertical axis so that the two sides
coincide.
* Skewness measures the lack of symmetry with respect to a vertical axis.
* Associated with long tails.
* Skewed right (positively skewed) - long tail toward the right.
* Skewed left (negatively skewed) - long tail toward the left.
*Kurtosis
* Measures the peakedness of a distribution.
* A Normal distribution has a kurtosis of 3.
* More peaked: Kurtosis > 3
* Less peaked: Kurtosis < 3
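Skewness and kurtosis can be estimated from central moments. The sketch below uses the plain moment-based definitions, under which a Normal distribution has kurtosis 3; this is one common convention, and some software applies bias corrections or reports excess kurtosis (kurtosis minus 3) instead. The two small data sets are made up for illustration.

```python
def central_moment(values, k):
    """k-th central moment, dividing by n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n


def skewness(values):
    """Third central moment scaled by the 1.5 power of the second."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Fourth central moment scaled by the square of the second."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


symmetric = [1, 2, 3, 4, 5]       # hypothetical data
right_skewed = [1, 2, 2, 3, 10]   # hypothetical data with a long right tail

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_skewed), kurtosis(right_skewed))
```

A skewness near zero suggests symmetry, a positive value a long right tail, and a negative value a long left tail.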





*Shape - 2
A. Unimodal - one mode
B. Bimodal - two modes
C. Multimodal - multiple modes





[Figure: example distributions labeled A, B, and C]
*Shape - Example
*Describe the shape of the following grade distributions
*Scatter Diagram
[Figure: scatter plot; vertical axis: Tensile Strength, horizontal axis: Cotton Percentages]
*Time Series Plot
*Shows observed values for a given variable along with a time stamp
* Vertical axis - observed values
* Horizontal axis - time
*Conveys information about changes in central tendency and/or
variability over time.
[Figure: two time series plots; horizontal axis: t, from 0 to 200]
*Dot Diagram
*Often used with discrete data.
*Observations: 1, 5, 3, 5, 1, 2, 1, 1, 2, 1, 4, 5, 3, 3, 4, 1, 3, 2, 4, 1, 5, 2,
3, 1, 3, 1, 2, 2, 3, 5, 4, 3, 2, 4, 2, 5, 4, 1, 2, 5, 1, 3, 5, 3, 4, 5, 3, 5, 1, 4


[Figure: dot diagram; horizontal axis: Observed values, from 1 to 5]
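A text version of the dot diagram can be produced by counting each observed value and printing one dot per observation. This is an illustrative sketch; the horizontal layout (one row per value) is a simplification of the usual stacked-dot display.

```python
from collections import Counter

observations = [
    1, 5, 3, 5, 1, 2, 1, 1, 2, 1, 4, 5, 3, 3, 4, 1, 3, 2, 4, 1, 5, 2,
    3, 1, 3, 1, 2, 2, 3, 5, 4, 3, 2, 4, 2, 5, 4, 1, 2, 5, 1, 3, 5, 3,
    4, 5, 3, 5, 1, 4,
]

# One counter entry per distinct observed value.
counts = Counter(observations)

for value in sorted(counts):
    print(f"{value} | " + "." * counts[value])
```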
*Histogram
Steps:
1. Divide the range of the data into intervals (or bins)
* Common choices
* Number of bins is approximately equal to $\sqrt{n}$
* Number of bins between 5 and 20
2. Select the form of the intervals
* Equal size [bin width = (max - min) / number of bins]
* Equal probability
3. Define y-axis
* Absolute frequency - counts
* Relative frequency - percentage
* Cumulative frequency - either in the form of counts or percentages
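The three steps can be sketched for equal-size bins, here applied to the ten observations from the central tendency example. The function name `equal_width_histogram` is hypothetical. The sketch assumes a square-root rule for the default number of bins and left-closed bins, with the maximum placed in the last bin.

```python
import math
from itertools import accumulate


def equal_width_histogram(values, bins=None):
    """Return (edges, counts) for equal-width, left-closed bins."""
    n = len(values)
    k = bins if bins is not None else max(1, round(math.sqrt(n)))
    low, high = min(values), max(values)
    width = (high - low) / k
    counts = [0] * k
    for x in values:
        # Assumption: bins are [a, b); the maximum falls in the last bin.
        index = min(int((x - low) / width), k - 1) if width > 0 else 0
        counts[index] += 1
    edges = [low + j * width for j in range(k + 1)]
    return edges, counts


data = [27, 75, 76, 78, 80, 83, 84, 86, 87, 110]
edges, absolute = equal_width_histogram(data)
relative = [count / len(data) for count in absolute]
cumulative = list(accumulate(absolute))

print(absolute)
print(relative)
print(cumulative)
```

With only ten observations the square-root rule gives three bins, fewer than the 5 to 20 range suggested above; passing `bins=5` overrides the default.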


*Note on Histograms
*Right Closed vs. Left Closed
*Suppose you have data between 0 and 100.
*You intend to use 10 bins in a histogram.
Where does 10 go?
[Figure: histogram with 10 bins; horizontal axis: 0 to 100, vertical axis: Frequency]
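The question "Where does 10 go?" depends on which side of each bin is closed. The sketch below, with hypothetical function names, assigns a value to a zero-based bin index under each convention for 10 bins of width 10 on the range 0 to 100.

```python
import math


def bin_left_closed(x, low, width, k):
    """Bins are [a, b); the last bin also includes its right edge."""
    return min(int((x - low) // width), k - 1)


def bin_right_closed(x, low, width, k):
    """Bins are (a, b]; the first bin also includes its left edge."""
    return max(math.ceil((x - low) / width) - 1, 0)


low, width, k = 0, 10, 10
for x in (0, 10, 15, 100):
    print(x, bin_left_closed(x, low, width, k), bin_right_closed(x, low, width, k))
```

Under the left-closed convention 10 falls in the second bin, [10, 20); under the right-closed convention it falls in the first bin, (0, 10]. Values that are not on a bin edge, such as 15, land in the same bin either way.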
*Histograms in Quality Control (QC)
*Pareto analysis is the use of histograms to identify
improvement opportunities.
*In QC, Pareto analysis is used to identify quality costs by
category, by product, or by type of defect or nonconformity.
*80/20 Rule of Thumb:
*Vital few
*Find the few problems that drive the majority of the quality costs

[Figure: Pareto chart of monthly quality-costs information for assembly of printed circuit boards; vertical axis: cost, from $0 to $40,000; categories: Insufficient solder, Misaligned components, Defective components, Missing components, Cold solder joints, All other causes]
*Histogram: Example 1
[Figure: two histograms; horizontal axis: Observed values; panels: Absolute Frequency and Relative Frequency]
*Histogram: Example 1 (Cont'd)
[Figure: Cumulative Frequency histogram; horizontal axis: Observed Values, vertical axis: Frequency]
*Histogram: Example 2
*Central tendency / Dispersion ?
[Figure: two histograms side by side; horizontal axis: Observed values, vertical axis: Frequency]
Simulated with the same mean but different variance.
Which sample has the largest variability?
A = Left B = Right C = Can't tell
*Histogram: Example 3
[Figure: two histograms side by side; horizontal axis: Observed values, vertical axis: Frequency]
*Central tendency / Dispersion ?
Simulated with the same variance and a different mean.
*Histogram: Example 4
[Figure: two histograms side by side; horizontal axis: Observed values, vertical axis: Frequency]
*Central tendency / Dispersion ?
Simulated with different mean and different variance.
Which sample has the largest mean? Variability?
A = Left B = Right C = Can't tell
*Stem-and-Leaf
1. Order data
2. Divide data into two parts
*Stem - one or more of the leading digits
*Leaf - remaining digits
3. List the stem values in a vertical column
*Sometimes each stem is split into an upper (U) and a lower (L) half
4. Record each leaf beside its stem
5. Include units for stems and leaves on display
6. OPTIONAL: Keep a tally of the number of leaves per stem
*Stem-and-Leaf: Nicotine Data
[Figure: stem-and-leaf display of nicotine content in cigarettes]
*Stem-and-Leaf Example
*Sample:
81 63 54 89 34 60 89 19 70 59 40 28 42 59 72 32 31 10 24 17
*Ordered data (n = 20):
*10 17 19 24 28 31 32 34 40 42 54 59 59 60 63 70 72 81 89 89

Stem   Leaves   Count
1      0 7 9    3
2      4 8      2
3      1 2 4    3
4      0 2      2
5      4 9 9    3
6      0 3      2
7      0 2      2
8      1 9 9    3
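The display above can be generated programmatically. This is a minimal sketch that assumes two-digit integer data, so the stem is the tens digit and the leaf is the ones digit; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group two-digit integers by tens digit (stem) and ones digit (leaf)."""
    display = defaultdict(list)
    for x in sorted(values):
        stem, leaf = divmod(x, 10)
        display[stem].append(leaf)
    return dict(display)


sample = [81, 63, 54, 89, 34, 60, 89, 19, 70, 59,
          40, 28, 42, 59, 72, 32, 31, 10, 24, 17]
display = stem_and_leaf(sample)

for stem in sorted(display):
    leaves = " ".join(str(leaf) for leaf in display[stem])
    print(f"{stem} | {leaves:<6} ({len(display[stem])})")
```

Stems with no observations are simply omitted by this sketch; a fuller display would usually print them with an empty leaf list so that gaps in the data remain visible.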
*Box Plots
*Convey information about central tendency, dispersion, and
shape
*Central tendency: median
*Variability: interquartile range (IQR)
*Additional summary statistics: min, max, *mean*

[Figure: box plot with numbered callouts 1 to 7]
*Outliers in Box Plots
*Outliers?
*Any point larger than Q3 + 1.5 × IQR?
*Any point smaller than Q1 - 1.5 × IQR?
*Extreme outliers?
*Any point larger than Q3 + 3 × IQR?
*Any point smaller than Q1 - 3 × IQR?
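These fence rules can be applied mechanically. The sketch below uses Sample 2 from the box plot example in this deck (4, 6, 8, 13, 15, 16, 20, 36) and computes quartiles as the medians of the two halves; the function names are hypothetical, and other quartile conventions may move the fences slightly.

```python
import statistics


def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves of the sorted data."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if len(ordered) % 2 else ordered[half:]
    return statistics.median(lower), statistics.median(upper)


def find_outliers(values, multiplier):
    """Points beyond Q1 - multiplier * IQR or Q3 + multiplier * IQR."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - multiplier * iqr
    high_fence = q3 + multiplier * iqr
    return [x for x in values if x < low_fence or x > high_fence]


sample_2 = [4, 6, 8, 13, 15, 16, 20, 36]
outliers = find_outliers(sample_2, 1.5)
extreme_outliers = find_outliers(sample_2, 3)

print(outliers)
print(extreme_outliers)
```

Here Q1 = 7 and Q3 = 18, so IQR = 11 and the upper fences are 34.5 and 51: the value 36 is an outlier but not an extreme outlier.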


*Link to traditional tools
Source: Managing, Controlling, and Improving Quality (2011)
*Link to current literature
Source: IEEE/ACS International Conference on Computer Systems and Applications, 2009.
*Team Exercise # 2
1. Calculate $p_{41}$ (the 41st percentile) for the following observations:
* 19, 10, 15, 14, 21, 12, 13, 7, 13, 11, 9
2. Calculate the min, mean, median, and max for the following observations:
* 19, 10, 15, 14, 21, 27, 20, 25
3. Calculate $s$ and $s^2$ for the following observations: 19, 10, 15, 14, 21, 12, 13
4. Calculate Q1 and Q3 for the following observations: 19, 10, 15, 14, 21, 27, 20, 25
*Box Plot Example
*Sample 1: 4,6,8,13,15,16,20
*Sample Mean: ?
*Median (Q2): ?
*First quartile (Q1): ?
*Third quartile (Q3): ?
*IQR: ?



*Sample 2: 4,6,8,13,15,16,20,36
*Sample Mean: ?
*Median (Q2): ?
*First quartile (Q1): ?
*Third quartile (Q3): ?
*IQR: ?

*Box Plot Example - 2
*Sample 1: 4,6,8,13,15,16,20
*Sample Mean: 11.71
*Median (Q2): 13
*First quartile (Q1): 6
*Third quartile (Q3): 16
*IQR: 10


*Sample 2: 4,6,8,13,15,16,20,36
*Sample Mean: 14.75
*Median (Q2): 14
*First quartile (Q1): 7
*Third quartile (Q3): 18
*IQR: 11

[Figure: box plots of Sample 1 and Sample 2; axis from 5 to 35]
*Box Plot Example - 3
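[Figure: side-by-side box plots of Sample 1 (S1) and Sample 2 (S2)]
Sample 1 (S1): min = 4, Q1 = 6, Q2 = 13, Q3 = 16, max = 20
Sample 2 (S2): min = 4, Q1 = 7, Q2 = 14, Q3 = 18, upper whisker = 20, outlier = 36
Which sample has the largest variability?
A = S1 B = S2 C = Can't tell
Which is symmetric?
A = S1 B = S2 C = None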
*
a. Mean
b. Median
c. Mode
d. Standard deviation
e. Skewness
*
a. TRUE
b. FALSE
*
a. TRUE
b. FALSE
*
a. IQR
b. Range
c. Variance
d. Standard deviation
e. Weighted average
*
a. TRUE
b. FALSE
*
a. Standard deviation
b. Variance
c. Range
d. Interquartile range
*
[Figure: box plots of S1 and S2; axis from 5 to 35]
a. S1
b. S2
c. Both have approx. the same variability.
*
[Figure: plot with axis from 0.0 to 0.3]
a. Symmetric
b. Skewed right
c. Skewed left
*
[Figure: plot with axis from 0.70 to 1.00]
a. Symmetric
b. Skewed right
c. Skewed left
*
[Figure: plot with both axes from 0 to 4]
a. Yes
b. No
*
[Figure: plot with both axes from 0 to 4]
a. Yes
b. No
*
a. Standard deviation
b. Mean
c. Variance
d. Median
e. 5th percentile
f. Range
*
a. Standard deviation
b. Mean
c. Variance
d. Median
e. 5th percentile
f. Range
