Graph
To Examine relationships between pairs of variables
Use Scatterplot, Matrix Plot, or Marginal Plot
To Examine and compare distributions
Use Histogram, Dotplot, Stem-and-Leaf, Probability Plot, Empirical CDF, or
Boxplot
To Compare summaries or individual values of a variable
Use Boxplot, Interval Plot, Individual Value Plot, Bar Chart, or Pie Chart
To Plot a series of data over time
Use Time Series Plot, Area Graph, or Scatterplot
To Examine relationships among three variables
Use Contour Plot, 3D Scatterplot, or 3D Surface Plot
3/24/16
■ Bar Chart
– If a bar chart is meant for comparing categories, vertical bars
sorted by height are easier to read
■ Bar Chart
– To visualize more than one categorical variable, the
stacked bar chart, the grouped bar chart, and the spine plot
can be used.
– Plot of sex and the first letter of baby names:
■ Bar Chart
– The focus of analysis can be shifted from the conditioning
variable to the conditioned variable by using the grouped
bar chart
■ Pie Chart
– A pie chart helps us see what part of the whole each group forms
– To make a pie chart, you must include all the categories
that make up a whole
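The requirement that a pie chart cover every category can be checked numerically: the slice percentages must add up to 100%. The sketch below uses made-up category counts (the names and numbers are assumptions, not data from the slides).

```python
# Hypothetical category counts, invented for illustration.
counts = {"Bus": 18, "Car": 42, "Bicycle": 12, "Walk": 8}

total = sum(counts.values())
# Each slice is the category's share of the whole, in percent.
shares = {name: 100 * n / total for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name:<8} {pct:5.1f}%")

# If a category were left out, the shares of the true whole would not reach 100%.
print("sum of shares:", sum(shares.values()))
```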
■ While visualizing the data, ask the following questions:
– Are there any outliers?
– How many modes?
– What are specific characteristics of shape?
– Is the shape a symmetrical distribution?
– Is the shape a normal distribution?
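Two of these questions (outliers and symmetry) can also be checked numerically. The sketch below is only an illustration on an invented sample. It assumes the common 1.5 × IQR rule for flagging outliers, which the slides do not prescribe, and uses Python's `statistics.quantiles` with its default method, so quartiles may differ slightly from other packages.

```python
import statistics

# Hypothetical sample, invented for illustration.
data = [52, 55, 57, 58, 60, 61, 63, 64, 66, 98]

# Outliers: flag values beyond 1.5 * IQR from the quartiles (assumed convention).
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

# Symmetry: in a symmetric distribution the mean and median roughly agree,
# and the moment-based skewness is near zero.
mean = statistics.mean(data)
median = statistics.median(data)
sd = statistics.pstdev(data)
skewness = sum((x - mean) ** 3 for x in data) / (len(data) * sd ** 3)

print("outliers:", outliers)
print("mean:", mean, "median:", median)
print("skewness:", round(skewness, 2))
```

A mean well above the median together with positive skewness points to a long right tail. Counting modes and judging normality are usually easier from a histogram or probability plot.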
■ Graph of Continuous Data
– Dotplot
– Histogram
– Scatterplot
– Boxplot
– Stem-and-leaf
– Time Series Plot
– More…
Dot Plot
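[Figure: dotplot, variable C3]

Stem-and-leaf display (columns: depth, stem, leaves; the depth in
parentheses marks the row containing the median):

   1   3  7
   2   4  2
   4   4  89
   5   5  3
   9   5  5789
  14   6  02234
  18   6  5689
  23   7  01234
  (9)  7  556778899
  18   8  001344
  12   8  56789
   7   9  0023
   3   9  589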
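To show how the stem and leaf columns of such a display are built, the sketch below rebuilds them from raw values. The values are read back from the display above ASSUMING a leaf unit of 1, which the slides do not state. The helper `stem_and_leaf` is a hypothetical name, and the depth column is not reproduced.

```python
from collections import defaultdict

# Values read back from the display, ASSUMING a leaf unit of 1
# (the slides do not state the leaf unit).
data = [37, 42, 48, 49, 53, 55, 57, 58, 59, 60, 62, 62, 63, 64,
        65, 66, 68, 69, 70, 71, 72, 73, 74, 75, 75, 76, 77, 77,
        78, 78, 79, 79, 80, 80, 81, 83, 84, 84, 85, 86, 87, 88,
        89, 90, 90, 92, 93, 95, 98, 99]

def stem_and_leaf(values):
    """Return (stem, leaves) rows with each stem split into leaves 0-4 and 5-9."""
    rows = defaultdict(str)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        # The boolean marks the upper half (leaves 5-9) of the split stem.
        rows[(stem, leaf >= 5)] += str(leaf)
    return [(stem, leaves) for (stem, _), leaves in sorted(rows.items())]

rows = stem_and_leaf(data)
for stem, leaves in rows:
    print(f"{stem:>2} | {leaves}")
```

Each stem appears on up to two rows because it is split in half, which matches the two-line-per-stem layout of the display.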
Histogram
Bihistogram
■ The bihistogram is an EDA tool for assessing whether a
before-versus-after engineering modification has caused a
change in
■ location;
■ variation; or
■ distribution.
■ It is a graphical alternative to the two-sample t-test. The
bihistogram can be more powerful than the t-test in that all
of the distributional features (location, scale, skewness,
outliers) are evident on a single plot.
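The essential ingredient of a bihistogram is that both samples are counted in the same bins and drawn back to back on a common axis. The sketch below is a text-only approximation, rotated so that "before" bars grow to the left and "after" bars to the right. The measurements, bin edges, and the helper name `bin_counts` are all assumptions made for illustration.

```python
# Hypothetical before/after measurements, invented for illustration.
before = [9.8, 10.1, 10.4, 10.6, 10.9, 11.2, 11.3, 11.7, 12.0, 12.4]
after = [10.9, 11.1, 11.4, 11.6, 11.8, 12.1, 12.2, 12.5, 12.9, 13.3]

def bin_counts(values, lo, width, n_bins):
    """Count values in n_bins equal-width bins starting at lo."""
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # last bin includes its upper edge
        counts[idx] += 1
    return counts

# Shared bins for both samples: [9, 10), [10, 11), ..., [13, 14].
lo, width, n_bins = 9.0, 1.0, 5
before_counts = bin_counts(before, lo, width, n_bins)
after_counts = bin_counts(after, lo, width, n_bins)

# "Before" bars grow leftward and "after" bars rightward from a common axis.
for i in range(n_bins):
    edge = lo + i * width
    print(f"{'#' * before_counts[i]:>10} | {edge:4.1f} | {'#' * after_counts[i]}")
```

With these invented numbers the "after" bars sit about one bin higher, which is how a shift in location would show up. A change in variation would appear as one side being wider than the other.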
Box Plots
■ Plot several boxplots side by side to compare the location
and variation of several data sets.
■ Although the medians are all roughly the same, the spread of each data
set is different.
■ The first boxplot shows data that appear to be evenly distributed. The
median is in the middle of the rectangle, and the whiskers are about the
same length. The plot contains no outside values.
■ The median of the second plot from the left appears to be slightly
off-center. The number of extreme values is a point of concern because it
suggests that the data vary widely.
■ The third boxplot shows data that has less variation and spread than the
other plots.
■ The fourth boxplot shows data that are strongly skewed upward. The
median of this plot is closer to the top of the rectangle than to the
bottom, and the upper whisker is longer than the lower one.
■ All the boxplots have approximately the same median, and the two
boxplots on the left have approximately the same variation in the data.
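The comparison that side-by-side boxplots support can be mirrored in numbers: the median gives the location of each box and the interquartile range its height. The sketch below uses three invented data sets (not the ones plotted on the slide) and a hypothetical helper `box_summary`. Quartiles come from `statistics.quantiles` with its default method, so other software may give slightly different values.

```python
import statistics

# Hypothetical data sets, invented for illustration.
groups = {
    "A": [4, 6, 8, 10, 12, 14, 16],
    "B": [1, 5, 9, 10, 11, 15, 19],
    "C": [8, 9, 9, 10, 11, 11, 12],
}

def box_summary(values):
    """Median and quartiles that define the box of a boxplot."""
    q1, med, q3 = statistics.quantiles(values, n=4)
    return {"q1": q1, "median": med, "q3": q3, "iqr": q3 - q1}

summaries = {name: box_summary(vals) for name, vals in groups.items()}

for name, s in summaries.items():
    print(f"{name}: median={s['median']}, box from {s['q1']} to {s['q3']}, IQR={s['iqr']}")
```

Here all three groups share the same median while their IQRs differ, the same pattern the bullets above describe: equal location, unequal spread.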
ScatterPlot
ScatterPlot Matrix
Measurement in Statistics
Measures of central tendency
■ Mean
■ Sum
■ Trimmed mean (TrMean)
■ Median
Measures of dispersion
■ Standard deviation (StDev)
■ Variance
■ Coefficient of variation (CoefVar)
■ Range
■ Interquartile range (IQR)
Measures of position
■ First quartile (Q1)
■ Third quartile (Q3)
Distribution shape
■ Skewness
■ Kurtosis
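Two of the less familiar measures in this list, the trimmed mean and the coefficient of variation, can be sketched as follows. The 5% trimming proportion, the rounding of the trim count, and the function names are assumptions for illustration. Statistical packages may trim or round differently.

```python
import statistics

def trimmed_mean(values, proportion=0.05):
    """Mean after dropping `proportion` of the values from each end.

    The 5% default is an assumption for illustration.
    """
    ordered = sorted(values)
    k = round(len(ordered) * proportion)  # number of values dropped per end
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.mean(kept)

def coef_var(values):
    """Coefficient of variation: sample standard deviation as a percent of the mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)

# Hypothetical sample with one extreme value.
sample = [10, 11, 12, 12, 13, 13, 14, 14, 15, 16,
          16, 17, 17, 18, 18, 19, 19, 20, 21, 95]

print("mean:        ", statistics.mean(sample))
print("trimmed mean:", round(trimmed_mean(sample), 2))
print("CoefVar (%): ", round(coef_var(sample), 1))
```

Trimming removes the smallest and largest value here, so the trimmed mean is pulled much less by the extreme observation than the ordinary mean.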
Measures of Central Tendency
■ Mode
– Used if data are measured along a nominal scale
■ Median
– Used if data are measured along an ordinal scale
– Used if interval data do not meet requirements for using the mean
■ Mean
– Used if data are measured along an interval or ratio scale
– Most sensitive measure of center
– Used if scores are normally distributed
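The mean's sensitivity, and the reason the median is preferred when data do not suit the mean, can be seen by adding a single extreme score. The data below are invented for illustration.

```python
import statistics

# Hypothetical interval-scale scores, then the same scores plus one extreme value.
scores = [2, 3, 3, 4, 5, 5, 5, 6, 7]
with_extreme = scores + [60]

mean_before = statistics.mean(scores)
mean_after = statistics.mean(with_extreme)
median_before = statistics.median(scores)
median_after = statistics.median(with_extreme)

print("mean:  ", round(mean_before, 2), "->", mean_after)
print("median:", median_before, "->", median_after)

# For nominal data only the mode is meaningful.
colours = ["red", "blue", "red", "green", "red"]  # hypothetical nominal data
most_common = statistics.mode(colours)
print("mode:", most_common)
```

One extreme score more than doubles the mean while the median stays where it was.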
Mean or Median?
Measures of Spread
■ Range
– Subtract the lowest from the highest score in a distribution of scores
– Simplest and least informative measure of spread
– Very sensitive to extreme scores
■ Semi-Interquartile Range
– Less sensitive than the range to extreme scores
– Used when you want a simple, rough estimate of spread
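A minimal sketch of both measures, on invented scores, shows why the semi-interquartile range is less sensitive to extreme scores than the range. The function names are hypothetical. The semi-interquartile range is taken as (Q3 − Q1) / 2, with quartiles from `statistics.quantiles` and its default method.

```python
import statistics

def value_range(values):
    """Highest score minus lowest score."""
    return max(values) - min(values)

def semi_interquartile_range(values):
    """Half the distance between the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return (q3 - q1) / 2

# Hypothetical scores, and the same scores with the top value made extreme.
scores = [12, 14, 15, 15, 16, 17, 18, 19, 20, 21, 23]
with_extreme = scores[:-1] + [90]

print("range:", value_range(scores), "->", value_range(with_extreme))
print("SIQR: ", semi_interquartile_range(scores), "->",
      semi_interquartile_range(with_extreme))
```

Replacing the top score with an extreme value changes the range a great deal but leaves the semi-interquartile range unchanged, because the quartiles do not depend on the most extreme scores.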
■ Variance
– Average squared deviation of scores from the mean
■ Standard Deviation
– Square root of the variance
– Most widely used measure of spread
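The two definitions can be worked through step by step on a small invented list. The sketch follows the wording above literally and divides by the number of scores (the population form). Sample formulas that divide by n − 1, such as `statistics.variance`, give a slightly larger value.

```python
import math
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores

mean = sum(scores) / len(scores)
# Squared deviation of each score from the mean.
squared_devs = [(x - mean) ** 2 for x in scores]
# Variance: the average of the squared deviations.
variance = sum(squared_devs) / len(scores)
# Standard deviation: the square root of the variance.
sd = math.sqrt(variance)

print("mean:", mean)
print("variance:", variance)
print("standard deviation:", sd)

# Cross-check against the standard library's population formulas.
print(statistics.pvariance(scores), statistics.pstdev(scores))
```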
Interpreting the SD
For many lists of observations, especially if their histogram is
bell-shaped:
1. Roughly 68% of the observations in the list lie within 1
standard deviation of the average
2. Roughly 95% of the observations lie within 2 standard deviations of
the average
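[Figure: bell-shaped histogram with the axis marked at Ave-2s.d., Ave-s.d., Average, Ave+s.d., and Ave+2s.d.; the central bands are labelled 68% and 95%]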
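The rule of thumb can be checked empirically on simulated bell-shaped data. In the sketch below the mean of 100, the standard deviation of 15, the sample size, and the random seed are arbitrary choices, and the helper `share_within` is a hypothetical name.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable
# Simulated bell-shaped data; mean 100 and s.d. 15 are arbitrary choices.
observations = [random.gauss(100, 15) for _ in range(10_000)]

ave = statistics.mean(observations)
sd = statistics.pstdev(observations)

def share_within(k):
    """Fraction of observations within k standard deviations of the average."""
    return sum(abs(x - ave) <= k * sd for x in observations) / len(observations)

within_1 = share_within(1)
within_2 = share_within(2)
print(f"within 1 s.d.: {within_1:.1%}")
print(f"within 2 s.d.: {within_2:.1%}")
```

The printed shares should land close to 68% and 95%. For data that are far from bell-shaped, the same check can give quite different percentages.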