
3/24/16

VISUAL DATA EXPLORATORY (VDE)
Kartika Fithriasari

Graph
To examine relationships between pairs of variables
Use Scatterplot, Matrix Plot, or Marginal Plot
To examine and compare distributions
Use Histogram, Dotplot, Stem-and-Leaf, Probability Plot, Empirical CDF, or Boxplot
To compare summaries or individual values of a variable
Use Boxplot, Interval Plot, Individual Value Plot, Bar Chart, or Pie Chart
To plot a series of data over time
Use Time Series Plot, Area Graph, or Scatterplot
To examine relationships among three variables
Use Contour Plot, 3D Scatterplot, or 3D Surface Plot


Visualizing Categorical Data

■ Bar Chart
– If the bar chart is designed specifically for comparison
purposes, it is better in the vertical sorted form

■ Bar Chart
– To visualize more than one categorical variable, the
stacked bar chart, the grouped bar chart, and the spine plot
can be used.
– Plot of sex against the first letter of baby names:

Figure panels: stacked bar chart, rescaled stacked bar chart, spine plot


■ Bar Chart
– The focus of analysis can be shifted from the conditioning
variable to the conditioned variable by using the grouped
bar chart

■ Frequency data table


– Categorical data can be summarized in a frequency table. Note that
percentages are often called relative frequencies.
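
As a minimal sketch of such a table, the following counts each category and converts the counts to relative frequencies. The function name `frequency_table` and the sample labels are hypothetical, chosen only for illustration.

```python
from collections import Counter

def frequency_table(values):
    """Return {category: (count, relative_frequency)} for a list of labels."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, n / total) for cat, n in counts.items()}

# Hypothetical sample: sex recorded for eight babies.
sex = ["F", "M", "F", "F", "M", "F", "M", "F"]
table = frequency_table(sex)

for cat, (n, rel) in sorted(table.items()):
    print(f"{cat}  count={n}  relative frequency={rel:.1%}")
```

The relative frequencies always sum to 1, which is also why a pie chart needs every category of the whole.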


■ Pie Chart
– A pie chart helps us see what part of the whole each group forms
– To make a pie chart, you must include all the categories
that make up a whole

Visualizing Continuous Data

■ During the visualization process, the following questions should be asked:
– Are there any outliers?
– How many modes?
– What are specific characteristics of shape?
– Is the shape a symmetrical distribution?
– Is the shape a normal distribution?
■ Graphs of continuous data
– Dotplot
– Histogram
– Scatterplot
– Boxplot
– Stem-and-leaf
– Time Series Plot
– More…


Dot Plot

■ The dotplot is the simplest representation of the distribution of a
univariate variable: every raw data point is shown on the graph.
Though the dotplot gives the most detailed information about a
numerical variable, it becomes hard to read as the number of
observations increases.
Figure 4. Dot plot of soccer players' heights (axis C3, ticks from 1.80 to 1.90)

Stem and Leaf Plots


■ The stem and leaf display is a form of histogram that retains the actual data values
while showing the shape of their distribution.
■ Tukey (1977) invented the stem and leaf display as a quick means of writing down a
set of data values, producing a data display from which summary values can be
easily calculated.
Stem-and-leaf of Nilai PMS N = 50
Leaf Unit = 1.0

1 3 7
2 4 2
4 4 89
5 5 3
9 5 5789
14 6 02234
18 6 5689
23 7 01234
(9) 7 556778899
18 8 001344
12 8 56789
7 9 0023
3 9 589
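
A stem-and-leaf display can be sketched in a few lines. The values below are read back from the leaves of the display above. The function name `stem_and_leaf` is hypothetical, the sketch assumes non-negative values, and unlike the display above it prints one line per stem and omits the depth column.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Group values into {stem: sorted list of leaves} (assumes values >= 0)."""
    stems = defaultdict(list)
    for v in sorted(values):
        scaled = int(v // leaf_unit)      # express the value in leaf units
        stems[scaled // 10].append(scaled % 10)
    return dict(stems)

# The 50 values recovered from the display above.
scores = [
    37, 42, 48, 49, 53, 55, 57, 58, 59,
    60, 62, 62, 63, 64, 65, 66, 68, 69,
    70, 71, 72, 73, 74, 75, 75, 76, 77, 77, 78, 78, 79, 79,
    80, 80, 81, 83, 84, 84, 85, 86, 87, 88, 89,
    90, 90, 92, 93, 95, 98, 99,
]

display = stem_and_leaf(scores)
for stem, leaves in sorted(display.items()):
    print(f"{stem} | {''.join(map(str, leaves))}")
```

Because every leaf is an actual digit of a data value, the original values can be rebuilt from the display, which is what distinguishes it from a histogram.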


Histogram

■ The purpose of a histogram is to graphically summarize the distribution of a
univariate data set.
■ The histogram graphically shows the following:
– center (i.e., the location) of the data;
– spread (i.e., the scale) of the data;
– skewness of the data;
– presence of outliers; and
– presence of multiple modes in the data.


Bihistogram
■ The bihistogram is an EDA tool for assessing whether a
before-versus-after engineering modification has caused a
change in
– location;
– variation; or
– distribution.
■ It is a graphical alternative to the two-sample t-test. The
bihistogram can be more powerful than the t-test in that all
of the distributional features (location, scale, skewness,
outliers) are evident on a single plot.

■ From the bihistogram, we can see that batch 1 is centered at a ceramic
strength value of approximately 725 while batch 2 is centered at a ceramic
strength value of approximately 625. That indicates that these batches are
displaced by about 100 strength units. Thus the batch factor has a
significant effect on the location (typical value) for strength and hence
batch is said to be "significant" or to "have an effect".


Box Plots

■ A boxplot is used to see the distribution of a single data set or, when several boxplots are
plotted side by side, to compare several data sets.
■ A boxplot consists of rectangles and lines and shows the median of the data, the upper and
lower quartiles, and any data points that possibly are outside values.
■ The boxplot is a popular EDA tool because of its visual nature and its ability to show you
information quickly.
■ Boxplots provide a schematic graphical summary of important features of a distribution,
including:
– the center
– the spread of the middle of the data (interquartile range)
– the behavior of the tails
– outliers
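
The quantities a boxplot draws can be computed directly. This is a minimal sketch under two assumptions that are not stated in the slides: quartiles come from Python's `statistics.quantiles` with its default method (other software may give slightly different quartiles), and outside values are flagged with the common 1.5 × IQR fence convention. The function name `boxplot_summary` and the data are hypothetical.

```python
from statistics import quantiles

def boxplot_summary(data, k=1.5):
    """Return quartiles, IQR, fences, and the values outside the fences."""
    q1, med, q3 = quantiles(data, n=4)   # default "exclusive" method
    iqr = q3 - q1                        # spread of the middle half of the data
    lower, upper = q1 - k * iqr, q3 + k * iqr   # assumed 1.5 x IQR fences
    outside = [x for x in data if x < lower or x > upper]
    return {"q1": q1, "median": med, "q3": q3, "iqr": iqr,
            "fences": (lower, upper), "outside": outside}

values = [2, 4, 4, 5, 6, 7, 8, 9, 30]    # hypothetical data
summary = boxplot_summary(values)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Here the single value 30 falls above the upper fence, so a boxplot would draw it as an individual point beyond the whisker.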

What measurements can a boxplot show?

■ When you plot several boxplots side by side, you can compare the location
and variation of several data sets.
■ Although the medians are all roughly the same, the spread of each data
set is different.
■ The first boxplot shows data that appears to be distributed evenly. The
median is in the middle of the rectangle, and the whiskers are about the
same length. The plot contains no outside values.
■ The median of the second plot from the left appears to be slightly off-
center. The number of extreme values is a point of concern because it
suggests that the data vary widely.
■ The third boxplot shows data that has less variation and spread than the
other plots.
■ The fourth boxplot shows data that is significantly upwardly-skewed. The
median of this plot is closer to the top of the rectangle than to the
bottom, and the upper whisker is longer than the bottom one.
■ All the boxplots have approximately the same median, and the two
boxplots on the left have approximately the same variation in the data.


Boxplot and Stem-and-Leaf

■ The figure shows a symmetrical boxplot and its corresponding stem-and-leaf
display.
■ The boxplot shows that 50% of the data set lies between 2.05
and 2.15. The stem-and-leaf display supports this conclusion.
■ More importantly, the symmetric boxplot suggests that the data
set has a bell-shaped distribution, which the stem-and-leaf
display also supports.
■ By looking at the stem-and-leaf display, you can immediately
determine that the potential outlier at the bottom of the boxplot
has a value of 1.75.

Boxplot and Stem-and-Leaf

■ The boxplot is symmetrical and evenly spaced, suggesting that
the data set has a bell-shaped distribution.
■ The stem-and-leaf display clearly tells another story.
■ It shows that the data is not normally distributed.
■ In fact, the stem-and-leaf display shows two concentrations of
data points, one at each end of the range.


Boxplot and Stem-and-Leaf

■ Data sets with flat distributions appear to be normally
distributed on the boxplot, as in the figure above.
■ Once again, the stem-and-leaf display corrects this
assumption by showing that all the stems have roughly the
same number of leaves.

ScatterPlot


ScatterPlot Matrix

Visualizing Categorical Variable and Continuous Variable
■ DotPlot
■ BoxPlot


Measurement in Statistics
Measures of central tendency
■ Mean
■ Sum
■ Trimmed mean (TrMean)
■ Median

Measures of dispersion
■ Standard deviation (StDev)
■ Variance
■ Coefficient of variation (CoefVar)
■ Range
■ Interquartile range (IQR)

Measures of position
■ First quartile (Q1)
■ Third quartile (Q3)

Distribution shape
■ Skewness
■ Kurtosis


Measures of Central Tendency

■ Mode
– Used if data are measured along a nominal scale
■ Median
– Used if data are measured along an ordinal scale
– Used if interval data do not meet requirements for using the mean
■ Mean
– Used if data are measured along an interval or ratio scale
– Most sensitive measure of center
– Used if scores are normally distributed

Mean or Median?

■ The mean is a good measure of the center of a symmetric
distribution.
■ The median is a resistant measure and should be used for
skewed distributions. Its value is only slightly affected by the
presence of extreme observations, no matter how large
these observations are.
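
The resistance of the median can be checked with a short sketch. The numbers are hypothetical; one value is replaced by an extreme observation and both measures are recomputed.

```python
from statistics import mean, median

values = [30, 32, 35, 38, 40]            # hypothetical, roughly symmetric data
with_extreme = values[:-1] + [400]       # replace the largest value by an extreme one

print("mean:  ", mean(values), "->", mean(with_extreme))
print("median:", median(values), "->", median(with_extreme))
```

The mean moves from 35 to 107, while the median stays at 35 no matter how large the extreme observation is made.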


Descriptive measures for skewed distributions
If the histogram of the data is skewed, use the
following descriptive statistics:

Min, Q1, Median, Q3, Max

Descriptive measures for symmetric distributions

Use the average to measure the center and the standard deviation to measure the spread.


Mean versus Median


1. The mean and median of a symmetric distribution are close together.
2. In skewed distributions, the mean is farther out in the long tail than is the median. The mean is more sensitive to extreme values.

Figure: three panels (symmetric distribution, right-skewed distribution, left-skewed distribution), each marking the mean and the median; the median splits each distribution 50% / 50%.

Measures of Spread

■ Range
– Subtract the lowest from the highest score in a distribution of scores
– Simplest and least informative measure of spread
– Very sensitive to extreme scores
■ Semi-Interquartile Range
– Less sensitive than the range to extreme scores
– Used when you want a simple, rough estimate of spread
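
A minimal sketch of both measures follows, assuming the usual definition of the semi-interquartile range as half the distance between the first and third quartiles. The helper names and the data are hypothetical, and the quartiles use the default method of `statistics.quantiles`, which may differ slightly from other software.

```python
from statistics import quantiles

def value_range(data):
    """Highest score minus lowest score."""
    return max(data) - min(data)

def semi_interquartile_range(data):
    """Half the distance between Q1 and Q3 (assumed definition)."""
    q1, _, q3 = quantiles(data, n=4)
    return (q3 - q1) / 2

scores = [2, 4, 4, 5, 6, 7, 8, 9]        # hypothetical scores
with_extreme = scores + [30]             # same scores plus one extreme score

print("range:   ", value_range(scores), "->", value_range(with_extreme))
print("semi-IQR:", semi_interquartile_range(scores), "->",
      semi_interquartile_range(with_extreme))
```

Adding one extreme score quadruples the range (7 to 28) but changes the semi-interquartile range only slightly (1.875 to 2.25).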


■ Variance
– Average squared deviation of scores from the mean
■ Standard Deviation
– Square root of the variance
– Most widely used measure of spread
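
The two definitions above translate directly into code. This sketch divides by the number of scores, matching "average squared deviation"; note that sample formulas, such as Python's `statistics.variance`, divide by n − 1 instead. The function name and data are hypothetical.

```python
from math import sqrt
from statistics import pvariance, variance

def population_variance(data):
    """Average squared deviation of the scores from their mean."""
    m = sum(data) / len(data)
    return sum((x - m) ** 2 for x in data) / len(data)

scores = [2, 4, 4, 4, 5, 5, 7, 9]        # hypothetical scores, mean 5
var = population_variance(scores)        # 32 / 8
sd = sqrt(var)                           # standard deviation = sqrt(variance)

print("variance (divide by n):    ", var)
print("standard deviation:        ", sd)
print("variance (divide by n - 1):", variance(scores))
```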

Interpreting the SD
For many lists of observations, especially if their histogram is
bell-shaped:
1. Roughly 68% of the observations in the list lie within 1
standard deviation of the average
2. Roughly 95% of the observations lie within 2 standard deviations of
the average
Figure: bell-shaped histogram with the axis marked Ave-2s.d., Ave-s.d., Average, Ave+s.d., Ave+2s.d., and the 68% and 95% regions indicated.

Standard Deviation and Variance

■ Standard deviation and variance are sensitive to
outliers, so other measures are often used:

– Mean Absolute Deviation [Han]
– Absolute Average Deviation [Tan]
– Median Absolute Deviation
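
A minimal sketch of two of these measures, assuming common definitions: deviations are taken from the mean for the mean absolute deviation and from the median for the median absolute deviation. The cited texts may define them slightly differently. The function names and data are hypothetical.

```python
from statistics import mean, median, pstdev

def mean_absolute_deviation(data):
    """Average absolute deviation from the mean (assumed definition)."""
    m = mean(data)
    return sum(abs(x - m) for x in data) / len(data)

def median_absolute_deviation(data):
    """Median absolute deviation from the median (assumed definition)."""
    med = median(data)
    return median([abs(x - med) for x in data])

values = [1, 2, 3, 4, 100]               # hypothetical data with one outlier

print("standard deviation:       ", pstdev(values))
print("mean absolute deviation:  ", mean_absolute_deviation(values))
print("median absolute deviation:", median_absolute_deviation(values))
```

On this data the outlier inflates the standard deviation the most, the mean absolute deviation less, and the median absolute deviation hardly at all.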
