Chapter 2 2 2 Data Mining

Slide 1
2.2 Descriptive data summarization
Slide 2
Mining Data Descriptive Characteristics
Motivation
To better understand the data: central tendency, variation and spread
Data dispersion characteristics
  median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
  Data dispersion: analyzed with multiple granularities of precision
  Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
  Folding measures into numerical dimensions
  Boxplot or quantile analysis on the transformed cube

Slide 3
Measuring the Central Tendency: Mean
The arithmetic mean, often called simply the mean or average when the context is clear, is the sum of all values divided by the number of values.
The weighted mean is similar to an arithmetic mean (the most common type of average), where instead of each of the data points contributing equally to the final average, some data points contribute more than others.
Trimmed mean: A method of averaging that removes a small percentage of the largest and smallest values before calculating the mean. After removing the specified observations, the trimmed mean is found using an arithmetic averaging formula.
Slide 4
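Mean (algebraic measure) (sample vs. population):
Arithmetic mean:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$
where x_1, x_2, ..., x_N is a set of N values or observations (n is the size of a sample, N the size of the population).
Weighted arithmetic mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
Trimmed mean: chopping extreme values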
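The three means described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the `proportion` parameter, and the sample values are assumptions made for the example.

```python
def arithmetic_mean(values):
    """Sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Each value x_i contributes in proportion to its weight w_i."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, proportion):
    """Drop `proportion` of the observations from EACH end, then average.

    `proportion` is a hypothetical parameter name; 0.1 removes the lowest
    10% and the highest 10% of the sorted values.
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # observations removed per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical values with one large observation at the top end.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(arithmetic_mean(data))          # pulled upward by the value 110
print(trimmed_mean(data, 0.1))        # the values 30 and 110 are removed first
print(weighted_mean([80, 90], [1, 3]))
```

The trimmed mean is lower than the plain mean here because the single large value no longer takes part in the average.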
Slide 5
Median
The median of a finite list of numbers can be found by arranging all the observations from lowest value to highest value and picking the middle one. The median is a holistic measure: it must be computed on the data set as a whole and cannot be obtained by combining summaries computed on separate subsets.
Middle value if odd number of values, or average of the middle two values otherwise
Estimated by interpolation (for grouped data): interpolation is a method of constructing new data points within the range of a discrete set of known data points.
$$\text{median} = L_1 + \left(\frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$$
where L_1 is the lower boundary of the median interval, N is the number of values in the entire data set, (\sum freq)_l is the sum of the frequencies of all of the intervals that are lower than the median interval, freq_median is the frequency of the median interval, and width is the width of the median interval.
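Both ways of obtaining the median can be sketched as follows. The `(lower, upper, frequency)` tuple layout for grouped data and the example intervals are assumptions made for this illustration.

```python
def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def grouped_median(intervals):
    """Interpolated median for grouped data.

    `intervals` is a list of (lower, upper, frequency) tuples sorted by
    lower boundary (a hypothetical layout chosen for this sketch).
    """
    total = sum(freq for _, _, freq in intervals)  # N
    if total == 0:
        raise ValueError("no observations")
    half = total / 2
    below = 0  # sum of frequencies of intervals below the median interval
    for lower, upper, freq in intervals:
        if below + freq >= half:
            # L1 + ((N/2 - sum of lower frequencies) / freq_median) * width
            return lower + (half - below) / freq * (upper - lower)
        below += freq


print(median([3, 1, 2]))      # odd count: the middle value
print(median([4, 1, 3, 2]))   # even count: average of the two middle values
print(grouped_median([(0, 10, 5), (10, 20, 10), (20, 30, 5)]))
```

In the grouped example N = 20, so the median falls in the interval 10 to 20, which has 5 observations below it and a frequency of 10.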
Slide 6
Mode
The mode is the value that occurs most frequently in a data set or a probability distribution.
A data set can be unimodal, bimodal or trimodal (one, two or three modes).
Empirical formula for unimodal, moderately skewed data:
$$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$$
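A short sketch of finding the mode or modes, together with the empirical relation rearranged to estimate the mode as 3 × median − 2 × mean. The function names are assumptions for the illustration, and the estimate is only a rough guide for unimodal, moderately skewed data.

```python
from collections import Counter


def modes(values):
    """Return every value that attains the highest frequency, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def estimate_mode(mean, median):
    """Rearranged empirical relation: mean - mode = 3 * (mean - median)."""
    return 3 * median - 2 * mean


print(modes([1, 2, 2, 3, 3, 4]))   # two values tie, so the data are bimodal
print(modes([5, 5, 6, 7]))         # a single mode
print(estimate_mode(mean=58, median=54))
```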
Slide 7
Symmetric Data
Symmetry is implied when data values are distributed in the same way above and below the middle of the sample.
Symmetrical data sets:
are easily interpreted;
allow a balanced attitude to outliers, that is, those above and below the middle value (median) can be considered by the same criteria;
allow comparisons of spread or dispersion with similar data sets.
Many standard statistical techniques are appropriate only for a symmetric distributional form. For this reason, attempts are often made to transform skewed data so that they become roughly symmetric.
Slide 8
Skewed Data
Skewness is defined as asymmetry in the distribution of the sample data values. Values on one side of the distribution tend to be further from the 'middle' than values on the other side.
For skewed data, the usual measures of location give different values; for example, mode < median < mean indicates positive (or right) skewness.
Positive (or right) skewness is more common than negative (or left) skewness.
If there is evidence of skewness in the data, we can apply transformations, for example, taking logarithms of positive skew data.
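The effect of a logarithmic transformation on positively skewed data can be illustrated as follows. This sketch assumes the moment-based skewness coefficient m3 / m2^1.5, which is one common definition and is not stated in the slides; the data values are hypothetical.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Hypothetical positive values with a long right tail.
raw = [20, 25, 30, 35, 40, 60, 120, 400]
logged = [math.log(x) for x in raw]

print(skewness(raw))      # clearly positive: right-skewed
print(skewness(logged))   # still positive, but noticeably smaller
```

The logarithm compresses large values more than small ones, which pulls the right tail in; it applies only to strictly positive data.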
Slide 9
Symmetric vs. Skewed Data
Median, mean and mode of symmetric, positively and negatively skewed data
2.2.2 Measuring the Dispersion of Data
The degree to which numerical data tend to spread is called the dispersion, or variance, of the data.
The most common measures of data dispersion are range, the five-number summary (based on quartiles), the interquartile range, and the standard deviation.
Let x_1, x_2, ..., x_N be a set of observations for some attribute. The range of the set is the difference between the largest (max()) and smallest (min()) values. A percentile is the value of a variable below which a certain percent of observations fall; the most commonly used percentiles other than the median are quartiles.
The kth percentile of a set of data in numerical order is the value xi having the property that k percent of the data entries lie at or below xi. The median is the 50th percentile. The first quartile, denoted by Q1, is the 25th percentile; the third quartile, denoted by Q3, is the 75th percentile. The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data.
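The definition above, where the kth percentile is a data value x_i with k percent of the entries at or below it, corresponds to the nearest-rank method. The sketch below assumes that method; other conventions interpolate between neighbouring values and can give slightly different quartiles. Function names and data are hypothetical.

```python
import math


def percentile(values, k):
    """Nearest-rank kth percentile: the smallest data value with at least
    k percent of the entries at or below it."""
    if not 0 < k <= 100:
        raise ValueError("k must be in (0, 100]")
    ordered = sorted(values)
    rank = max(1, math.ceil(k / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]


def value_range(values):
    """Difference between the largest and smallest values."""
    return max(values) - min(values)


data = [2, 4, 4, 5, 7, 8, 9, 10, 12]
q1 = percentile(data, 25)
med = percentile(data, 50)
q3 = percentile(data, 75)
print(q1, med, q3)          # first quartile, median, third quartile
print(q3 - q1)              # spread of the middle half of the data
print(value_range(data))
```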
Measuring the Dispersion of Data
Quartiles, outliers and boxplots
  Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  Inter-quartile range: IQR = Q3 − Q1
  Five-number summary: min, Q1, M, Q3, max
  Boxplot: ends of the box are the quartiles, median is marked, whiskers are added, and outliers are plotted individually
  Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
Variance and standard deviation (sample: s, population: σ)
  Variance: (algebraic, scalable computation)
  $$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$
  $$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$
  Standard deviation s (or σ) is the square root of variance s^2 (or σ^2)
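The variance can be computed either from squared deviations about the mean or from running sums of x and x squared, which needs only one pass over the data and is what makes the computation scalable. A minimal sketch of both forms follows; function names and data are assumptions, and the one-pass form can lose precision on floating-point data with a large mean and small spread.

```python
import math


def sample_variance(values):
    """Definition form: squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_one_pass(values):
    """Computational form: only n, sum(x) and sum(x^2) are accumulated."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)


def population_variance(values):
    """Population form: mean of the squares minus the square of the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu * mu


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_variance(data), sample_variance_one_pass(data))
print(population_variance(data), math.sqrt(population_variance(data)))
```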
Properties of Normal Distribution Curve
The normal (distribution) curve (μ: mean, σ: standard deviation)
From μ − σ to μ + σ: contains about 68% of the measurements
From μ − 2σ to μ + 2σ: contains about 95% of it
From μ − 3σ to μ + 3σ: contains about 99.7% of it
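The three coverage figures can be checked numerically. For a normal distribution, the fraction of values within k standard deviations of the mean is erf(k / √2); the function name below is an assumption for the sketch.

```python
import math


def normal_coverage(k):
    """Fraction of a normal distribution lying within k standard deviations
    of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(normal_coverage(k), 4))
```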
Boxplot Analysis
Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum
Boxplot
Data is represented with a box
The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
The median is marked by a line within the box
Whiskers: two lines outside the box extend to Minimum and Maximum
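The numbers a boxplot is drawn from can be computed as below. This sketch assumes the interpolating quartile convention of Python's `statistics.quantiles` with `method="inclusive"` and the common 1.5 × IQR rule for flagging outliers; the function names and data values are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    q1, med, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, med, q3, ordered[-1]


def find_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]


data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(five_number_summary(data))
print(find_outliers(data))   # only the largest value lies beyond Q3 + 1.5 * IQR
```

When outliers are plotted individually, the whiskers are usually drawn to the most extreme values that are not flagged rather than to the overall minimum and maximum.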
Visualization of Data Dispersion: Boxplot Analysis
Data Mining: Concepts and Techniques, July 17, 2011
Histogram Analysis
Graph displays of basic statistical class descriptions
Frequency histograms
A univariate graphical method
Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data
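The counts behind a frequency histogram can be sketched as follows, assuming equal-width classes; the number of bins and the data are hypothetical choices for the example.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of `num_bins` equal-width classes.

    Returns (edges, counts); the last class includes the maximum value.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; cannot form classes")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in values:
        index = min(int((x - lo) / width), num_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


data = [1, 2, 2, 3, 5, 6, 8, 9, 9, 10]
edges, counts = histogram_counts(data, 3)
for left, right, count in zip(edges, edges[1:], counts):
    # One text "rectangle" per class, its length equal to the class count.
    print(f"{left:5.1f} to {right:5.1f} | {'#' * count}")
```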
Quantile Plot

Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
Plots quantile information
For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i% of the data are below or equal to the value x_i
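The points of a quantile plot can be generated as below. The slides do not say how f_i is computed; this sketch assumes the common convention f_i = (i − 0.5) / N for the i-th smallest value.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / N (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


points = quantile_plot_points([30, 10, 40, 20])
for f, x in points:
    print(f"f = {f:.3f}  x = {x}")
```

Plotting x against f then shows every observation, so unusually high or low values stand out at the ends of the curve.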
Quantile-Quantile (Q-Q) Plot

Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Allows the user to view whether there is a shift in going from one distribution to another
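A q-q plot pairs matching quantiles of two samples. The sketch below uses `statistics.quantiles` with the interpolating `"inclusive"` method so that samples of different sizes can be compared; the function name, the default number of quantiles, and the data are assumptions for the example.

```python
import statistics


def qq_pairs(sample_a, sample_b, num_quantiles=4):
    """Pair the corresponding quantiles of two univariate samples."""
    qa = statistics.quantiles(sample_a, n=num_quantiles, method="inclusive")
    qb = statistics.quantiles(sample_b, n=num_quantiles, method="inclusive")
    return list(zip(qa, qb))


branch_1 = [2, 4, 4, 5, 7, 8, 9, 10, 12]
branch_2 = [x + 10 for x in branch_1]   # same shape, shifted upward by 10

pairs = qq_pairs(branch_1, branch_2)
print(pairs)
```

If the two distributions were identical, every point would lie on the line y = x; here each point sits exactly 10 units above that line, which is the shift the plot is meant to reveal.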
Scatter plot

Provides a first look at bivariate data to see clusters of points, outliers, etc.
Each pair of values is treated as a pair of coordinates and plotted as points in the plane
Loess Curve

Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
Loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression
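A simplified loess smoother can be sketched as follows, to show what the two parameters control. Here `span` plays the role of the smoothing parameter (the fraction of points used in each local fit) and the polynomial degree is fixed at 1. The tricube weighting and all names are assumptions for this illustration, and library implementations include refinements (such as robustness iterations) that are omitted here.

```python
import math


def loess_fit(xs, ys, span=0.5):
    """Smoothed y at each x via locally weighted linear regression (degree 1)."""
    n = len(xs)
    k = min(n, max(3, math.ceil(span * n)))  # points used in each local fit
    smoothed = []
    for x0 in xs:
        # The k points whose x-values are closest to x0.
        nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
        d_max = max(abs(xs[i] - x0) for i in nearest) or 1.0
        # Tricube weights: close to 1 near x0, falling to 0 at distance d_max.
        w = [(1 - (abs(xs[i] - x0) / d_max) ** 3) ** 3 for i in nearest]
        sw = sum(w)
        mx = sum(wi * xs[i] for wi, i in zip(w, nearest)) / sw
        my = sum(wi * ys[i] for wi, i in zip(w, nearest)) / sw
        sxx = sum(wi * (xs[i] - mx) ** 2 for wi, i in zip(w, nearest))
        sxy = sum(wi * (xs[i] - mx) * (ys[i] - my) for wi, i in zip(w, nearest))
        # Fall back to the weighted mean if the local x-values do not vary.
        slope = sxy / sxx if sxx > 1e-12 else 0.0
        smoothed.append(my + slope * (x0 - mx))
    return smoothed


xs = list(range(10))
ys = [2 * x + 1 for x in xs]       # exactly linear data
print(loess_fit(xs, ys, span=0.5))  # a local linear fit reproduces the line
```

A larger `span` uses more points per local fit and gives a smoother curve; a smaller one follows the data more closely.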
Graphic Displays of Basic Statistical Descriptions

Histogram: (shown before)
Boxplot: (covered before)
Quantile plot: each value x_i is paired with f_i indicating that approximately 100 f_i% of data are ≤ x_i
Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane
Loess (local regression) curve: add a smooth curve to a scatter plot to provide better perception of the pattern of dependence
