
Slide 1

2.2 Descriptive data summarization

Slide 2

Mining Data Descriptive Characteristics

Motivation

To better understand the data: central tendency, variation, and spread.

Data dispersion characteristics

Median, max, min, quantiles, outliers, variance, etc.

Numerical dimensions correspond to sorted intervals

Data dispersion is analyzed with multiple granularities of precision. Boxplot or quantile analysis on sorted intervals.

Dispersion analysis on computed measures

Folding measures into numerical dimensions. Boxplot or quantile analysis on the transformed cube.


Slide 3

Measuring the Central Tendency

Mean

The arithmetic mean, often referred to simply as the mean or average when the context is clear, is the sum of the values divided by the number of values.

The weighted mean is similar to the arithmetic mean (the most common type of average), except that instead of each data point contributing equally to the final average, some data points contribute more than others.

Trimmed mean: A method of averaging that removes a small percentage of the largest and smallest values before calculating the mean. After removing the specified observations, the trimmed mean is found using an arithmetic averaging formula.

Slide 4

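Mean (algebraic measure) (sample vs. population):

Arithmetic mean:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \quad \text{(sample)}, \qquad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i \quad \text{(population)}$$

where $x_1, x_2, \ldots, x_N$ is a set of N values or observations.

Weighted arithmetic mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

Trimmed mean: chopping extreme values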
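The three means above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names `weighted_mean` and `trimmed_mean`, the `proportion` parameter, and the sample values are all assumptions chosen for the example.

```python
from statistics import fmean


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights w_i."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


def trimmed_mean(values, proportion):
    """Drop the lowest and highest `proportion` of the values, then average the rest."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each end
    return fmean(ordered[k:len(ordered) - k])


# Made-up values with one large observation at the top end.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

plain = fmean(values)                       # every value counts equally
weighted = weighted_mean([80, 90], [1, 3])  # second value counts three times as much
trimmed = trimmed_mean(values, 0.10)        # drops 30 and 110 before averaging

print(plain, weighted, trimmed)
```

With these values the trimmed mean is lower than the plain mean because the single large value (110) no longer pulls the average upward.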

Slide 5

Median

The median of a finite list of numbers can be found by arranging all the observations from lowest value to highest value and picking the middle one. It is a holistic measure (emphasizing the importance of the whole and the interdependence of its parts).

Middle value if there is an odd number of values, or the average of the middle two values otherwise.

Estimated by interpolation (for grouped data): interpolation is a method of constructing new data points within the range of a discrete set of known data points.

$$\text{median} = L_1 + \left(\frac{N/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$$

where $L_1$ is the lower boundary of the median interval, $N$ is the number of values in the entire data set, $(\sum \text{freq})_l$ is the sum of the frequencies of all of the intervals that are lower than the median interval, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the width of the median interval.
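The interpolation formula for grouped data can be sketched as follows. The function name `grouped_median`, the `(lower, upper, frequency)` tuple layout, and the sample intervals are assumptions made for illustration only.

```python
def grouped_median(intervals):
    """Approximate the median of grouped data by linear interpolation.

    `intervals` is a list of (lower_boundary, upper_boundary, frequency)
    tuples sorted by lower boundary.
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("no observations")
    half = n / 2
    cumulative = 0  # sum of the frequencies of all intervals below the current one
    for lower, upper, freq in intervals:
        if freq > 0 and cumulative + freq >= half:
            width = upper - lower
            # L1 + ((N/2 - (sum freq)_l) / freq_median) * width
            return lower + (half - cumulative) / freq * width
        cumulative += freq
    raise ValueError("median interval not found")


# Made-up grouped data: 20 observations in three intervals of width 10.
groups = [(0, 10, 5), (10, 20, 10), (20, 30, 5)]
approx_median = grouped_median(groups)
print(approx_median)
```

Here N/2 = 10 and the median interval is 10 to 20 with 5 observations below it, so the estimate is 10 + ((10 - 5) / 10) × 10 = 15.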

Slide 6

Mode

The mode is the value that occurs most frequently in a data set or a probability distribution.

A data set may be unimodal, bimodal, or trimodal (one, two, or three modes).

Empirical formula (for moderately skewed unimodal distributions):

$$\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$$
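As a rough illustration of the mode and of the empirical relation, the sketch below uses Python's standard `statistics` module. The helper name `estimated_mode` and the sample values are assumptions; the relation is only an approximation, so the estimate need not equal the true mode.

```python
from statistics import fmean, median, multimode


def estimated_mode(values):
    """Rearranged empirical relation: mode is roughly mean - 3 * (mean - median)."""
    m = fmean(values)
    return m - 3 * (m - median(values))


skewed = [2, 3, 3, 4, 5, 7, 11]   # made-up, right-skewed, one mode
two_peaks = [1, 2, 2, 3, 3, 4]    # made-up, two values tie for most frequent

true_modes = multimode(skewed)      # list of the most frequent values
bimodal_modes = multimode(two_peaks)
approx = estimated_mode(skewed)

print(true_modes, bimodal_modes, approx)
```

For `skewed` the mean is 5 and the median is 4, so the relation gives 5 - 3 × (5 - 4) = 2, while the actual mode is 3. On such a small sample the relation is only a coarse guide.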

Slide 7

Symmetric Data

Symmetry is implied when data values are distributed in the same way above and below the middle of the sample.

Symmetrical data sets:

are easily interpreted;

allow a balanced attitude to outliers, that is, those above and below the middle value (median) can be considered by the same criteria;

allow comparisons of spread or dispersion with similar data sets.

Many standard statistical techniques are appropriate only for a symmetric distributional form. For this reason, attempts are often made to transform skewed data so that they become roughly symmetric.

Slide 8

Skewed Data

Skewness is defined as asymmetry in the distribution of the sample data values. Values on one side of the distribution tend to be further from the 'middle' than values on the other side.

For skewed data, the usual measures of location will give different values; for example, mode < median < mean would indicate positive (or right) skewness.

Positive (or right) skewness is more common than negative (or left) skewness.

If there is evidence of skewness in the data, we can apply transformations, for example, taking logarithms of positive skew data.
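The effect of a logarithmic transformation on positively skewed data can be checked numerically. The sketch below measures skewness with the moment coefficient $m_3 / m_2^{3/2}$, which is one common definition and an assumption here, since the slides do not name a specific formula. The sample values are made up.

```python
from math import log
from statistics import fmean


def skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5 (positive means right-skewed)."""
    mean = fmean(values)
    n = len(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


raw = [1, 2, 3, 4, 5, 10, 20, 50, 100]  # made-up, strongly right-skewed
logged = [log(x) for x in raw]          # requires strictly positive values

print(skewness(raw), skewness(logged))
```

For this sample the coefficient drops noticeably after taking logarithms, which matches the advice above; the transform only applies to strictly positive values.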

Slide 9

Symmetric vs. Skewed Data

Median, mean and mode of symmetric, positively and negatively skewed data

2.2.2 Measuring the Dispersion of Data

The degree to which numerical data tend to spread is called the dispersion, or variance, of the data.

The most common measures of data dispersion are range, the five-number summary (based on quartiles), the interquartile range, and the standard deviation.

Let $x_1, x_2, \ldots, x_N$ be a set of observations for some attribute. The range of the set is the difference between the largest (max()) and smallest (min()) values.

The most commonly used percentiles other than the median are quartiles. (A percentile is the value of a variable below which a certain percent of observations fall.)

The kth percentile of a set of data in numerical order is the value xi having the property that k percent of the data entries lie at or below xi. The median is the 50th percentile. The first quartile, denoted by Q1, is the 25th percentile; the third quartile, denoted by Q3, is the 75th percentile. The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data.
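The percentile definition above ("k percent of the data entries lie at or below $x_i$") can be read as the nearest-rank method, sketched below. Other percentile conventions interpolate between values and can give slightly different quartiles, so treat this as one possible reading. The function name `percentile` and the data are illustrative assumptions.

```python
from math import ceil


def percentile(values, k):
    """Nearest-rank kth percentile: the smallest value with at least k% of the data at or below it."""
    if not 0 < k <= 100:
        raise ValueError("k must be in (0, 100]")
    ordered = sorted(values)
    rank = ceil(k / 100 * len(ordered))  # 1-based position in the sorted data
    return ordered[rank - 1]


data = [15, 20, 35, 40, 50]  # made-up observations

q1 = percentile(data, 25)
med = percentile(data, 50)
q3 = percentile(data, 75)
iqr = q3 - q1                          # spread of the middle half of the data
value_range = max(data) - min(data)    # largest minus smallest value

print(q1, med, q3, iqr, value_range)
```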

Measuring the Dispersion of Data

Quartiles, outliers and boxplots

Quartiles: $Q_1$ (25th percentile), $Q_3$ (75th percentile)

Inter-quartile range: $\text{IQR} = Q_3 - Q_1$

Five-number summary: min, $Q_1$, M, $Q_3$, max

Boxplot: the ends of the box are the quartiles, the median is marked, whiskers are added, and outliers are plotted individually

Outlier: usually, a value more than $1.5 \times \text{IQR}$ above $Q_3$ or below $Q_1$

Variance and standard deviation (sample: $s$, population: $\sigma$)

Variance (algebraic, scalable computation):

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$

Standard deviation $s$ (or $\sigma$) is the square root of variance $s^2$ (or $\sigma^2$)
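The two forms of each variance formula can be compared directly. The sketch below is illustrative: the function names and data are assumptions. The second form of the sample variance needs only running sums, so it can be computed in a single pass over the data; it can lose precision when the mean is large relative to the spread.

```python
from math import isclose, sqrt


def sample_variance_two_pass(xs):
    """s^2 = 1/(n-1) * sum((x_i - mean)^2): first pass for the mean, second for deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_variance_one_pass(xs):
    """s^2 = 1/(n-1) * [sum(x_i^2) - (1/n) * (sum x_i)^2], using running sums only."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in xs:
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)


def population_variance(xs):
    """sigma^2 = (1/N) * sum(x_i^2) - mu^2."""
    n = len(xs)
    mu = sum(xs) / n
    return sum(x * x for x in xs) / n - mu * mu


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up observations with mean 5

s2_a = sample_variance_two_pass(data)
s2_b = sample_variance_one_pass(data)
sigma2 = population_variance(data)
sigma = sqrt(sigma2)  # standard deviation is the square root of the variance

print(s2_a, s2_b, sigma2, sigma)
```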

Properties of Normal Distribution Curve

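The normal (distribution) curve:

From $\mu - \sigma$ to $\mu + \sigma$: contains about 68% of the measurements ($\mu$: mean, $\sigma$: standard deviation)

From $\mu - 2\sigma$ to $\mu + 2\sigma$: contains about 95% of the measurements

From $\mu - 3\sigma$ to $\mu + 3\sigma$: contains about 99.7% of the measurements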
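These percentages can be verified with the error function, since for a normal variable the probability of falling within $k$ standard deviations of the mean is $\operatorname{erf}(k/\sqrt{2})$. The function name `fraction_within` is an assumption made for this sketch.

```python
from math import erf, sqrt


def fraction_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return erf(k / sqrt(2))


for k in (1, 2, 3):
    print(k, round(fraction_within(k) * 100, 1))
```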


Boxplot Analysis

Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum

Boxplot:

Data is represented with a box.

The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR.

The median is marked by a line within the box.

Whiskers: two lines outside the box extend to Minimum and Maximum.
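The numbers needed to draw a boxplot can be computed as in the sketch below. Two assumptions go beyond the slide text: quartiles come from `statistics.quantiles` with the "inclusive" method (other conventions give slightly different values), and the whiskers stop at the most extreme values inside the 1.5 × IQR fences so that outliers can be plotted individually, a common variant of the plain minimum/maximum whiskers. The name `boxplot_stats` and the data are illustrative.

```python
from statistics import quantiles


def boxplot_stats(values):
    """Return the quantities needed to draw a boxplot with individually plotted outliers."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    return {
        "min": ordered[0],
        "q1": q1,
        "median": med,
        "q3": q3,
        "max": ordered[-1],
        "iqr": iqr,
        "whisker_low": inside[0],    # most extreme value that is not an outlier
        "whisker_high": inside[-1],
        "outliers": [x for x in ordered if x < low_fence or x > high_fence],
    }


stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 40])  # made-up data with one extreme value
print(stats)
```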


Visualization of Data Dispersion: Boxplot Analysis

Data Mining: Concepts and Techniques, July 17, 2011

Histogram Analysis

Graph displays of basic statistical class descriptions

Frequency histograms:

A univariate graphical method.

Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data.
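The counts behind a frequency histogram can be computed as sketched below, using equal-width classes. The function name `histogram_counts`, its parameters, and the data are assumptions for illustration; the text rendering stands in for the rectangles of a real plot.

```python
def histogram_counts(values, low, high, bins):
    """Count how many values fall into each of `bins` equal-width classes on [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if not low <= v <= high:
            continue  # ignore values outside the plotted range
        index = min(int((v - low) / width), bins - 1)  # the top edge belongs to the last class
        counts[index] += 1
    return counts


data = [1, 2, 2, 3, 5, 7, 8, 9, 10]  # made-up observations
counts = histogram_counts(data, low=0, high=10, bins=5)

# One text "rectangle" per class; its length reflects the class count.
for i, count in enumerate(counts):
    print(f"[{i * 2:>2}, {i * 2 + 2:>2}) " + "#" * count)
```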


Quantile Plot

Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences).

Plots quantile information.

For data $x_i$ sorted in increasing order, $f_i$ indicates that approximately $100 f_i\%$ of the data are below or equal to the value $x_i$.
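The points of a quantile plot can be generated as sketched below. The slides do not say how $f_i$ is computed; the sketch assumes the common choice $f_i = (i - 0.5)/N$ for the $i$-th smallest of $N$ values. The function name `quantile_plot_points` and the data are illustrative.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / N (an assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


points = quantile_plot_points([40, 10, 30, 20])  # made-up observations
for f, x in points:
    print(f, x)
```

Plotting `x` against `f` for every point gives the quantile plot; each observation appears exactly once.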


Quantile-Quantile (Q-Q) Plot


Graphs the quantiles of one univariate distribution against the corresponding quantiles of another.

Allows the user to view whether there is a shift in going from one distribution to another.
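A minimal sketch of the pairing behind a Q-Q plot follows. It assumes `statistics.quantiles` with the "inclusive" method as the quantile definition; the function name `qq_pairs` and both samples are made up, with the second sample constructed as the first shifted by 10.

```python
from statistics import quantiles


def qq_pairs(sample_a, sample_b, n=4):
    """Pair the n-1 cut points of sample_a with the corresponding cut points of sample_b."""
    qa = quantiles(sample_a, n=n, method="inclusive")
    qb = quantiles(sample_b, n=n, method="inclusive")
    return list(zip(qa, qb))


branch_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
branch_2 = [x + 10 for x in branch_1]  # same shape, shifted upward by 10

pairs = qq_pairs(branch_1, branch_2)
print(pairs)
```

Every pair lies 10 units above the line y = x, which is how a constant shift between two distributions shows up in a Q-Q plot.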


Scatter plot

Provides a first look at bivariate data to see clusters of points, outliers, etc.

Each pair of values is treated as a pair of coordinates and plotted as points in the plane.


Loess Curve

Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence.

A loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression.
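To make the two parameters concrete, the sketch below evaluates a simplified local regression at a single point. It is not a full loess implementation: the polynomial degree is fixed at 1 (local linear fits), the smoothing parameter `span` is taken as the fraction of points that define the neighborhood, and tricube weights are assumed. The function name `loess_point` and the data are illustrative.

```python
from math import isclose


def loess_point(xs, ys, x0, span=0.5):
    """Fit a weighted straight line around x0 and return its value at x0."""
    n = len(xs)
    k = max(2, round(span * n))                      # neighborhood size from the smoothing parameter
    h = sorted(abs(x - x0) for x in xs)[k - 1]       # distance to the k-th nearest point
    if h == 0:
        # All neighbors coincide with x0: fall back to the average of their y values.
        same = [y for x, y in zip(xs, ys) if x == x0]
        return sum(same) / len(same)
    # Tricube weights: near points count most, points at distance >= h get weight 0.
    weights = [(1 - min(abs(x - x0) / h, 1) ** 3) ** 3 for x in xs]
    sw = sum(weights)
    mx = sum(w * x for w, x in zip(weights, xs)) / sw
    my = sum(w * y for w, y in zip(weights, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)


xs = list(range(10))
ys = [2 * x + 1 for x in xs]  # made-up, exactly linear data
smoothed = loess_point(xs, ys, 4.5, span=0.5)
print(smoothed)
```

Evaluating `loess_point` over a grid of x values and joining the results gives the smooth curve; a larger `span` averages over more points and produces a smoother curve.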


Graphic Displays of Basic Statistical Descriptions


Histogram: (shown before)

Boxplot: (covered before)

Quantile plot: each value $x_i$ is paired with $f_i$, indicating that approximately $100 f_i\%$ of the data are less than or equal to $x_i$

Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another

Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane

Loess (local regression) curve: adds a smooth curve to a scatter plot to provide better perception of the pattern of dependence
