Slide 2
Motivation
To better understand the data: central tendency, variation, and spread (median, max, min, quantiles, outliers, variance, etc.)
Data dispersion: analyzed with multiple granularities of precision
Boxplot or quantile analysis on sorted intervals
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
Slide 3
The arithmetic mean, often referred to as simply the mean or average when the context is clear, is the sum of the values divided by the number of values.
The weighted mean is similar to the arithmetic mean (the most common type of average), except that instead of each data point contributing equally to the final average, some data points contribute more than others.
Trimmed mean: A method of averaging that removes a small percentage of the largest and smallest values before calculating the mean. After removing the specified observations, the trimmed mean is found using an arithmetic averaging formula.
Slide 4
Arithmetic Mean
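Weighted arithmetic mean, where w_i is the weight attached to value x_i:

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$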
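The three averages described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the sample values are made up for the example, and the trimmed mean here assumes the same proportion is cut from each end.

```python
def arithmetic_mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, proportion):
    """Mean after dropping int(proportion * n) values from each end.

    Assumption: `proportion` is the fraction removed from EACH tail.
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical data with one extreme value.
data = [2, 4, 4, 5, 100]
plain = arithmetic_mean(data)        # pulled upward by the value 100
trimmed = trimmed_mean(data, 0.2)    # drops 2 and 100 before averaging
```

The trimmed mean is much closer to the bulk of the data than the plain mean, which is the reason for trimming in the first place.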
Slide 5
Median
The median of a finite list of numbers can be found by arranging all the observations from lowest value to highest value and picking the middle one. The median is a holistic measure: it has to be computed on the data set as a whole and cannot be obtained by combining the medians of separate subsets.
Middle value if there is an odd number of values, or the average of the middle two values otherwise.
Estimated by interpolation (for grouped data): interpolation is a method of constructing new data points within the range of a discrete set of known data points.

$$\text{median} = L_1 + \left(\frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$$

where L_1 is the lower boundary of the median interval, N is the number of values in the entire data set, (Σ freq)_l is the sum of the frequencies of all of the intervals that are lower than the median interval, freq_median is the frequency of the median interval, and width is the width of the median interval.
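As a rough sketch of the grouped-data interpolation, the function below walks through the intervals until it reaches the one containing the N/2-th value and then interpolates inside it. The `(lower, upper, frequency)` tuple layout and the example intervals are assumptions made for illustration.

```python
def grouped_median(intervals):
    """Approximate the median of grouped data by linear interpolation.

    `intervals` is a list of (lower, upper, frequency) tuples, sorted by
    lower boundary and assumed to be contiguous.
    """
    n = sum(freq for _, _, freq in intervals)
    half = n / 2
    cumulative = 0  # sum of frequencies below the current interval
    for lower, upper, freq in intervals:
        if freq > 0 and cumulative + freq >= half:
            width = upper - lower
            return lower + (half - cumulative) / freq * width
        cumulative += freq
    raise ValueError("no observations")


# Hypothetical grouped data: 5 values in [0, 10), 10 in [10, 20), 5 in [20, 30).
approx = grouped_median([(0, 10, 5), (10, 20, 10), (20, 30, 5)])
```

Here N = 20, the median interval is [10, 20), the frequencies below it sum to 5, and the estimate is 10 + ((10 - 5) / 10) × 10 = 15.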
Slide 6
Mode
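The mode is the value that occurs most frequently in a data set or a probability distribution.
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal: data sets with one, two, or three modes
Empirical formula (for unimodal, moderately skewed data): mean − mode ≈ 3 × (mean − median)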
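Because a data set can have more than one mode, a small helper that returns every most-frequent value is often more useful than one that returns a single value. The sketch below uses `collections.Counter` from the standard library; the function name and sample values are illustrative only.

```python
from collections import Counter


def modes(values):
    """Return all values that share the highest frequency, in sorted order."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


unimodal = modes([1, 2, 2, 3])        # one mode
bimodal = modes([1, 2, 2, 3, 3, 4])   # two modes
```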
Slide 7
Symmetric Data
Symmetry is implied when data values are distributed in the same way above and below the middle of the sample.
Symmetric data are easily interpreted and allow a balanced attitude to outliers: values above and below the middle value (median) can be judged by the same criteria.
Many standard statistical techniques are appropriate only for a symmetric distributional form. For this reason, attempts are often made to transform skewed data so that they become roughly symmetric.
Slide 8
Skewed Data
Skewness is defined as asymmetry in the distribution of the sample data values. Values on one side of the distribution tend to be further from the 'middle' than values on the other side.
For skewed data, the usual measures of location give different values; for example, mode < median < mean indicates positive (or right) skewness.
Positive (or right) skewness is more common than negative (or left) skewness.
If there is evidence of skewness in the data, we can apply transformations, for example, taking logarithms of positively skewed data.
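To make the effect of a log transformation concrete, the sketch below measures skewness before and after taking logarithms. It assumes the moment coefficient of skewness (third central moment divided by the 1.5 power of the second), which is one common definition and not something the slides specify; the data values are invented.

```python
import math


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed data: a long tail of large values.
data = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]
logged = [math.log(x) for x in data]  # valid only for positive values

before = skewness(data)
after = skewness(logged)
```

For this sample `before` is positive and `after` is smaller, so the logged values are closer to symmetric. A value near zero would indicate rough symmetry.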
Slide 9
Median, mean and mode of symmetric, positively and negatively skewed data
The degree to which numerical data tend to spread is called the dispersion, or variance, of the data.
The most common measures of data dispersion are range, the five-number summary (based on quartiles), the interquartile range, and the standard deviation.
Let x_1, x_2, ..., x_N be a set of observations for some attribute. The range of the set is the difference between the largest (max()) and smallest (min()) values. A percentile is the value of a variable below which a certain percent of observations fall; the most commonly used percentiles other than the median are quartiles.
The kth percentile of a set of data in numerical order is the value xi having the property that k percent of the data entries lie at or below xi. The median is the 50th percentile. The first quartile, denoted by Q1, is the 25th percentile; the third quartile, denoted by Q3, is the 75th percentile. The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data.
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 − Q1
Five-number summary: min, Q1, M, Q3, max
Boxplot: ends of the box are the quartiles, median is marked, whiskers are drawn, and outliers are plotted individually
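The five-number summary and the IQR can be computed directly from sorted data. Quartile conventions differ between textbooks and tools; the sketch below assumes linear interpolation between the closest ranks, which is only one of several accepted conventions. Function names and data are illustrative.

```python
def percentile(sorted_values, k):
    """k-th percentile (0 <= k <= 100) by linear interpolation between ranks."""
    if not sorted_values:
        raise ValueError("no data")
    pos = (len(sorted_values) - 1) * k / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    s = sorted(values)
    return (s[0], percentile(s, 25), percentile(s, 50), percentile(s, 75), s[-1])


# Hypothetical data.
summary = five_number_summary([9, 1, 8, 2, 7, 3, 6, 4, 5])
iqr = summary[3] - summary[1]  # Q3 - Q1
```

For the nine values 1 through 9 this gives min = 1, Q1 = 3, median = 5, Q3 = 7, max = 9, and an IQR of 4.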
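Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
Variance: (algebraic, scalable computation)

Population variance:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$

Sample variance:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$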
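The algebraic form of the variance is what makes the computation scalable: only the count, the sum, and the sum of squares need to be accumulated, so the data can be read once (or split into partitions whose running totals are added together). The following is a minimal sketch under that reading; the function name and data are illustrative, and for values with a very large mean relative to their spread this formula can lose precision to cancellation.

```python
def variance_one_pass(values, sample=False):
    """Variance from running totals: count, sum, and sum of squares.

    sample=False gives the population variance (divide by N);
    sample=True gives the sample variance (divide by n - 1).
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:  # a single pass over the data
        n += 1
        total += x
        total_sq += x * x
    if sample:
        if n < 2:
            raise ValueError("sample variance needs at least two values")
        return (total_sq - total * total / n) / (n - 1)
    if n < 1:
        raise ValueError("no data")
    mean = total / n
    return total_sq / n - mean * mean


# Hypothetical data with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
pop_var = variance_one_pass(data)
samp_var = variance_one_pass(data, sample=True)
```

For this data the population variance is 4 and the sample variance is 32/7.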
The normal (distribution) curve (μ: mean, σ: standard deviation)
From μ − σ to μ + σ: contains about 68% of the measurements
From μ − 2σ to μ + 2σ: contains about 95% of the measurements
From μ − 3σ to μ + 3σ: contains about 99.7% of the measurements
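These coverage figures can be checked numerically. For a normal distribution, the probability of falling within k standard deviations of the mean is erf(k / √2); the short sketch below evaluates that with the standard library. The function name is made up for the example.

```python
import math


def coverage_within(k):
    """Probability that a normal variable lies within k standard deviations
    of its mean: erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2))


one_sigma = coverage_within(1)    # about 0.68
two_sigma = coverage_within(2)    # about 0.95
three_sigma = coverage_within(3)  # about 0.997
```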
Boxplot Analysis
Data is represented with a box
The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
The median is marked by a line within the box
Whiskers: two lines outside the box extend to the minimum and maximum
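The numbers behind a boxplot can be computed without drawing anything. The sketch below assumes the common variant in which values more than 1.5 × IQR beyond the quartiles are treated as outliers and the whiskers stop at the most extreme remaining values; when there are no such outliers the whiskers reach the minimum and maximum. It uses `statistics.quantiles` with the "inclusive" method, which is one of several quartile conventions. The function name and data are illustrative.

```python
import statistics


def boxplot_stats(values):
    """Quartiles, whisker ends, and outliers for a boxplot."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value that is not an outlier
        "whisker_high": inside[-1],  # largest value that is not an outlier
        "outliers": outliers,        # plotted individually
    }


# Hypothetical data with one unusually large value.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
```

In this example the value 30 lies above Q3 + 1.5 × IQR, so it is reported as an outlier and the upper whisker ends at 9.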
Histogram Analysis
Quantile Plot
Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
Plots quantile information
For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i % of the data are below or equal to the value x_i
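The points of a quantile plot are simple to generate once the data are sorted. The slides do not say how f_i is computed; the sketch below assumes the common choice f_i = (i − 0.5) / N, and the function name and data are illustrative.

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs for a quantile plot.

    Assumption: f_i = (i - 0.5) / N for the i-th smallest value (i from 1).
    """
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# Hypothetical data; plotting f_i on one axis against x_i on the other
# gives the quantile plot.
points = quantile_plot_points([40, 10, 30, 20])
```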
Quantile-Quantile (Q-Q) Plot
Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Allows the user to view whether there is a shift in going from one distribution to another
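A rough way to build the points of a q-q plot is to compute the same set of quantiles for both samples and pair them up. The sketch below uses `statistics.quantiles` with the "inclusive" method, which is an assumption about the quantile convention; the function name and the two samples are invented.

```python
import statistics


def qq_points(sample_a, sample_b, n_quantiles=4):
    """Pair up corresponding quantiles of two samples.

    With n_quantiles=4 the pairs are (Q1, Q1), (median, median), (Q3, Q3).
    """
    qa = statistics.quantiles(sample_a, n=n_quantiles, method="inclusive")
    qb = statistics.quantiles(sample_b, n=n_quantiles, method="inclusive")
    return list(zip(qa, qb))


# Hypothetical samples: the second is the first shifted upward by 10.
first = [1, 2, 3, 4, 5, 6, 7, 8, 9]
second = [x + 10 for x in first]
pairs = qq_points(first, second)
shifts = [b - a for a, b in pairs]
```

If the two distributions were identical the pairs would fall on the line y = x. Here every pair sits 10 above that line, which is the kind of shift a q-q plot is meant to reveal.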
Scatter plot
Provides a first look at bivariate data to see clusters of points, outliers, etc.
Each pair of values is treated as a pair of coordinates and plotted as points in the plane
Loess Curve
Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
Loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression
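To show how the smoothing parameter enters the fit, here is a simplified sketch of local regression at a single point. It fixes the polynomial degree at 1 (a local straight line) and assumes tricube weights over the nearest fraction `span` of the points; both are common choices but are assumptions here, not something the slides specify. The function name is hypothetical.

```python
import math


def loess_fit(xs, ys, x0, span=0.75):
    """Estimate y at x0 with a locally weighted straight-line fit.

    span: smoothing parameter, the fraction of points used around x0.
    The polynomial degree is fixed at 1 in this sketch.
    """
    n = len(xs)
    k = min(n, max(3, math.ceil(span * n)))       # neighbours to use
    h = sorted(abs(x - x0) for x in xs)[k - 1]    # distance to k-th neighbour
    weights = []
    for x in xs:
        d = abs(x - x0)
        if h > 0:
            # Tricube weight: near 1 close to x0, falling to 0 at distance h.
            weights.append((1 - (d / h) ** 3) ** 3 if d < h else 0.0)
        else:
            weights.append(1.0 if d == 0 else 0.0)
    sw = sum(weights)
    if sw == 0:
        raise ValueError("not enough distinct neighbours around x0")
    mx = sum(w * x for w, x in zip(weights, xs)) / sw
    my = sum(w * y for w, y in zip(weights, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
    if sxx == 0:
        return my  # all weighted points share one x: fall back to their mean
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
    return my + (sxy / sxx) * (x0 - mx)


# Hypothetical data lying exactly on y = 2x + 1.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
smooth = [loess_fit(xs, ys, x, span=0.5) for x in xs]
```

Evaluating `loess_fit` at each x of the scatter plot and joining the results gives the smooth curve. A larger `span` uses more neighbours and produces a smoother, less local curve.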
Histogram: (shown before)
Boxplot: (covered before)
Quantile plot: each value x_i is paired with f_i, indicating that approximately 100 f_i % of the data are ≤ x_i
Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane
Loess (local regression) curve: adds a smooth curve to a scatter plot to provide better perception of the pattern of dependence