Data Mining: Concepts and Techniques
— Chapter 2 —
Types of Data Sets
Record
  Data matrix, e.g., numerical matrix, crosstabs
  Document data: text documents represented as term-frequency vectors

               team  coach  play  ball  score  game  win  lost  timeout  season
  Document 1     3     0      5     0     2      6     0     2      0       2
  Document 3     0     1      0     0     1      2     2     0      3       0
Transaction data
  TID  Items
  1    Bread, Coke, Milk
  2    Beer, Bread
  3    Beer, Coke, Diaper, Milk
  4    Beer, Bread, Diaper, Milk
  5    Coke, Diaper, Milk

Graph
  World Wide Web
  Social or information networks
  Molecular structures

Ordered
  Spatial data: maps
  Temporal data: time-series
  Sequential data: transaction sequences
Important Characteristics of Structured Data
Dimensionality
Curse of dimensionality
Sparsity
Only presence counts
Resolution
Patterns depend on the scale
Similarity
Distance measure
Types of Attribute Values
  Nominal
    E.g., ID numbers, eye color, zip codes
  Ordinal
    E.g., rankings (e.g., army, professions), grades, height
  Interval
    E.g., calendar dates, body temperatures
  Ratio
    E.g., temperature in Kelvin, length, time, counts
Discrete vs. Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of
values
E.g., zip codes, profession, or the set of words in
a collection of documents
  Sometimes represented as integer variables
  Note: binary attributes are a special case of discrete attributes
Continuous Attribute
  Has real numbers as attribute values
  Typically represented as floating-point variables
Chapter 2: Data Preprocessing
General data characteristics
Basic data description and exploration
Measuring data similarity
Data cleaning
Data integration and transformation
Data reduction
Summary
Mining Data Descriptive Characteristics
  Motivation
To better understand the data: central tendency,
variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities
of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
  Mean (algebraic measure) (sample vs. population):
    x̄ = (1/n) ∑ xi        µ = (∑ x) / N
  Weighted arithmetic mean:
    x̄ = (∑ wi xi) / (∑ wi)
  Trimmed mean: chopping extreme values
  Median: a holistic measure
    Middle value if odd number of values, or average of the middle two values otherwise
    Estimated by interpolation (for grouped data):
      median = L1 + ((N/2 − (∑ freq)l) / freq_median) × width
  [Figure: symmetric vs. positively and negatively skewed data distributions]
Measuring the Dispersion of Data
  Variance (population):
    σ² = (1/N) ∑ (xi − µ)²
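As a quick illustration of these measures, here is a minimal Python sketch (the data values and the 10% trim fraction are illustrative assumptions, not from the slides):

import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # illustrative values

mean = sum(data) / len(data)        # algebraic measure
median = statistics.median(data)    # holistic measure

def weighted_mean(xs, ws):
    # Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)

def trimmed_mean(xs, frac=0.10):
    # Trimmed mean: chop the extreme frac of values at each end
    xs = sorted(xs)
    k = int(len(xs) * frac)
    core = xs[k:len(xs) - k] if k else xs
    return sum(core) / len(core)

variance = sum((x - mean) ** 2 for x in data) / len(data)  # population sigma^2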
Histogram Analysis
  A univariate graphical method
Consists of a set of rectangles that reflect the counts
or frequencies of the classes present in the given
data
Scatterplot Matrices
  Matrix of scatterplots (x-y diagrams) of the k-dimensional data [total of C(k, 2) = (k² − k)/2 scatterplots]
Landscapes
Used by permission of B. Wright, Visible Decisions Inc.
  [Figure: news articles visualized as a landscape]
  [Figure: census data showing age, income, sex, education, etc., plotted along axes attribute 1 to attribute 4]
  But difficult to display more than nine dimensions
Dimensional Stacking
Used by permission of M. Ward, Worcester Polytechnic Institute
Visualization of oil mining data with longitude and latitude mapped to the
outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes
Tree-Map
Screen-filling method which uses a hierarchical
partitioning of the screen into regions depending on
the attribute values
The x- and y-dimension of the screen are partitioned
alternately according to the attribute values
(classes)
Similarity and Dissimilarity
  Similarity
    Numerical measure of how alike two data objects are
    Value is higher when objects are more alike
  Dissimilarity
    Numerical measure of how different two data objects are
    Lower when objects are more alike
[Figure: scatter plot of points p1–p4]

Data Matrix
  point  x  y
  p1     0  2
  p2     2  0
  p3     3  1
  p4     5  1

Distance Matrix (Euclidean)
        p1     p2     p3     p4
  p1    0      2.828  3.162  5.099
  p2    2.828  0      1.414  3.162
  p3    3.162  1.414  0      2
  p4    5.099  3.162  2      0
Minkowski Distance
  d(i, j) = (|xi1 − xj1|^q + |xi2 − xj2|^q + … + |xip − xjp|^q)^(1/q)
  where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is the order
  Properties
    d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
    d(i, j) = d(j, i) (symmetry)
    d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
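A minimal Python sketch of the Minkowski distance (the function name is illustrative; p1 and p2 are the points from the data matrix above):

def minkowski(x, y, q):
    # Minkowski distance of order q between two p-dimensional points
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

p1, p2 = (0, 2), (2, 0)
print(minkowski(p1, p2, 1))  # 4.0 -> Manhattan (q = 1)
print(minkowski(p1, p2, 2))  # 2.828... -> Euclidean (q = 2), matches the matrix above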
Special Cases of Minkowski Distance
  q = 1: Manhattan (city block, L1 norm) distance
    E.g., the Hamming distance: the number of bits that are different between two binary vectors
    d(i, j) = |xi1 − xj1| + |xi2 − xj2| + … + |xip − xjp|
  q → ∞: supremum (L∞ norm) distance: the maximum difference between any attribute of the objects

Distance Matrix (L∞)
        p1  p2  p3  p4
  p1    0   2   3   5
  p2    2   0   1   3
  p3    3   1   0   2
  p4    5   3   2   0
Interval-valued variables: standardize the data (e.g., by the standard deviation or mean absolute deviation) before computing the distance
Binary Variables
  A contingency table for binary data:

                      Object j
                      1      0      sum
    Object i   1      a      b      a + b
               0      c      d      c + d
               sum    a + c  b + d  p

  Distance measure for symmetric binary variables:
    d(i, j) = (b + c) / (a + b + c + d)
  Distance measure for asymmetric binary variables:
    d(i, j) = (b + c) / (a + b + c)
  Jaccard coefficient (similarity measure for asymmetric binary variables):
    simJaccard(i, j) = a / (a + b + c)
  Note: the Jaccard coefficient is the same as “coherence”:
    coherence(i, j) = sup(i, j) / (sup(i) + sup(j) − sup(i, j)) = a / ((a + b) + (a + c) − a)
Dissimilarity between Binary Variables
Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
gender is a symmetric attribute
the remaining attributes are asymmetric binary
  let the values Y and P be set to 1, and the value N be set to 0
    d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
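A small Python sketch that reproduces these numbers under the stated mapping (Y/P map to 1, N to 0; gender, a symmetric attribute, is left out; names are illustrative):

def asym_binary_d(x, y):
    # d(i, j) = (b + c) / (a + b + c); negative matches (d) are ignored
    a = sum(u == 1 and v == 1 for u, v in zip(x, y))
    b = sum(u == 1 and v == 0 for u, v in zip(x, y))
    c = sum(u == 0 and v == 1 for u, v in zip(x, y))
    return (b + c) / (a + b + c)

jack = (1, 0, 1, 0, 0, 0)  # Fever, Cough, Test-1 .. Test-4
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)
print(round(asym_binary_d(jack, mary), 2))  # 0.33
print(round(asym_binary_d(jack, jim), 2))   # 0.67
print(round(asym_binary_d(jim, mary), 2))   # 0.75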
Nominal Variables
  A generalization of the binary variable in that it can take more than two states

Variables of Mixed Types
  A database may contain several variable types; one may use a weighted formula to combine their effects:
    d(i, j) = (∑_{f=1..p} δij(f) dij(f)) / (∑_{f=1..p} δij(f))
  If f is binary or nominal:
    dij(f) = 0 if xif = xjf, or dij(f) = 1 otherwise
  If f is interval-based: use the normalized distance
  If f is ordinal or ratio-scaled:
    compute ranks rif and treat zif = (rif − 1) / (Mf − 1) as interval-scaled
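A hedged Python sketch of this combined formula (the attribute typing, range normalization, and sample objects are illustrative assumptions):

def mixed_type_d(x, y, types, ranges):
    # d(i, j) = sum_f(delta_f * d_f) / sum_f(delta_f); delta_f = 0 on missing values
    num = den = 0.0
    for f, (a, b) in enumerate(zip(x, y)):
        if a is None or b is None:
            continue
        if types[f] in ("binary", "nominal"):
            d = 0.0 if a == b else 1.0
        else:  # interval-based: normalized distance
            d = abs(a - b) / ranges[f]
        num += d
        den += 1.0
    return num / den

# one nominal and two interval attributes
print(mixed_type_d(("red", 30, 1.7), ("blue", 40, 1.7),
                   types=("nominal", "interval", "interval"),
                   ranges=(None, 60.0, 1.0)))  # (1 + 0.167 + 0) / 3 = 0.389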
Vector Objects: Cosine Similarity
Vector objects: keywords in documents, gene features in micro-
arrays, …
Applications: information retrieval, biologic taxonomy, ...
Cosine measure: If d1 and d2 are two vectors, then
cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| ,
where • indicates vector dot product, ||d||: the length of
vector d
Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
  d1 • d2 = 3×1 + 2×0 + 0×0 + 5×0 + 0×0 + 0×0 + 0×0 + 2×1 + 0×0 + 0×2 = 5
  ||d1|| = (3×3 + 2×2 + 0×0 + 5×5 + 0×0 + 0×0 + 0×0 + 2×2 + 0×0 + 0×0)^0.5 = 42^0.5 = 6.481
  ||d2|| = (1×1 + 1×1 + 2×2)^0.5 = 6^0.5 = 2.449
  cos(d1, d2) = 5 / (6.481 × 2.449) ≈ 0.315
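The same computation as a minimal Python sketch (the function name is illustrative):

import math

def cosine(d1, d2):
    # cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)
    dot = sum(a * b for a, b in zip(d1, d2))
    n1 = math.sqrt(sum(a * a for a in d1))
    n2 = math.sqrt(sum(b * b for b in d2))
    return dot / (n1 * n2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(d1, d2), 3))  # 0.315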
Chapter 2: Data Preprocessing
General data characteristics
Basic data description and exploration
Measuring data similarity
Data cleaning
Data integration and transformation
Data reduction
Summary
Major Tasks in Data Preprocessing
  Data integration: integration of multiple databases, data cubes, or files
  Data transformation
  Data reduction
Multi-Dimensional Measure of Data Quality
  Completeness
  Consistency
  Timeliness
  Believability
  Value added
  Interpretability
  Accessibility
  Broad categories: intrinsic, contextual, representational, and accessibility
Missing data may be due to, e.g.:
  inconsistent with other recorded data and thus deleted
  data not entered due to misunderstanding
How to Handle Missing Data?

Noisy Data
  Incorrect values may come from, e.g., technology limitation
  Other data problems that require cleaning: duplicate records, incomplete data, inconsistent data

How to Handle Noisy Data?
  Binning: first sort data and partition into (equal-frequency) bins; then one can smooth by bin means, smooth by bin median, or smooth by bin boundaries (see the sketch after this list)
  Regression: smooth by fitting the data into regression functions
  Clustering: detect and remove outliers
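A minimal Python sketch of smoothing by equal-frequency bin means (the sorted price data is illustrative):

def smooth_by_bin_means(values, n_bins):
    # Equal-frequency binning, then replace each value by its bin's mean
    xs = sorted(values)
    size = len(xs) // n_bins
    out = []
    for b in range(n_bins):
        bin_ = xs[b * size:(b + 1) * size] if b < n_bins - 1 else xs[b * size:]
        out.extend([sum(bin_) / len(bin_)] * len(bin_))
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))  # [9, 9, 9, 22, 22, 22, 29, 29, 29]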
  [Figure: regression line y = x + 1 fitted through the data, smoothing Y1 to Y1′]
Data Cleaning as a Process
  Check field overloading
  Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
  Data auditing: by analyzing data to discover rules and relationships to detect violators

Data Integration
  Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Correlation Analysis (Numerical Data)
  Correlation coefficient (Pearson's product-moment coefficient):
    rp,q = (∑ (p − p̄)(q − q̄)) / ((n − 1) σp σq) = (∑ (pq) − n p̄ q̄) / ((n − 1) σp σq)
  If the vectors are standardized: correlation(p, q) = p′ • q′
  [Figure: scatter plots showing correlations from −1 to 1]
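A minimal Python sketch of the coefficient as defined above (the sample data is illustrative):

import math

def pearson_r(p, q):
    # r = (sum(p*q) - n * mean_p * mean_q) / ((n - 1) * sigma_p * sigma_q)
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    sp = math.sqrt(sum((x - mp) ** 2 for x in p) / (n - 1))
    sq = math.sqrt(sum((y - mq) ** 2 for y in q) / (n - 1))
    return (sum(x * y for x, y in zip(p, q)) - n * mp * mq) / ((n - 1) * sp * sq)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # ~1.0 (perfectly positively correlated)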
Data Transformation
  Aggregation: summarization, data cube construction
  Generalization: concept hierarchy climbing
  Normalization: scaled to fall within a small, specified range
    min-max normalization
    z-score normalization
    normalization by decimal scaling
  Attribute/feature construction: new attributes constructed from the given ones
Data Transformation: Normalization
Min-max normalization: to [new_minA, new_maxA]
  v′ = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
  Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
    ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
Z-score normalization (µ: mean, σ: standard deviation):
  v′ = (v − µA) / σA
  Ex. Let µ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
Normalization by decimal scaling:
  v′ = v / 10^j, where j is the smallest integer such that max(|v′|) < 1
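The three normalizations as a minimal Python sketch, reproducing the worked examples above (function names are illustrative):

def min_max(v, mn, mx, new_mn=0.0, new_mx=1.0):
    # Min-max normalization to [new_mn, new_mx]
    return (v - mn) / (mx - mn) * (new_mx - new_mn) + new_mn

def z_score(v, mu, sigma):
    # Z-score normalization
    return (v - mu) / sigma

def decimal_scale(v, j):
    # v' = v / 10^j, with j the smallest integer such that max(|v'|) < 1
    return v / 10 ** j

print(round(min_max(73_600, 12_000, 98_000), 3))  # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))  # 1.225
print(decimal_scale(986, 3))                      # 0.986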
Chapter 2: Data Preprocessing
General data characteristics
Basic data description and exploration
Measuring data similarity
Data cleaning
Data integration and transformation
Data reduction
Summary
Data reduction: obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
Data reduction strategies
  Dimensionality reduction, e.g., remove unimportant attributes
  Numerosity reduction (some simply call it: data reduction)
    Data cube aggregation
    Data compression
    Regression
  Discretization (and concept hierarchy generation)
Dimensionality Reduction
  Curse of dimensionality
    When dimensionality increases, data becomes increasingly sparse
    Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
    The possible combinations of subspaces grow exponentially
  Dimensionality reduction
    Avoid the curse of dimensionality
Principal Component Analysis (Steps)
Given N data vectors from n-dimensions, find k ≤ n
orthogonal vectors (principal components) that can be best
used to represent data
Normalize input data: Each attribute falls within the same
range
Compute k orthonormal (unit) vectors, i.e., principal
components
Each input data (vector) is a linear combination of the k
principal component vectors
The principal components are sorted in order of decreasing
“significance” or strength
Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (i.e., using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
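A minimal numpy sketch of these steps (assumes numpy is available; the random data is illustrative):

import numpy as np

def pca(X, k):
    # Project n-dimensional data onto its k strongest principal components
    Xc = X - X.mean(axis=0)           # normalize: center each attribute
    cov = np.cov(Xc, rowvar=False)    # attribute covariance matrix
    vals, vecs = np.linalg.eigh(cov)  # orthonormal (unit) eigenvectors
    order = np.argsort(vals)[::-1]    # sort by decreasing "significance"
    W = vecs[:, order[:k]]            # keep the k strongest components
    return Xc @ W                     # each row: linear combination of components

X = np.random.default_rng(0).normal(size=(100, 5))
print(pca(X, 2).shape)  # (100, 2)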
Feature Subset Selection
Another way to reduce dimensionality of data
Redundant features
duplicate much or all of the information
contained in one or more other attributes
E.g., purchase price of a product and the
amount of sales tax paid
Irrelevant features
contain no information that is useful for the data
mining task at hand
E.g., students' ID is often irrelevant to the task
of predicting students' GPA
...
Step-wise feature elimination: repeatedly eliminate the worst feature
Feature extraction
domain-specific
Mapping data to new space (see: data
reduction)
E.g., Fourier transformation, wavelet
transformation
Feature construction
Combining features
Data discretization
Mapping Data to a New Space
Fourier transform
Wavelet transform
Parametric Data Reduction: Regression and Log-Linear Models
Linear regression: Data are modeled to fit a
straight line
Often uses the least-square method to fit the
line
Multiple regression: allows a response variable Y
to be modeled as a linear function of
multidimensional feature vector
Log-linear model: approximates discrete multidimensional probability distributions
Regression Analysis and Log-Linear Models
  Linear regression: Y = w X + b
    Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand
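A minimal Python sketch of the least-square estimates for w and b (the sample points are illustrative):

def fit_line(xs, ys):
    # Least-squares estimates of w and b in Y = w * X + b
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = (sum(x * y for x, y in zip(xs, ys)) - n * mx * my) / \
        (sum(x * x for x in xs) - n * mx * mx)
    return w, my - w * mx

print(fit_line([1, 2, 3, 4], [3, 5, 7, 9]))  # (2.0, 1.0): Y = 2X + 1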
Data Compression
  String/image compression
    Extensive theories and well-tuned algorithms exist
    Typically lossless
    But only limited manipulation is possible without expansion
  Audio/video compression
    Typically lossy compression, with progressive refinement
    Sometimes small fragments of signal can be reconstructed without reconstructing the whole
  [Figure: original data vs. lossless compressed and lossy approximated data]
Data Reduction Method: Histograms
  [Figure: histogram of values from 10,000 to 100,000]
  V-optimal: the histogram with the least variance (a weighted sum of the original values that each bucket represents)
  MaxDiff: set bucket boundaries between the pairs of adjacent values with the largest differences
Data Reduction Method:
Clustering
Partition data set into clusters based on similarity,
and store cluster representation (e.g., centroid
and diameter) only
Can be very effective if data is clustered but not if
data is “smeared”
Can have hierarchical clustering and be stored in
multi-dimensional index tree structures
There are many choices of clustering definitions
and clustering algorithms
Cluster analysis will be studied in depth in
Chapter 7
Data Reduction Method: Sampling
  Simple random sampling: there is an equal probability of selecting any particular item
  Sampling without replacement: once an object is selected, it is removed from the population
  Sampling with replacement: a selected object is not removed from the population
  Stratified sampling: partition the data set and draw samples from each partition proportionally

  [Figure: SRSWOR (simple random sample without replacement) and SRSWR applied to raw data]
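A minimal Python sketch of the three schemes using the standard library (function names and the stratum fraction are illustrative):

import random

def srswor(data, n):
    # SRSWOR: simple random sample without replacement
    return random.sample(data, n)

def srswr(data, n):
    # SRSWR: a selected object is not removed from the population
    return [random.choice(data) for _ in range(n)]

def stratified(strata, frac):
    # Draw approximately the same percentage from each partition
    return [x for s in strata
              for x in random.sample(s, max(1, round(len(s) * frac)))]

raw = list(range(100))
print(srswor(raw, 5), srswr(raw, 5))
print(stratified([raw[:80], raw[80:]], 0.05))  # 4 from the big stratum, 1 from the small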
Sampling: Cluster or Stratified Sampling
  [Figure: raw data vs. cluster/stratified sample]
Discretization
Reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals
Interval labels can then be used to replace actual data
values
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
Concept hierarchy formation
Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) by higher-level concepts (such as young, middle-aged, or senior)
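A minimal Python sketch of interval-label discretization (equal-width intervals; the age data and bin count are illustrative assumptions):

def equal_width_labels(values, k):
    # Divide the attribute range into k intervals and return interval labels
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    def label(v):
        i = min(int((v - lo) / width), k - 1)  # clamp the max value into the last bin
        return f"[{lo + i * width:.0f}, {lo + (i + 1) * width:.0f})"
    return [label(v) for v in values]

ages = [13, 15, 21, 25, 33, 35, 42, 49, 58, 64, 70]
print(equal_width_labels(ages, 3))  # labels like '[13, 32)', '[32, 51)', '[51, 70)'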
Discretization and Concept Hierarchy Generation for Numeric Data
Typical methods: All the methods can be applied recursively
Binning (covered above)
Top-down split, unsupervised,
Histogram analysis (covered above)
Top-down split, unsupervised
Clustering analysis (covered above)
Either top-down split or bottom-up merge, unsupervised
Entropy-based discretization: supervised, top-down split (see the sketch after this list)
Interval merging by χ2 Analysis: unsupervised, bottom-up
merge
Segmentation by natural partitioning: top-down split, unsupervised
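As a concrete illustration of the supervised, top-down entropy-based method, a hedged Python sketch that picks the single boundary minimizing weighted class entropy (the data and class labels are illustrative; a full method would recurse on each interval and apply a stopping criterion):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, classes):
    # Try each candidate boundary; keep the one with minimum weighted entropy
    pairs = sorted(zip(values, classes))
    n, best = len(pairs), None
    for i in range(1, n):
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or info < best[0]:
            best = (info, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best[1]

print(best_split([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"]))  # 6.5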
[Figure: 3-4-5 rule example; Step 3 partitions profit into (−$1,000 … $2,000), Step 4 refines to (−$400 … $5,000) with interval labels]
[Example: concept hierarchy generation for categorical data, e.g., year; country (15 distinct values)]
Feature Subset Selection Techniques
  Brute-force approach:
    Try all possible feature subsets as input to the data mining algorithm (see the sketch below)
  Embedded approaches:
    Feature selection occurs naturally as part of the data mining algorithm as it runs
  Wrapper approaches:
    Use the data mining algorithm as a black box to find the best subset of attributes
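A minimal Python sketch of the brute-force approach (the evaluate callback stands in for running the data mining algorithm on each subset and is purely illustrative):

from itertools import combinations

def brute_force_select(features, evaluate):
    # Try all possible feature subsets; keep the best-scoring one
    best_score, best_subset = float("-inf"), ()
    for r in range(1, len(features) + 1):
        for subset in combinations(features, r):
            score = evaluate(subset)
            if score > best_score:
                best_score, best_subset = score, subset
    return best_subset

# e.g., a wrapper approach would make evaluate() run the miner as a black box
print(brute_force_select(["age", "income", "zip"], evaluate=len))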