Universitas Indonesia
2012
Data Mining
More and more data is generated: banking, telecom, and other industries.
The Data Gap
[Chart, 1995-1999: the total amount of data grows steeply into the millions, while the number of analysts stays nearly flat]
From: R. Grossman, C. Kamath, V. Kumar, Data Mining for Scientific and Engineering Applications
Data Mining: Definition
Simple retrieval tasks like the following are not data mining:
Look up a phone number in a phone directory
Query a Web search engine for information about Amazon
The KDD process (reading the figure left to right):
Raw Data → Selection & Cleaning → Target Data (often in a Data Warehouse) → Transformation → Transformed Data → Data Mining → Patterns and Rules → Interpretation & Evaluation → Knowledge
Domain Understanding informs every step.
Clustering
Classification
Association Rules
Other Methods:
Outlier detection
Sequential patterns
Prediction
Trends and analysis of changes
Methods for special data types, e.g., spatial data
mining, web mining
10
11
Example:
{bread} → {butter, cheese}
{onion, tomato} → {salt}
Support of an itemset P, sD(P): the fraction of transactions in D that contain P, where |D| is the cardinality of D.
Confidence of a rule P → Q:
cD(P → Q) = sD(P ∪ Q) / sD(P)
i.e., the percentage of transactions that contain both P and Q among those that contain P.
Thresholds:
Frequent itemset P: support sD(P) is at least the minimum support
Strong rule P → Q (c%): (P ∪ Q) is frequent and c is larger than the minimum confidence
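A minimal sketch of these definitions in Python (the toy transactions and item names here are illustrative, not from the slides):

```python
def support(D, itemset):
    """s_D(P): fraction of transactions in D that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in D) / len(D)

def confidence(D, P, Q):
    """c_D(P -> Q) = s_D(P ∪ Q) / s_D(P)."""
    return support(D, set(P) | set(Q)) / support(D, P)

# Toy database of 4 transactions
D = [{"bread", "butter"}, {"bread", "butter", "cheese"},
     {"bread"}, {"milk"}]

print(support(D, {"bread"}))                 # bread appears in 3 of 4 -> 0.75
print(confidence(D, {"bread"}, {"butter"}))  # 0.5 / 0.75 -> 2/3
```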
Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%

Input:
A database of transactions; each transaction is a list of items (e.g., items purchased by a customer in one visit).
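The frequent-itemset table above can be reproduced by brute-force enumeration (assuming a minimum support of 50%):

```python
from itertools import combinations

D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
min_sup = 0.5
items = sorted(set().union(*D))

# Enumerate every candidate itemset and keep those meeting min_sup
frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        s = sum(set(cand) <= t for t in D) / len(D)
        if s >= min_sup:
            frequent[frozenset(cand)] = s

for itemset, s in sorted(frequent.items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), f"{s:.0%}")
```

This prints exactly the four itemsets of the table: {A} 75%, {B} 50%, {C} 50%, {A,C} 50%.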
Find the frequent itemsets.
The Apriori property: every subset of a frequent itemset must also be a frequent itemset, i.e., if {A,B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
TID    List of Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Consider a database, D, consisting of 9 transactions.
Suppose the minimum support count required is 2 (i.e., min_sup = 2/9 = 22%).
Let the minimum confidence required be 70%.
We first find the frequent itemsets using the Apriori algorithm.
Then, association rules are generated using minimum support and minimum confidence.
In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. Scan D for the count of each candidate:

C1:
Itemset   Sup. Count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

Compare each candidate's support count with the minimum support count. The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support; here every candidate qualifies, so L1 contains all five itemsets.
Generate C2 candidates from L1:

C2: {I1,I2}, {I1,I3}, {I1,I4}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I3,I4}, {I3,I5}, {I4,I5}

Scan D for the count of each candidate, then compare each candidate's support count with the minimum support count:

L2:
Itemset   Sup. Count
{I1,I2}   4
{I1,I3}   4
{I1,I5}   2
{I2,I3}   4
{I2,I4}   2
{I2,I5}   2
Generate C3 candidates from L2. The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori property.
In order to find C3, we compute L2 Join L2 = {{I1,I2,I3}, {I1,I2,I5}, {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5}, {I2,I4,I5}}.
Pruning with the Apriori property (every 2-item subset must itself be in L2) leaves C3 = {{I1,I2,I3}, {I1,I2,I5}}.
Scan D for the count of each candidate and compare with the minimum support count:

L3:
Itemset      Sup. Count
{I1,I2,I3}   2
{I1,I2,I5}   2
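The whole L1 → L2 → L3 computation can be sketched as a minimal Apriori implementation on the 9-transaction database above. (The candidate generation here is a simple union of pairs of frequent (k-1)-itemsets rather than the textbook prefix join, but pruning with the Apriori property yields the same candidates.)

```python
from itertools import combinations

# The 9-transaction database D from the worked example
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
min_count = 2  # minimum support count

def apriori(D, min_count):
    # L1: frequent 1-itemsets
    items = sorted(set().union(*D))
    L = [{frozenset([i]) for i in items
          if sum(i in t for t in D) >= min_count}]
    while L[-1]:
        prev = L[-1]
        k = len(next(iter(prev))) + 1
        # Join step: unions of frequent (k-1)-itemsets that have size k
        cands = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count step: keep candidates meeting the minimum support count
        L.append({c for c in cands
                  if sum(c <= t for t in D) >= min_count})
    return L[:-1]  # drop the final empty level

levels = apriori(D, min_count)
print(levels[2])  # L3
```

`levels[0]`, `levels[1]`, `levels[2]` reproduce L1 (5 itemsets), L2 (6 itemsets), and L3 = {{I1,I2,I3}, {I1,I2,I5}}.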
For each frequent itemset I, generate all nonempty subsets of I.
For every nonempty subset s of I, output the rule s → (I − s) if support_count(I) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.
In our example:
We had L = {{I1},{I2},{I3},{I4},{I5},{I1,I2},{I1,I3},{I1,I5},{I2,I3},{I2,I4},{I2,I5},{I1,I2,I3},{I1,I2,I5}}.
Let us take I = {I1,I2,I5}.
All its nonempty proper subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
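Continuing the example, a short sketch that enumerates the rules from I = {I1, I2, I5} and checks each against min_conf = 70%, using the same 9-transaction database:

```python
from itertools import combinations

# The 9-transaction database D from the worked example
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]

def count(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(set(itemset) <= t for t in D)

I = {"I1", "I2", "I5"}
min_conf = 0.70

for r in range(1, len(I)):
    for s in combinations(sorted(I), r):
        conf = count(I) / count(s)
        verdict = "strong" if conf >= min_conf else "rejected"
        print(f"{set(s)} -> {I - set(s)}: confidence {conf:.0%} ({verdict})")
```

With these counts the strong rules are {I5} → {I1,I2}, {I1,I5} → {I2}, and {I2,I5} → {I1}, each with 100% confidence; the other three subsets fall below 70%.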
From: http://www-users.cs.umn.edu/~kumar/dmbook/index.php
(Attribute types: Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class.)
Training Set:
Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Test Set:
Refund   Marital Status   Taxable Income   Cheat
No       Single           75K              ?
Yes      Married          50K              ?
No       Married          150K             ?
Yes      Divorced         90K              ?
No       Single           40K              ?
No       Married          80K              ?

The Training Set is fed to a learning algorithm (Learn Classifier) to build a Model, which is then applied to the Test Set.
From: http://www-users.cs.umn.edu/~kumar/dmbook/index.php
Training Data:
Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes
Splitting Attributes

Refund?
  Yes → NO
  No  → MarSt?
          Married → NO
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES
(Same Training Data as in the table above.)
There can be more than one tree that fits the same data:

MarSt?
  Married → NO
  Single, Divorced → Refund?
                       Yes → NO
                       No  → TaxInc?
                               < 80K → NO
                               > 80K → YES
Applying the model to Test Data: start from the root of the tree.

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Refund?
  Yes → NO
  No  → MarSt?
          Married → NO   (the test record follows this branch: predicted Cheat = No)
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES
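The first decision tree can be written as a plain function; classifying the test record (Refund = No, Married, 80K) follows the Married branch and predicts No. A minimal sketch:

```python
def classify(refund, marital_status, taxable_income):
    """Decision tree from the slides: Refund -> MarSt -> TaxInc."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income (threshold 80K)
    return "Yes" if taxable_income >= 80_000 else "No"

print(classify("No", "Married", 80_000))  # predicted Cheat = "No"
```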
Direct Marketing
Goal: Reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
Collect various demographic, lifestyle, and company-interaction related information about all such customers: type of business, where they stay, how much they earn, etc.
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and the information on the account holder as attributes: when does a customer buy, what does he buy, how often does he pay on time, etc.

Customer Attrition/Churn
Goal: Predict whether a customer is likely to be lost to a competitor.
Approach:
Use detailed records of transactions with each of the past and present customers to find attributes: how often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.
A good clustering has high intra-cluster similarity and low inter-cluster similarity.
Partitioning algorithms: construct various partitions and then evaluate them by some criterion.
Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion. There is an agglomerative approach and a divisive approach.
Single link: the distance between two clusters is the minimum distance between their members.

Initial distance matrix for points 1-5:
     1   2   3   4   5
1    0
2    2   0
3    6   3   0
4   10   9   7   0
5    9   8   5   4   0

Merge the closest pair (1,2) at distance 2:
        (1,2)   3   4   5
(1,2)     0
3         3     0
4         9     7   0
5         8     5   4   0

Merge (1,2) with 3 at distance 3:
          (1,2,3)   4   5
(1,2,3)      0
4            7      0
5            5      4   0

Merge (4,5) at distance 4; the final merge joins (1,2,3) with (4,5) at distance 5.
Complete link: the distance between two clusters is the maximum distance between their members.
+ tight clusters
- slow

Initial distance matrix (same as above):
     1   2   3   4   5
1    0
2    2   0
3    6   3   0
4   10   9   7   0
5    9   8   5   4   0

Merge (1,2) at distance 2; the new distances are now maxima:
        (1,2)   3   4   5
(1,2)     0
3         6     0
4        10     7   0
5         9     5   4   0

Merge (4,5) at distance 4:
        (1,2)   3   (4,5)
(1,2)     0
3         6     0
(4,5)    10     7    0

Next, (1,2) merges with 3 at distance 6, and the final merge joins (1,2,3) with (4,5) at distance 10.
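The merge sequence above can be reproduced with a minimal agglomerative sketch (single link shown; points are 0-indexed here, so point 1 of the slides is index 0):

```python
# Symmetric distance matrix for the 5 points of the example
dist = [
    [0, 2, 6, 10, 9],
    [2, 0, 3, 9, 8],
    [6, 3, 0, 7, 5],
    [10, 9, 7, 0, 4],
    [9, 8, 5, 4, 0],
]

def single_link(dist):
    """Agglomerative clustering, single link: repeatedly merge the two
    closest clusters; cluster distance = min over member point pairs."""
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters
        best = (float("inf"), None, None)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist[p][q] for p in clusters[i] for q in clusters[j])
                if d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] | clusters[j]
        merges.append((sorted(merged), d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

for members, d in single_link(dist):
    print(members, "merged at distance", d)
```

This prints the same sequence as the worked example: (1,2) at 2, (1,2,3) at 3, (4,5) at 4, and the final merge at 5. Swapping `min` for `max` in the cluster-distance line gives complete link.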
Understanding the Data
Data Cleaning
Missing values, noisy values, outliers
Dates
Nominal/Numeric
Discretization
Normalization
Smoothing
Transformation
Attribute selection
You can't be expected to be an expert in all fields, but understanding the data can be extremely useful for data mining.
What data is available?
What available data is actually relevant or useful?
Can the data be enriched from other sources?
Are there historical datasets available?
Who is the real expert to ask questions of?
(Are the results at all sensible? Or are they completely obvious?)
Number of instances available: 5000 or more for reliable results
Number of attributes: sampling
distribution?
Analyze data elements: check for inconsistencies, redundant values, missing values, outliers, etc.
Conduct a physical audit: ensure data was recorded properly, for example by cross-checking data with the customer.
Analyze business rules: check whether the data violates business rules.
Exclude
We want all dates to be the same. YYYY-MM-DD is an ISO standard, BUT it has some issues for data mining:
Year 10,000 AD! (we only have 4 digits)
Dates BC[E], e.g., -0300-02-03, is not a valid YYYY-MM-DD date
Most importantly: it does not preserve intervals, with/without the second
Other representations:
Posix/Unix system date: number of seconds since 1970
etc.
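A sketch of one interval-preserving representation (days since 1970-01-01, analogous to the Unix seconds count):

```python
from datetime import date

def to_days(iso_string):
    """Map an ISO YYYY-MM-DD date to days since 1970-01-01,
    so that subtracting two encoded dates gives a true interval."""
    return (date.fromisoformat(iso_string) - date(1970, 1, 1)).days

# String comparison of ISO dates preserves order but not intervals;
# the numeric encoding preserves both (2012 is a leap year).
print(to_days("2012-03-01") - to_days("2012-02-28"))  # 2
```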
Binary fields (e.g., gender): convert to 0/1
Ordered fields: convert to numbers to preserve order (e.g., A vs C grade becomes 4 and 2 respectively)
Few values: convert each value into a new binary attribute; for example, if the possible values for attribute AT are A, B, C, D, then you can create 4 new attributes ATa, ATb, ATc, ATd, each with value either 0 or 1
Many values: convert into groups of values, each with its own (binary) attribute, e.g., group the states in the US into 5 groups of 10
Unique values: ignore identifier-like attributes (drop the attribute)
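The "few values" case is one-hot encoding. A minimal sketch, using the slide's hypothetical attribute AT with values A-D:

```python
def one_hot(values, categories=("A", "B", "C", "D")):
    """Expand attribute AT into one binary attribute per category
    (ATa..ATd, following the naming in the slide)."""
    return [{f"AT{c.lower()}": int(v == c) for c in categories}
            for v in values]

print(one_hot(["B", "D"]))
# [{'ATa': 0, 'ATb': 1, 'ATc': 0, 'ATd': 0},
#  {'ATa': 0, 'ATb': 0, 'ATc': 0, 'ATd': 1}]
```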
Equal Width
Equal Depth
Class Dependent
Entropy
Fuzzy
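The first two methods can be sketched directly (the data values below are illustrative, not from the slides):

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width;
    returns the bin index of each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_depth_bins(values, k):
    """Equal frequency: sort values and give each consecutive chunk
    of about n/k values the same bin index."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_bins(data, 3))  # width 10: [0, 0, 1, 1, 1, 2, 2, 2, 2]
print(equal_depth_bins(data, 3))  # 3 values per bin: [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Equal width gives uneven bin populations when the data is skewed; equal depth keeps populations balanced at the cost of uneven interval widths.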
Min/Max Normalisation:
v' = (v − min) / (max − min) × (newMax − newMin) + newMin
E.g.: 73600 in [12000, 98000] mapped to [0, 1] is 0.716
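The formula and the worked example can be checked directly:

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """v' = (v - min)/(max - min) * (newMax - newMin) + newMin"""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

print(round(min_max(73600, 12000, 98000), 3))  # 0.716
```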
Transform attributes into a more useful form.
For example:
Birth date is transformed to Age
Date of the first transaction is
What does the data look like? Patients P1-P5, attributes T1-T5, class Cancer?:

ID   Cancer?
P1   Yes
P2   No
P3   No
P4   Yes
P5   No