Universitas Indonesia
2012
Data Mining
More and more data is generated: banking, telecom, and other industries.
The Data Gap
[Chart, 1995-1999: the total amount of data grows steeply into the millions, while the number of analysts stays nearly flat]
From: R. Grossman, C. Kamath, V. Kumar, Data Mining for Scientific and Engineering Applications
Data Mining: Definition
Simple retrieval tasks like the following are not data mining:
Look up a phone number in a phone directory
Query a Web search engine for information about Amazon
The KDD process (reading the figure left to right):
Raw Data → Selection & Cleaning → Target Data (often in a Data Warehouse) → Transformation → Transformed Data → Data Mining → Patterns and Rules → Interpretation & Evaluation → Knowledge
Domain Understanding informs every step.
Clustering
Classification
Association Rules
Other Methods:
Outlier detection
Sequential patterns
Prediction
Trends and analysis of changes
Methods for special data types, e.g., spatial data
mining, web mining
10
11
Example:
{bread} → {butter, cheese}
{onion, tomato} → {salt}
Support of an itemset P, sD(P): the fraction of transactions in D that contain P, where |D| is the cardinality of D.
Confidence of a rule P → Q:
cD(P → Q) = sD(P ∪ Q) / sD(P)
i.e., the percentage of transactions that contain both P and Q among those that contain P.
Thresholds:
Frequent itemset P: support sD(P) is at least the minimum support
Strong rule P → Q (c%): (P ∪ Q) is frequent and c is larger than the minimum confidence
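A minimal sketch of these definitions in Python (the toy transactions and item names here are illustrative, not from the slides):

```python
def support(D, itemset):
    """s_D(P): fraction of transactions in D that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in D) / len(D)

def confidence(D, P, Q):
    """c_D(P -> Q) = s_D(P ∪ Q) / s_D(P)."""
    return support(D, set(P) | set(Q)) / support(D, P)

# Toy database of 4 transactions
D = [{"bread", "butter"}, {"bread", "butter", "cheese"},
     {"bread"}, {"milk"}]

print(support(D, {"bread"}))                 # bread appears in 3 of 4 -> 0.75
print(confidence(D, {"bread"}, {"butter"}))  # 0.5 / 0.75 -> 2/3
```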
Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%

Input:
A database of transactions; each transaction is a list of items (e.g., items purchased by a customer in one visit).
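The frequent-itemset table above can be reproduced by brute-force enumeration (assuming a minimum support of 50%):

```python
from itertools import combinations

D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
min_sup = 0.5
items = sorted(set().union(*D))

# Enumerate every candidate itemset and keep those meeting min_sup
frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        s = sum(set(cand) <= t for t in D) / len(D)
        if s >= min_sup:
            frequent[frozenset(cand)] = s

for itemset, s in sorted(frequent.items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), f"{s:.0%}")
```

This prints exactly the four itemsets of the table: {A} 75%, {B} 50%, {C} 50%, {A,C} 50%.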
Find the frequent itemsets.
The Apriori property: every subset of a frequent itemset must also be a frequent itemset, i.e., if {A,B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
TID    List of Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Consider a database, D, consisting of 9 transactions.
Suppose the minimum support count required is 2 (i.e., min_sup = 2/9 = 22%).
Let the minimum confidence required be 70%.
We first find the frequent itemsets using the Apriori algorithm.
Then, association rules are generated using minimum support and minimum confidence.
In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. Scan D for the count of each candidate:

C1:
Itemset   Sup. Count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

Compare each candidate's support count with the minimum support count. The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support; here every candidate qualifies, so L1 contains all five itemsets.
Generate C2 candidates from L1:

C2: {I1,I2}, {I1,I3}, {I1,I4}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I3,I4}, {I3,I5}, {I4,I5}

Scan D for the count of each candidate, then compare each candidate's support count with the minimum support count:

L2:
Itemset   Sup. Count
{I1,I2}   4
{I1,I3}   4
{I1,I5}   2
{I2,I3}   4
{I2,I4}   2
{I2,I5}   2
Generate C3 candidates from L2. The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori property.
In order to find C3, we compute L2 Join L2 = {{I1,I2,I3}, {I1,I2,I5}, {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5}, {I2,I4,I5}}.
Pruning with the Apriori property (every 2-item subset must itself be in L2) leaves C3 = {{I1,I2,I3}, {I1,I2,I5}}.
Scan D for the count of each candidate and compare with the minimum support count:

L3:
Itemset      Sup. Count
{I1,I2,I3}   2
{I1,I2,I5}   2
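The whole L1 → L2 → L3 computation can be sketched as a minimal Apriori implementation on the 9-transaction database above. (The candidate generation here is a simple union of pairs of frequent (k-1)-itemsets rather than the textbook prefix join, but pruning with the Apriori property yields the same candidates.)

```python
from itertools import combinations

# The 9-transaction database D from the worked example
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
min_count = 2  # minimum support count

def apriori(D, min_count):
    # L1: frequent 1-itemsets
    items = sorted(set().union(*D))
    L = [{frozenset([i]) for i in items
          if sum(i in t for t in D) >= min_count}]
    while L[-1]:
        prev = L[-1]
        k = len(next(iter(prev))) + 1
        # Join step: unions of frequent (k-1)-itemsets that have size k
        cands = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count step: keep candidates meeting the minimum support count
        L.append({c for c in cands
                  if sum(c <= t for t in D) >= min_count})
    return L[:-1]  # drop the final empty level

levels = apriori(D, min_count)
print(levels[2])  # L3
```

`levels[0]`, `levels[1]`, `levels[2]` reproduce L1 (5 itemsets), L2 (6 itemsets), and L3 = {{I1,I2,I3}, {I1,I2,I5}}.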
For each frequent itemset I, generate all nonempty subsets of I.
For every nonempty subset s of I, output the rule s → (I − s) if support_count(I) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.
In our example:
We had L = {{I1},{I2},{I3},{I4},{I5},{I1,I2},{I1,I3},{I1,I5},{I2,I3},{I2,I4},{I2,I5},{I1,I2,I3},{I1,I2,I5}}.
Let us take I = {I1,I2,I5}.
All its nonempty proper subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
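Continuing the example, a short sketch that enumerates the rules from I = {I1, I2, I5} and checks each against min_conf = 70%, using the same 9-transaction database:

```python
from itertools import combinations

# The 9-transaction database D from the worked example
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]

def count(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(set(itemset) <= t for t in D)

I = {"I1", "I2", "I5"}
min_conf = 0.70

for r in range(1, len(I)):
    for s in combinations(sorted(I), r):
        conf = count(I) / count(s)
        verdict = "strong" if conf >= min_conf else "rejected"
        print(f"{set(s)} -> {I - set(s)}: confidence {conf:.0%} ({verdict})")
```

With these counts the strong rules are {I5} → {I1,I2}, {I1,I5} → {I2}, and {I2,I5} → {I1}, each with 100% confidence; the other three subsets fall below 70%.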
From: http://www-users.cs.umn.edu/~kumar/dmbook/index.php
(Attribute types: Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class.)
Training Set:
Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Test Set:
Refund   Marital Status   Taxable Income   Cheat
No       Single           75K              ?
Yes      Married          50K              ?
No       Married          150K             ?
Yes      Divorced         90K              ?
No       Single           40K              ?
No       Married          80K              ?

The Training Set is fed to a learning algorithm (Learn Classifier) to build a Model, which is then applied to the Test Set.
From: http://www-users.cs.umn.edu/~kumar/dmbook/index.php
Training Data:
Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes
Splitting Attributes

Refund?
  Yes → NO
  No  → MarSt?
          Married → NO
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES
(Same Training Data as in the table above.)
There can be more than one tree that fits the same data:

MarSt?
  Married → NO
  Single, Divorced → Refund?
                       Yes → NO
                       No  → TaxInc?
                               < 80K → NO
                               > 80K → YES
Applying the model to Test Data: start from the root of the tree.

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Refund?
  Yes → NO
  No  → MarSt?
          Married → NO   (the test record follows this branch: predicted Cheat = No)
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES
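The first decision tree can be written as a plain function; classifying the test record (Refund = No, Married, 80K) follows the Married branch and predicts No. A minimal sketch:

```python
def classify(refund, marital_status, taxable_income):
    """Decision tree from the slides: Refund -> MarSt -> TaxInc."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income (threshold 80K)
    return "Yes" if taxable_income >= 80_000 else "No"

print(classify("No", "Married", 80_000))  # predicted Cheat = "No"
```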
Direct Marketing
Goal: Reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
Collect various demographic, lifestyle, and company-interaction related information about all such customers: type of business, where they stay, how much they earn, etc.
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and the information on the account holder as attributes: when does a customer buy, what does he buy, how often does he pay on time, etc.

Customer Attrition/Churn
Goal: Predict whether a customer is likely to be lost to a competitor.
Approach:
Use detailed records of transactions with each of the past and present customers to find attributes: how often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.
A good clustering has high intra-cluster similarity and low inter-cluster similarity.
Partitioning algorithms: construct various partitions and then evaluate them by some criterion.
Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion. There is an agglomerative approach and a divisive approach.
Single link: the distance between two clusters is the minimum distance between their members.

Initial distance matrix for points 1-5:
     1   2   3   4   5
1    0
2    2   0
3    6   3   0
4   10   9   7   0
5    9   8   5   4   0

Merge the closest pair (1,2) at distance 2:
        (1,2)   3   4   5
(1,2)     0
3         3     0
4         9     7   0
5         8     5   4   0

Merge (1,2) with 3 at distance 3:
          (1,2,3)   4   5
(1,2,3)      0
4            7      0
5            5      4   0

Merge (4,5) at distance 4; the final merge joins (1,2,3) with (4,5) at distance 5.
Complete link: the distance between two clusters is the maximum distance between their members.
+ tight clusters
- slow

Initial distance matrix (same as above):
     1   2   3   4   5
1    0
2    2   0
3    6   3   0
4   10   9   7   0
5    9   8   5   4   0

Merge (1,2) at distance 2; the new distances are now maxima:
        (1,2)   3   4   5
(1,2)     0
3         6     0
4        10     7   0
5         9     5   4   0

Merge (4,5) at distance 4:
        (1,2)   3   (4,5)
(1,2)     0
3         6     0
(4,5)    10     7    0

Next, (1,2) merges with 3 at distance 6, and the final merge joins (1,2,3) with (4,5) at distance 10.
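The merge sequence above can be reproduced with a minimal agglomerative sketch (single link shown; points are 0-indexed here, so point 1 of the slides is index 0):

```python
# Symmetric distance matrix for the 5 points of the example
dist = [
    [0, 2, 6, 10, 9],
    [2, 0, 3, 9, 8],
    [6, 3, 0, 7, 5],
    [10, 9, 7, 0, 4],
    [9, 8, 5, 4, 0],
]

def single_link(dist):
    """Agglomerative clustering, single link: repeatedly merge the two
    closest clusters; cluster distance = min over member point pairs."""
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters
        best = (float("inf"), None, None)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist[p][q] for p in clusters[i] for q in clusters[j])
                if d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] | clusters[j]
        merges.append((sorted(merged), d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

for members, d in single_link(dist):
    print(members, "merged at distance", d)
```

This prints the same sequence as the worked example: (1,2) at 2, (1,2,3) at 3, (4,5) at 4, and the final merge at 5. Swapping `min` for `max` in the cluster-distance line gives complete link.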
Understanding the Data
Data Cleaning
Missing values, noisy values, outliers
Dates
Nominal/Numeric
Discretization
Normalization
Smoothing
Transformation
Attribute selection
You can't be expected to be an expert in all fields, but understanding the data can be extremely useful for data mining.
What data is available?
What available data is actually relevant or useful?
Can the data be enriched from other sources?
Are there historical datasets available?
Who is the real expert to ask questions of?
(Are the results at all sensible? Or are they completely obvious?)
Number of instances available: 5000 or more for reliable results
Number of attributes: sampling
distribution?
Analyze data elements: check for inconsistencies, redundant values, missing values, outliers, etc.
Conduct a physical audit: ensure data was recorded properly, for example by cross-checking data with the customer.
Analyze business rules: check whether the data violates business rules.
Exclude
We want all dates to be the same. YYYY-MM-DD is an ISO standard, BUT it has some issues for data mining:
Year 10,000 AD! (we only have 4 digits)
Dates BC[E], e.g., -0300-02-03, is not a valid YYYY-MM-DD date
Most importantly: it does not preserve intervals, with/without the second
Other representations:
Posix/Unix system date: number of seconds since 1970
etc.
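A sketch of one interval-preserving representation (days since 1970-01-01, analogous to the Unix seconds count):

```python
from datetime import date

def to_days(iso_string):
    """Map an ISO YYYY-MM-DD date to days since 1970-01-01,
    so that subtracting two encoded dates gives a true interval."""
    return (date.fromisoformat(iso_string) - date(1970, 1, 1)).days

# String comparison of ISO dates preserves order but not intervals;
# the numeric encoding preserves both (2012 is a leap year).
print(to_days("2012-03-01") - to_days("2012-02-28"))  # 2
```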
Binary fields (e.g., gender): convert to 0/1
Ordered fields: convert to numbers to preserve order (e.g., A vs C grade becomes 4 and 2 respectively)
Few values: convert each value into a new binary attribute; for example, if the possible values for attribute AT are A, B, C, D, then you can create 4 new attributes ATa, ATb, ATc, ATd, each with value either 0 or 1
Many values: convert into groups of values, each with its own (binary) attribute, e.g., group the states in the US into 5 groups of 10
Unique values: ignore identifier-like attributes (drop the attribute)
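The "few values" case is one-hot encoding. A minimal sketch, using the slide's hypothetical attribute AT with values A-D:

```python
def one_hot(values, categories=("A", "B", "C", "D")):
    """Expand attribute AT into one binary attribute per category
    (ATa..ATd, following the naming in the slide)."""
    return [{f"AT{c.lower()}": int(v == c) for c in categories}
            for v in values]

print(one_hot(["B", "D"]))
# [{'ATa': 0, 'ATb': 1, 'ATc': 0, 'ATd': 0},
#  {'ATa': 0, 'ATb': 0, 'ATc': 0, 'ATd': 1}]
```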
Equal Width
Equal Depth
Class Dependent
Entropy
Fuzzy
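The first two methods can be sketched directly (the data values below are illustrative, not from the slides):

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width;
    returns the bin index of each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_depth_bins(values, k):
    """Equal frequency: sort values and give each consecutive chunk
    of about n/k values the same bin index."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_bins(data, 3))  # width 10: [0, 0, 1, 1, 1, 2, 2, 2, 2]
print(equal_depth_bins(data, 3))  # 3 values per bin: [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Equal width gives uneven bin populations when the data is skewed; equal depth keeps populations balanced at the cost of uneven interval widths.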
Min/Max Normalisation:
v' = (v − min) / (max − min) × (newMax − newMin) + newMin
E.g.: 73600 in [12000, 98000] mapped to [0, 1] is 0.716
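The formula and the worked example can be checked directly:

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """v' = (v - min)/(max - min) * (newMax - newMin) + newMin"""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

print(round(min_max(73600, 12000, 98000), 3))  # 0.716
```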
Transform attributes into a more useful form.
For example:
Birth date is transformed to Age
Date of the first transaction is
What does the data look like? Patients P1-P5, attributes T1-T5, class Cancer?:

ID   Cancer?
P1   Yes
P2   No
P3   No
P4   Yes
P5   No