
Magister Teknologi Informasi

Universitas Indonesia
2012

Data Mining

More data is generated:
  Bank, telecom, and other business transactions ...
  Scientific data: astronomy, biology, etc.
  Web, text, and e-commerce
More data is captured:
  Storage technology is faster and cheaper
  DBMSs are capable of handling bigger databases

We have large amounts of data stored in one or more databases.
We are eager to find new information within those data (for research use, competitive edge, etc.).
We want to identify patterns or rules (trends and relationships) in those data.
We know that certain data exist inside a database, but what are the consequences of that data's existence?

There is often information hidden in the data that is not readily evident.
Human analysts may take weeks to discover useful information.
Much of the data is never analyzed at all.
[Chart: "The Data Gap" — total new disk capacity (TB) since 1995 versus the number of analysts, for the years 1995 to 1999; disk capacity grows into the millions of TB while the number of analysts stays nearly flat.]
From: R. Grossman, C. Kamath, V. Kumar, Data Mining for Scientific and Engineering Applications

Data mining is a process of extracting previously unknown, valid, and actionable information from large databases and then using the information to make crucial business decisions (Cabena et al., 1998).
Data mining: discovering interesting patterns from large amounts of data (Han and Kamber, 2001).

Definition from [Connolly, 2005]: the process of extracting valid, previously unknown, comprehensive, and actionable information from large databases and using it to make crucial business decisions.
The thin red line of data mining: it is all about finding patterns or rules by extracting data from large databases in order to find new information that could lead to new knowledge.

What is not Data Mining?
  Looking up a phone number in a phone directory.
  Querying a Web search engine for information about Amazon.
What is Data Mining?
  Discovering that certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in the Boston area).
  Grouping together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com).

Data Mining in the Knowledge Discovery Process

[Diagram: Raw Data --(Integration)--> Data Warehouse --(Selection & Cleaning)--> Target Data --(Transformation)--> Transformed Data --(Data Mining)--> Patterns and Rules --(Interpretation & Evaluation)--> Knowledge, all guided by an understanding of the domain.]

Clustering
Classification
Association Rules
Other Methods:
Outlier detection
Sequential patterns
Prediction
Trends and analysis of changes
Methods for special data types, e.g., spatial data mining, web mining

Association rules try to find associations between items in a set of transactions.
For example, in the case of associations between items bought by customers in a supermarket:
  90% of transactions that purchase bread and butter also purchase milk
  Antecedent: bread and butter
  Consequent: milk
  Confidence factor: 90%


A transaction is a set of items: T = {i_a, i_b, ..., i_t}
T ⊆ I, where I is the set of all possible items {i_1, i_2, ..., i_n}
D, the task-relevant data, is a set of transactions (a database of transactions).
Example:
  Items sold by a supermarket (the itemset I): {sugar, parsley, onion, tomato, salt, bread, olives, cheese, butter}
  A transaction by a customer (T): T1 = {sugar, onion, salt}
  Database (D): {T1 = {salt, bread, olives}, T2 = {sugar, onion, salt}, T3 = {bread}, T4 = {cheese, butter}, T5 = {tomato}, ...}

An association rule is of the form:
  P ⇒ Q, where P ⊂ I, Q ⊂ I, and P ∩ Q = ∅
Example:
  {bread} ⇒ {butter, cheese}
  {onion, tomato} ⇒ {salt}

Support of a rule P ⇒ Q = support of (P ∪ Q) in D:
  sD(P ⇒ Q) = sD(P ∪ Q)
  = percentage of transactions in D containing both P and Q
  = (number of transactions containing P and Q) / (cardinality of D).
Confidence of a rule P ⇒ Q:
  cD(P ⇒ Q) = sD(P ∪ Q) / sD(P)
  = percentage of transactions that contain Q among the transactions that already contain P.
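A minimal sketch (not part of the original slides) of these two measures in Python, evaluated on the small supermarket database D from the earlier example; the function names support and confidence are just illustrative.

```python
def support(itemset, transactions):
    """s_D(itemset): fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """c_D(P => Q) = s_D(P u Q) / s_D(P)."""
    return support(set(antecedent) | set(consequent), transactions) / support(antecedent, transactions)

# The example database D from the slide above
D = [{"salt", "bread", "olives"}, {"sugar", "onion", "salt"},
     {"bread"}, {"cheese", "butter"}, {"tomato"}]

print(support({"onion", "salt"}, D))       # 0.2 -> 20% of transactions contain both
print(confidence({"onion"}, {"salt"}, D))  # 1.0 -> every transaction with onion also has salt
```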

Thresholds:
  minimum support: minsup
  minimum confidence: minconf
Frequent itemset P:
  the support of P is larger than the minimum support.
Strong rule P ⇒ Q (c%):
  (P ∪ Q) is frequent, and
  c is larger than the minimum confidence.

Example: min. support 50%, min. confidence 50%.

Transaction ID | Items Bought
2000           | A, B, C
1000           | A, C
4000           | A, D
5000           | B, E, F

Frequent Itemset | Support
{A}              | 75%
{B}              | 50%
{C}              | 50%
{A, C}           | 50%

For rule {A} ⇒ {C}:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({A}) = 66.6%
For rule {C} ⇒ {A}:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({C}) = 100.0%

Input:
  A database of transactions.
  Each transaction is a list of items (e.g., the items purchased by a customer in one visit).
Goal: find all strong rules that associate the presence of one set of items with that of another set of items.
  Example: 98% of people who purchase tires and auto accessories also get automotive services done.
  There are no restrictions on the number of items in the head or body of the rule.
The most famous algorithm is APRIORI.

Find the frequent itemsets: the sets of items that have minimum support.
  A subset of a frequent itemset must also be a frequent itemset,
  i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
  Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
Use the frequent itemsets to generate association rules.
Source: [Sunysb, 2009]
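The sketch below (illustrative Python, not the published Apriori code) follows exactly this outline: it iteratively builds frequent k-itemsets, using the subset property for pruning, and is run on the 9-transaction database used in the walkthrough that follows; the minimum support is given here as an absolute count.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return every frequent itemset (as a frozenset) mapped to its support count."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) >= min_count}
    frequent, k = {}, 1
    while current:
        for s in current:
            frequent[s] = sum(s <= t for t in transactions)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have size k
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current for sub in combinations(c, k - 1))}
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) >= min_count}
    return frequent

# The 9-transaction database from the example below, with minimum support count 2
db = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
      {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"}]
for itemset, count in sorted(apriori(db, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```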

TID  | List of Items
T100 | I1, I2, I5
T200 | I2, I4
T300 | I2, I3
T400 | I1, I2, I4
T500 | I1, I3
T600 | I2, I3
T700 | I1, I3
T800 | I1, I2, I3, I5
T900 | I1, I2, I3

Consider a database, D, consisting of these 9 transactions.
Suppose the minimum support count required is 2 (i.e., min_sup = 2/9 = 22%).
Let the minimum confidence required be 70%.
We first have to find the frequent itemsets using the Apriori algorithm.
Then, association rules will be generated using the minimum support and minimum confidence.

Step 1: Generating the 1-itemset frequent pattern

Scan D for the count of each candidate (giving C1), then compare each candidate's support count with the minimum support count (giving L1):

C1 = L1:
Itemset | Sup. Count
{I1}    | 6
{I2}    | 7
{I3}    | 6
{I4}    | 2
{I5}    | 2

The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support; here every candidate qualifies, so L1 = C1.
In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1.

Step 2: Generating the 2-itemset frequent pattern

Generate the C2 candidates from L1, scan D for the count of each candidate, then compare the candidate support counts with the minimum support count:

C2:
Itemset  | Sup. Count
{I1, I2} | 4
{I1, I3} | 4
{I1, I4} | 1
{I1, I5} | 2
{I2, I3} | 4
{I2, I4} | 2
{I2, I5} | 2
{I3, I4} | 0
{I3, I5} | 1
{I4, I5} | 0

L2:
Itemset  | Sup. Count
{I1, I2} | 4
{I1, I3} | 4
{I1, I5} | 2
{I2, I3} | 4
{I2, I4} | 2
{I2, I5} | 2

To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 join L1 to generate a candidate set of 2-itemsets, C2.
Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is accumulated (as shown in the first table above).
The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
Note: we haven't used the Apriori property yet.


Step 3: Generating the 3-itemset frequent pattern

Generate the C3 candidates from L2, scan D for the count of each candidate, then compare the candidate support counts with the minimum support count:

C3:
Itemset      | Sup. Count
{I1, I2, I3} | 2
{I1, I2, I5} | 2

L3:
Itemset      | Sup. Count
{I1, I2, I3} | 2
{I1, I2, I5} | 2

The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori property.
In order to find C3, we compute L2 join L2:
C3 = L2 join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.

Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent.
For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} and {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we keep {I1, I2, I3} in C3.
Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. Its 2-item subsets are {I2, I3}, {I2, I5} and {I3, I5}.
BUT {I3, I5} is not a member of L2 and hence is not frequent, violating the Apriori property. Thus we have to remove {I2, I3, I5} from C3.
Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of the join operation for pruning.
Now the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.

Step 4: Generating the 4-itemset frequent pattern

The algorithm uses L3 join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned, since its subset {I2, I3, I5} is not frequent.
Thus C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets. This completes our Apriori algorithm.
What's next? These frequent itemsets will be used to generate strong association rules (where strong association rules satisfy both minimum support and minimum confidence).

Step 5: Generating association rules from frequent itemsets

Procedure:
  For each frequent itemset I, generate all nonempty subsets of I.
  For every nonempty subset s of I, output the rule s ⇒ (I − s) if support_count(I) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.

In our example:
  We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
  Let's take I = {I1, I2, I5}.
  All its nonempty proper subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
  Let the minimum confidence threshold be, say, 70%.
The resulting association rules are shown below, each listed with its confidence.

R1: {I1, I2} ⇒ {I5}
  Confidence = sc{I1,I2,I5} / sc{I1,I2} = 2/4 = 50%. R1 is rejected.
R2: {I1, I5} ⇒ {I2}
  Confidence = sc{I1,I2,I5} / sc{I1,I5} = 2/2 = 100%. R2 is selected.
R3: {I2, I5} ⇒ {I1}
  Confidence = sc{I1,I2,I5} / sc{I2,I5} = 2/2 = 100%. R3 is selected.
R4: {I1} ⇒ {I2, I5}
  Confidence = sc{I1,I2,I5} / sc{I1} = 2/6 = 33%. R4 is rejected.
R5: {I2} ⇒ {I1, I5}
  Confidence = sc{I1,I2,I5} / sc{I2} = 2/7 = 29%. R5 is rejected.
R6: {I5} ⇒ {I1, I2}
  Confidence = sc{I1,I2,I5} / sc{I5} = 2/2 = 100%. R6 is selected.

In this way, we have found three strong association rules.
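A small illustrative sketch of the Step 5 procedure (the helper names are assumptions; the support counts are copied from the worked example above); it reproduces rules R1 to R6 and their confidences.

```python
from itertools import combinations

# Support counts taken from L1, L2 and L3 in the worked example above
support_count = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2, frozenset({"I2", "I5"}): 2,
    frozenset({"I1", "I2", "I5"}): 2,
}

def rules_from(itemset, min_conf=0.7):
    """Emit s => (I - s) for every nonempty proper subset s, with its confidence."""
    I = frozenset(itemset)
    for r in range(1, len(I)):
        for s in map(frozenset, combinations(I, r)):
            conf = support_count[I] / support_count[s]
            status = "selected" if conf >= min_conf else "rejected"
            print(f"{sorted(s)} => {sorted(I - s)}  conf={conf:.0%}  ({status})")

rules_from({"I1", "I2", "I5"})  # prints the six rules R1..R6 above
```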

Learn a method for predicting the instance class from pre-labeled (classified) instances.
Many approaches:
  Statistics
  Decision Trees
  Neural Networks
  ...

Prepare a collection of records (the training set):
  each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes (decision tree, neural network, etc.).
Prepare a test set to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Once happy with the accuracy, use the model to classify new instances.

From: http://www-users.cs.umn.edu/~kumar/dmbook/index.php
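A minimal sketch of this train/test methodology using scikit-learn; the tiny record set is hypothetical and only loosely mirrors the Refund / Marital Status / Taxable Income / Cheat example that follows.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical records: [refund (1=Yes), marital status (0=Single,1=Married,2=Divorced), income in K]
X = [[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
     [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]  # class attribute: Cheat

# Split into training and test sets; build the model on the training set only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Validate on the held-out test set, then classify a new instance
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("new instance :", model.predict([[0, 1, 80]]))  # Refund=No, Married, 80K
```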

[Figure: the classification framework. A labeled training set with categorical attributes Refund and Marital Status, continuous attribute Taxable Income, and class attribute Cheat is fed to a learning algorithm ("Learn Classifier"), which induces a model; the model is then applied to a test set of unlabeled records.]

From: http://www-users.cs.umn.edu/~kumar/dmbook/index.php

Training data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class):

Tid | Refund | Marital Status | Taxable Income | Cheat
1   | Yes    | Single         | 125K           | No
2   | No     | Married        | 100K           | No
3   | No     | Single         | 70K            | No
4   | Yes    | Married        | 120K           | No
5   | No     | Divorced       | 95K            | Yes
6   | No     | Married        | 60K            | No
7   | Yes    | Divorced       | 220K           | No
8   | No     | Single         | 85K            | Yes
9   | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

Model: Decision Tree, with splitting attributes Refund, MarSt and TaxInc:
  Refund = Yes -> NO
  Refund = No  -> test MarSt:
    MarSt = Married            -> NO
    MarSt = Single or Divorced -> test TaxInc:
      TaxInc < 80K  -> NO
      TaxInc >= 80K -> YES

The same training data can also be fitted by a different model:

Model: an alternative Decision Tree:
  MarSt = Married            -> NO
  MarSt = Single or Divorced -> test Refund:
    Refund = Yes -> NO
    Refund = No  -> test TaxInc:
      TaxInc < 80K  -> NO
      TaxInc >= 80K -> YES

There could be more than one tree that fits the same data!

Test data: start from the root of the tree.

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Following the first tree above: Refund = No -> test MarSt; MarSt = Married -> assign class Cheat = NO.
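The first decision tree above can be written out by hand as a plain Python function and applied to the test record; this is only an illustration of how the model is read, not a tree induced by any learning algorithm.

```python
def classify(refund, marital_status, taxable_income_k):
    """Return the predicted 'Cheat' class for one record, following the tree above."""
    if refund == "Yes":
        return "No"                    # Refund = Yes -> NO
    if marital_status == "Married":
        return "No"                    # MarSt = Married -> NO
    # MarSt = Single or Divorced -> test TaxInc
    return "Yes" if taxable_income_k >= 80 else "No"

# Test record from the slide: Refund = No, Married, Taxable Income = 80K
print(classify("No", "Married", 80))   # -> "No"
```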

Direct Marketing
Goal: reduce the cost of mailing by targeting the set of consumers likely to buy a new cell-phone product.
Approach:
  Use the data for a similar product introduced before.
  We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
  Collect various demographic, lifestyle, and company-interaction related information about all such customers.
    Type of business, where they live, how much they earn, etc.
  Use this information as input attributes to learn a classifier model.
From [Berry & Linoff] Data Mining Techniques, 1997

Fraud Detection
Goal: predict fraudulent cases in credit card transactions.
Approach:
  Use credit card transactions and the information about the account holder as attributes.
    When does a customer buy, what does he buy, how often does he pay on time, etc.
  Label past transactions as fraudulent or fair. This forms the class attribute.
  Learn a model for the class of the transactions.
  Use this model to detect fraud by observing credit card transactions on an account.
From: http://www-users.cs.umn.edu/~kumar/dmbook/index.php

Customer Attrition/Churn
Goal: predict whether a customer is likely to be lost to a competitor.
Approach:
  Use detailed records of transactions with each of the past and present customers to find attributes.
    How often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.
  Label the customers as loyal or disloyal.
  Find a model for loyalty.
From [Berry & Linoff] Data Mining Techniques, 1997

Clustering is a process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.
It helps users understand the natural grouping or structure in a data set.
Cluster: a collection of data objects that are similar to one another and thus can be treated collectively as one group.
Clustering is unsupervised classification: there are no predefined classes.

Find the natural grouping of instances given unlabeled data.

A good clustering method will produce high quality clusters in which:
  the intra-class similarity (that is, within a cluster) is high, and
  the inter-class similarity (that is, between clusters) is low.
The quality of a clustering result also depends on both the similarity measure used by the method and its implementation.
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
The quality of a clustering result also depends on the definition and representation of cluster chosen.

Partitioning algorithms: construct various partitions and then evaluate them by some criterion.
Hierarchy algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion. There is an agglomerative approach and a divisive approach.

Partitioning method: given a number k, partition a database D of n objects into a set of k clusters so that a chosen objective function is minimized (e.g., the sum of distances to the centers of the clusters).
  Global optimum: exhaustively enumerating all partitions is too expensive!
  Heuristic methods are based on iterative refinement of an initial partition (see the sketch below).
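A tiny sketch of one such heuristic partitioning method, k-means as provided by scikit-learn, which iteratively refines an initial partition to minimize the sum of squared distances to the cluster centers; the 2-D points below are made up for illustration.

```python
from sklearn.cluster import KMeans

X = [[1, 1], [1, 2], [2, 1],      # one natural group
     [8, 8], [8, 9], [9, 8]]      # another natural group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("labels  :", km.labels_)             # cluster index assigned to each point
print("centers :", km.cluster_centers_)    # the k cluster centers
print("objective (inertia):", km.inertia_) # sum of squared distances to the centers
```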

Hierarchical decomposition of the data set (with respect to a given similarity measure) into a set of nested clusters.
The result is represented by a so-called dendrogram:
  nodes in the dendrogram represent possible clusters;
  it can be constructed bottom-up (agglomerative approach) or top-down (divisive approach).
A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

Single link: cluster similarity = similarity of the two MOST similar members.
  + fast
  - potentially produces long and skinny clusters

Initial distance matrix:

      1   2   3   4   5
  1   0
  2   2   0
  3   6   3   0
  4  10   9   7   0
  5   9   8   5   4   0

Step 1: merge the closest pair (1,2) at distance 2 and recompute distances using the minimum:
  d((1,2),3) = min{d(1,3), d(2,3)} = min{6, 3} = 3
  d((1,2),4) = min{d(1,4), d(2,4)} = min{10, 9} = 9
  d((1,2),5) = min{d(1,5), d(2,5)} = min{9, 8} = 8

        (1,2)  3   4   5
  (1,2)   0
  3       3    0
  4       9    7   0
  5       8    5   4   0

Step 2: merge 3 into (1,2) at distance 3:
  d((1,2,3),4) = min{d((1,2),4), d(3,4)} = min{9, 7} = 7
  d((1,2,3),5) = min{d((1,2),5), d(3,5)} = min{8, 5} = 5

          (1,2,3)  4   5
  (1,2,3)    0
  4          7     0
  5          5     4   0

Step 3: merge (4,5) at distance 4; finally:
  d((1,2,3),(4,5)) = min{d((1,2,3),4), d((1,2,3),5)} = min{7, 5} = 5
so the last merge happens at distance 5. The dendrogram joins 1 and 2 at height 2, adds 3 at height 3, joins 4 and 5 at height 4, and joins everything at height 5.

Complete link: cluster similarity = similarity of the two LEAST similar members.
  + tight clusters
  - slow

Starting from the same distance matrix, merge (1,2) at distance 2 and recompute distances using the maximum:
  d((1,2),3) = max{d(1,3), d(2,3)} = max{6, 3} = 6
  d((1,2),4) = max{d(1,4), d(2,4)} = max{10, 9} = 10
  d((1,2),5) = max{d(1,5), d(2,5)} = max{9, 8} = 9

        (1,2)   3   4   5
  (1,2)   0
  3       6     0
  4      10     7   0
  5       9     5   4   0

Next, merge (4,5) at distance 4:
  d((1,2),(4,5)) = max{d((1,2),4), d((1,2),5)} = max{10, 9} = 10
  d(3,(4,5))     = max{d(3,4), d(3,5)} = max{7, 5} = 7

        (1,2)   3  (4,5)
  (1,2)   0
  3       6     0
  (4,5)  10     7    0

Then merge 3 into (1,2) at distance 6, and finally:
  d((1,2,3),(4,5)) = max{d((1,2),(4,5)), d(3,(4,5))} = 10
so the last merge happens at distance 10.

Dendrogram: Hierarchical Clustering

A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
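A small sketch (not from the slides) that reproduces the single-link and complete-link examples above with SciPy, using the same 5x5 distance matrix.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# The distance matrix used in the walkthrough above
D = np.array([[ 0,  2,  6, 10,  9],
              [ 2,  0,  3,  9,  8],
              [ 6,  3,  0,  7,  5],
              [10,  9,  7,  0,  4],
              [ 9,  8,  5,  4,  0]], dtype=float)
condensed = squareform(D)  # condensed (upper-triangle) form required by linkage()

for method in ("single", "complete"):
    Z = linkage(condensed, method=method)        # agglomerative merge history
    print(method, "merge distances:", Z[:, 2])    # e.g. single link: [2, 3, 4, 5]
    # "Cut" the dendrogram at distance 4.5: each connected component is a cluster
    print(method, "clusters at 4.5:", fcluster(Z, t=4.5, criterion="distance"))
```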

Understanding the Data
Data Cleaning
  Missing Values, Noisy Values, Outliers
  Dates
  Nominal/Numeric
  Discretization
Normalization
Smoothing
Transformation
Attribute selection

You can't be expected to be an expert in all fields, but understanding the data can be extremely useful for data mining.
  What data is available?
  What available data is actually relevant or useful?
  Can the data be enriched from other sources?
  Are there historical datasets available?
  Who is the real expert to ask questions of?
  (Are the results at all sensible? Or are they completely obvious?)
Answers to these questions before embarking on a data mining project are invaluable later on.

Number of instances available:
  5,000 or more for reliable results.
Number of attributes:
  Depends on the data set, but any attribute with fewer than 10 instances is typically not worth including.
Number of instances per class:
  More than 100 per class.
  If very unbalanced, consider stratified sampling.

Goal: maximizing data quality
  Assess data quality
  Correct noted deficiencies
Assess data quality:
  Analyze the data distribution: is there any strange data distribution?
  Analyze data elements: check for inconsistencies, redundancy, missing values, outliers, etc.
  Conduct a physical audit: ensure data are recorded properly, for example by cross-checking the data with the customer.
  Analyze business rules: check whether the data violate business rules.

Options for handling missing values (a small sketch follows below):
  Exclude the attribute for which data is frequently missing.
  Exclude records that have missing data.
  Extrapolate missing values from other known values.
  Use a predictive model to determine a value.
  For quantitative values, use a generic figure, such as the average.
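A small pandas sketch of some of the simpler options above; the column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 31], "income": [30_000, 52_000, None, 41_000]})

dropped_rows = df.dropna()                             # exclude records with missing data
dropped_col  = df.drop(columns=["income"])             # exclude an often-missing attribute
filled_mean  = df.fillna(df.mean(numeric_only=True))   # quantitative: use a generic figure (the average)

print(filled_mean)
```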

We want all dates to be in the same format. YYYY-MM-DD is an ISO standard, BUT it has some issues for data mining:
  Year 10,000 AD! (we only have 4 digits)
  Dates BC[E], e.g. -0300-02-03, are not valid YYYY-MM-DD dates.
  Most importantly: it does not preserve intervals, with or without the seconds.
Other representations:
  Posix/Unix system date: number of seconds since 1970
  etc.

Nominal data have no ordering, e.g., Sex, Country, etc.
Some algorithms can't deal with nominal or with numeric attributes. E.g., decision trees deal best with nominal attributes, but neural networks and many clustering algorithms require only numeric attribute values.
In case the algorithm requires converting nominal to numeric (see the sketch after this list):
  Binary field: one value is 0, the other value is 1 (e.g., gender).
  Ordered fields: convert to numbers to preserve the order (e.g., an A vs. a C grade becomes 4 and 2, respectively).
  Few values: convert each value into a new binary attribute. For example, if the possible values for attribute AT are A, B, C, D, then you can create 4 new attributes ATa, ATb, ATc, ATd, each with value either 0 or 1.
  Many values: convert into groups of values, each with its own (binary) attribute, e.g., group the states in the US into 5 groups of 10.
  Unique values: ignore identifier-like attributes (drop the attribute).
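A short pandas sketch of the conversions just listed; the data and column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],   # binary field
    "grade":  ["A", "C", "B", "A"],   # ordered field
    "AT":     ["A", "B", "C", "D"],   # few values -> one binary attribute per value
})

df["gender_num"] = (df["gender"] == "F").astype(int)           # one value 0, the other 1
df["grade_num"]  = df["grade"].map({"A": 4, "B": 3, "C": 2})   # numbers that preserve the order
one_hot = pd.get_dummies(df["AT"], prefix="AT")                # columns analogous to ATa, ATb, ATc, ATd

print(pd.concat([df, one_hot], axis=1))
```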

Some algorithms require nominal or discrete values. How can we turn a numeric value into a nominal value, or into a numeric value with a smaller range of values?
There are several discretization techniques, often called 'binning' (a small sketch of the two simplest follows below):
  Equal width
  Equal depth
  Class dependent
  Entropy
  Fuzzy (allow some fuzziness as to the edges of the bin)
  Non-disjoint (allow overlapping intervals)
  ChiMerge (use the chi-squared test in the same way as entropy)
  Iterative (use some technique, then minimise classifier error)
  Lazy (only discretize during classification (VERY lazy!))
  Proportional k-interval (number of bins = square root of the number of instances)
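A short pandas sketch of the two simplest techniques, equal width (pd.cut) and equal depth (pd.qcut); the values are made up.

```python
import pandas as pd

income = pd.Series([12, 15, 18, 22, 40, 45, 90, 95, 98, 100])  # e.g. income in thousands

equal_width = pd.cut(income, bins=3, labels=["low", "mid", "high"])   # equal-sized intervals
equal_depth = pd.qcut(income, q=3, labels=["low", "mid", "high"])     # roughly equal number of values per bin

print(pd.DataFrame({"income": income, "equal_width": equal_width, "equal_depth": equal_depth}))
```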

We might want to normalise our data so that two numeric values are comparable, for example to compare age and income (a short sketch follows below).
  Decimal scaling: v' = v / 10^k for the smallest k such that max(|v'|) < 1.
    E.g.: for -991 and 99, k is 3, and -991 becomes -0.991.
  Min/max normalisation: v' = (v - min) / (max - min) * (newMax - newMin) + newMin.
    E.g.: 73,600 in [12,000, 98,000] mapped to [0, 1] is 0.716.
  Zero-mean (z-score) normalization: v' = (v - mean) / stddev.
    E.g.: with mean 54,000, stddev 16,000 and v = 73,600, then v' = 1.225.
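A short sketch reproducing the three normalisation examples above in plain Python.

```python
def decimal_scaling(values):
    """v' = v / 10^k for the smallest k such that every |v'| < 1."""
    k = 0
    while max(abs(v) for v in values) / (10 ** k) >= 1:
        k += 1
    return [v / 10 ** k for v in values]

def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mean, stddev):
    return (v - mean) / stddev

print(decimal_scaling([-991, 99]))              # [-0.991, 0.099]  (k = 3)
print(round(min_max(73600, 12000, 98000), 3))   # 0.716
print(round(z_score(73600, 54000, 16000), 3))   # 1.225
```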

In the case of noisy data, we might want to smooth the data so that it is more uniform. Some possible techniques:
  Regression: find a function for the data, and move each value some amount closer to what the function predicts (see classification).
  Clustering: some clustering techniques remove outliers; we could cluster the data to remove these noisy values.
  Binning: we could use some technique to discretize the data, and then smooth based on those 'bins'.

Transform data into a more meaningful form (a small sketch follows below). For example:
  birth date is transformed to age;
  the date of the first transaction is transformed to the number of days since the customer became a member;
  the grades of each course are transformed to a cumulative GPA.
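A small pandas sketch (hypothetical columns, fixed reference date) of the first two transformations.

```python
import pandas as pd

today = pd.Timestamp("2012-01-01")   # reference date, fixed here only for the example
df = pd.DataFrame({
    "birth_date":        pd.to_datetime(["1980-05-17", "1991-11-02"]),
    "first_transaction": pd.to_datetime(["2010-03-01", "2011-06-15"]),
})

df["age"] = (today - df["birth_date"]).dt.days // 365          # rough age in years
df["member_days"] = (today - df["first_transaction"]).dt.days  # days since becoming a member
print(df)
```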

Before getting to the data mining, we may want to either remove instances or select only a portion of the complete data set to work with.
Why? Perhaps our algorithms don't scale well to the amount of data we have.
Techniques:
  Record selection:
    Partitioning: split the database into sections and work with each in turn. Often not appropriate unless the algorithm is designed for it.
    Sampling: select a random subset of the data and use that, which is hopefully representative.
  Attribute selection:
    Stepwise forward selection: find the best attribute and add it.
    Stepwise backward elimination: find the worst attribute and remove it.
    Genetic algorithms: use a 'survival of the fittest' approach along with random cross-breeding.
    etc.

What technique would you use to solve each of these problems?
  Given a set of applicant attributes (name, salary, age, etc.), you want to decide whether or not to approve a customer's credit card application.
  Given national examination scores, you want to group Kabupatens into three educational levels: Good, Average, Poor.
  You want to suggest to your customer a suitable pair of pants given her/his choice of shirt.
  You want to estimate the economic growth of Indonesia given some data (GNP, GDP, etc.).

Pak Bedu is a doctor who wants to use IT to help him decide whether or not his patients have cancer. To make that decision, Pak Bedu already has data on every patient, covering the results of 5 laboratory tests together with the decision of whether or not the patient has cancer. Here is a sample of the data:

ID | T1 | T2 | T3 | T4 | T5 | Cancer?
P1 |    |    |    |    |    | Yes
P2 |    |    |    |    |    | No
P3 |    |    |    |    |    | No
P4 |    |    |    |    |    | Yes
P5 |    |    |    |    |    | No

What problems can you find in Pak Bedu's data?

When Pak Bedu examines a single attribute (say T1), he finds 5 distinct values with the following counts:
  1 with a count of 1234
  2 with a count of 2037
  3 with a count of 1659
  4 with a count of 1901
  11 with a count of 1
What can you conclude by looking at these data?

Pak Bedu has expanded his clinic business to 3 locations. Pak Bedu leaves the management of patient data entirely to each clinic. Pak Bedu wants to know the characteristics of his patients by collecting data from all three clinics. The problem is that Pak Bedu is confused, because each clinic has a different data schema. What should Pak Bedu do?
  Clinic 1 schema: Pasien(Nama, TglLahir, Tinggi (meters), Berat (kg), JenisKelamin (L/P), Alamat, Provinsi)
  Clinic 2 schema: Pasien(Nama, TglLahir, Tinggi (centimeters), Berat (kg), JenisKelamin (P/W), Alamat, Kota)
  Clinic 3 schema: Pasien(Nama, Umur, Tinggi (meters), Berat (kg), JenisKelamin (L/P), Kota, Provinsi)

It turns out that Pak Bedu also runs a health college (Sekolah Tinggi Kesehatan). Being a generous person, Pak Bedu wants to give scholarships to his students who are working on their final theses. However, Pak Bedu only wants students who have the potential to finish the thesis within one semester. Pak Bedu has the grade data of students who have already graduated, together with how long it took them to finish their studies. The schemas of Pak Bedu's data include: table Mahasiswa(NPM, Nama, Umur, AsalDaerah, LamaSkripsi, LamaStudi), table MataKuliah(Kode, Nama, Kelompok), and table Nilai(NPM, KodeMK, Nilai).
Discuss what you could do to help Pak Bedu.

Pak Bedu has expanded his clinics to 100 branches. Pak Bedu wants to see patterns in patient visits: which areas have many sick people, when many people get sick, and so on. Pak Bedu has data about each clinic and the total number of patient visits per month.
What can you do?
