
DATA MINING TECHNOLOGIES AND IMPLEMENTATIONS



The basic data mining algorithms introduced may be enhanced in a number of
ways.

Basic data mining algorithms have traditionally assumed data is memory
resident, as for example in the case of the Apriori algorithm for association rule
mining or basic clustering algorithms.

Hence, techniques have been developed to ensure data mining algorithms remain
efficient with large data sets.

Another problem is that once a data mining model has been developed, there
has traditionally been no mechanism for that model to be reused
programmatically by other applications on other data sets.

Hence, standards for data mining model exchange have been developed.

This trend has accelerated as interoperability has become increasingly
important in enabling the deployment of cloud-based data mining
applications.

Finally, even for data which is held in a database or data warehouse, data
mining has traditionally been performed by dumping data from the database to
an external file, which is then transformed and mined.

This produces a series of files for each data mining application, with the
attendant problems of data redundancy, inconsistency and data dependence
that database technology was designed to overcome.

Hence, techniques and standards for tighter integration of database and data
mining technology have been developed.


DATA MINING OF LARGE DATA SETS

Algorithms for classification, clustering, and association rule mining are
considered.

CLASSIFYING LARGE DATA SETS: SUPPORT VECTOR MACHINES

To reduce the computational cost of solving the SVM optimization problem with
large training sets, chunking is used. This partitions the training set into
chunks, each of which fits into memory. The support vector parameters are
computed iteratively for each chunk. However, multiple passes over the data are
required to obtain an optimal solution.
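As an illustration, a minimal Python sketch of the chunking idea, assuming
scikit-learn's SVC; the support vectors found so far are carried into the next
chunk and the optimization problem is re-solved on that union:

import numpy as np
from sklearn.svm import SVC

def chunked_svm(X, y, chunk_size=1000, n_passes=2):
    # Support vectors retained from the chunks solved so far.
    sv_X = np.empty((0, X.shape[1]))
    sv_y = np.empty(0)
    clf = SVC(kernel="linear")
    for _ in range(n_passes):             # multiple passes refine the solution
        for start in range(0, len(X), chunk_size):
            chunk_X = np.vstack([sv_X, X[start:start + chunk_size]])
            chunk_y = np.concatenate([sv_y, y[start:start + chunk_size]])
            clf.fit(chunk_X, chunk_y)     # solve on SVs + current chunk
            sv_X = clf.support_vectors_
            sv_y = chunk_y[clf.support_]
    return clf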

Another approach is to use squashing. In this, the SVM is trained over clusters
derived from the original training set, with the clusters reflecting the distribution
of the original training records.

A further approach reformulates the optimization problem to allow solution by
efficient iterative algorithms.



CLUSTERING LARGE DATA SETS: K-MEANS

Unless there is sufficient main memory to hold the data being clustered, the
data scan at each iteration of the K-means algorithm will be very costly.

An approach suitable for large databases would:

Perform at most one scan of the database.

Work with limited memory.

Approaches include the following:

Identify three kinds of data objects:
those which are discardable because membership of a cluster has
been established;
those which are compressible which while not discardable belong to a
well-defined subcluster which can be characterized in a compact
structure;
those which are neither discardable nor compressible which must be
retained in main memory.

Alternatively, first group data objects into microclusters and then perform
k-means clustering on those microclusters, as sketched below.
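A minimal sketch of the microcluster idea, assuming scikit-learn: a cheap
first pass compresses the data into many small clusters, and standard k-means
then runs on the microcluster centroids weighted by their populations.

import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

def microcluster_kmeans(X, k, n_micro=200):
    # Pass 1: compress the data into n_micro microclusters.
    micro = MiniBatchKMeans(n_clusters=n_micro, n_init=3).fit(X)
    weights = np.bincount(micro.labels_, minlength=n_micro)
    # Pass 2: cluster the microcluster centres, each weighted by its size.
    final = KMeans(n_clusters=k, n_init=10).fit(
        micro.cluster_centers_, sample_weight=weights)
    return final.cluster_centers_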
An approach developed at Microsoft combines these ideas as follows:

1. Read a sample subset of data from the database.

2. Cluster that data with the existing model as usual to produce an
updated model.

3. On the basis of the updated model, decide for each data item from the
sample whether it needs to be:
retained in memory;
discarded, with summary information being updated; or
retained in a compressed form as summary information.

4. Repeat from 1 until a termination condition is met.
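The single-scan flavour of this loop can be sketched with scikit-learn's
MiniBatchKMeans (an assumed stand-in, not Microsoft's implementation), which
updates a k-means model incrementally from successive samples:

from sklearn.cluster import MiniBatchKMeans

def stream_kmeans(sample_chunks, k):
    model = MiniBatchKMeans(n_clusters=k)
    for chunk in sample_chunks:   # step 1: read a sample from the database
        model.partial_fit(chunk)  # step 2: update the model with the sample
    return model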


ASSOCIATION MINING OF LARGE DATA SETS: APRIORI

With one database scan for each itemset size tested, the cost of scans would
be prohibitive for the Apriori algorithm unless the database is resident in
memory.

Approaches to enhance the efficiency of Apriori include the following.

While generating 1-itemsets for each transaction, generate 2-itemsets at
the same time, hashing the 2-itemsets into a hash table of buckets.

All buckets whose final count of itemsets is less than the minimum support
threshold can be ignored subsequently, since no itemset hashed to such a
bucket can itself have the minimum required support. A sketch follows.
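A minimal Python sketch of this hash-based pruning for candidate 2-itemsets;
the bucket count and the use of Python's built-in hash are illustrative
choices, and minsup is an absolute count:

from itertools import combinations

def hash_filter_pairs(transactions, minsup, n_buckets=1024):
    item_count = {}
    bucket_count = [0] * n_buckets
    for t in transactions:
        for item in t:                            # count 1-itemsets...
            item_count[item] = item_count.get(item, 0) + 1
        for pair in combinations(sorted(t), 2):   # ...hashing 2-itemsets too
            bucket_count[hash(pair) % n_buckets] += 1
    frequent = {i for i, c in item_count.items() if c >= minsup}
    # A pair survives only if both its items are frequent and its bucket
    # reached the support threshold.
    return [p for p in combinations(sorted(frequent), 2)
            if bucket_count[hash(p) % n_buckets] >= minsup]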

Overlap the testing of k-itemsets and (k+1)-itemsets by counting
candidate (k+1)-itemsets in parallel with the counting of k-itemsets.

Unlike conventional Apriori, in which candidate (k+1)-itemsets are only
generated after the k-itemset database scan has completed, in this
approach the database scan is divided into blocks, and at the start of
each block new candidate (k+1)-itemsets can be generated and counted
during the remainder of the k-itemset scan.

Only two database scans are needed if a partitioning approach is adopted,
under which transactions are divided into n partitions, each of which can
be held in memory.

In the first scan, frequent itemsets for each partition are generated. These
are combined to create a list of candidate frequent itemsets for the database
as a whole. In the second scan, the actual support of each member of the
candidate list is checked, as sketched below.
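A minimal sketch of the two-scan structure; frequent_itemsets() is a
hypothetical helper running plain in-memory Apriori on one partition,
itemsets are frozensets, and partitions is assumed to be re-iterable
(for example, a list of lists of transactions):

def partitioned_apriori(partitions, minsup_frac, frequent_itemsets):
    # Scan 1: union of locally frequent itemsets over all partitions.
    # (An itemset frequent in the whole database must be locally frequent
    # in at least one partition, so no global answer can be missed.)
    candidates = set()
    for part in partitions:
        candidates |= frequent_itemsets(part, minsup_frac)
    # Scan 2: count the actual global support of each candidate.
    counts = dict.fromkeys(candidates, 0)
    n = 0
    for part in partitions:
        for t in part:
            n += 1
            for c in candidates:
                if c.issubset(t):
                    counts[c] += 1
    return {c for c, cnt in counts.items() if cnt >= minsup_frac * n}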

Pick a random sample of transactions which will fit in memory and search
for frequent itemsets in that sample.

This may result in some globally frequent itemsets being missed. The
chance of this happening can be lessened by adopting a lower minimum
support threshold for the sample, with the full database then being used
to check the actual support of the candidate itemsets, as sketched below.

A second database scan may be needed to ensure no frequent itemsets
have been missed.
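A sampling sketch reusing the hypothetical frequent_itemsets() helper;
sample_size and the threshold-lowering factor are illustrative:

import random

def sampled_apriori(transactions, minsup_frac, frequent_itemsets,
                    sample_size=10000, slack=0.8):
    sample = random.sample(transactions, min(sample_size, len(transactions)))
    # The lowered threshold reduces the chance of missing globally
    # frequent itemsets in the sample.
    candidates = frequent_itemsets(sample, slack * minsup_frac)
    counts = dict.fromkeys(candidates, 0)
    for t in transactions:               # verification scan over the full data
        for c in candidates:
            if c.issubset(t):
                counts[c] += 1
    n = len(transactions)
    return {c for c, cnt in counts.items() if cnt >= minsup_frac * n}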
DATA MINING STANDARDS

Data mining standards and related standards for data grids, web services and
the semantic web enable the easier deployment of data mining applications
across platforms.

Standards cover:

The overall KDD process.

Metadata interchange with data warehousing applications.

The representation of data cleaning, data reduction and transformation
processes.

The representation of data mining models.

APIs for performing data mining processes from other languages, including
SQL and Java.

CRISP-DM

CRISP-DM (CRoss Industry Standard Process for Data Mining) specifies a
process model covering the following 6 phases of the KDD process:

Business Understanding

Data Understanding

Data Preparation

Modeling

Evaluation

Deployment

www.the-modeling-agency.com/crisp-dm.pdf
PREDICTIVE MODEL MARKUP LANGUAGE PMML

PMML is an XML-based standard developed by the Data Mining Group
(www.dmg.org) which is a consortium of data mining product vendors.

PMML represents data mining models as well as operations for cleaning and
transforming data prior to modeling.

The aim is to enable an application to produce a data mining model in a
standard form (PMML XML) which another data mining application can read
and apply.

Below is an illustrative PMML representation of a simple association rules
model.
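A minimal sketch of such a document, assuming PMML 4.x element names and
namespace; the transactions, items, itemsets and rule statistics are
invented purely for illustration:

<?xml version="1.0" encoding="UTF-8"?>
<PMML version="4.4" xmlns="http://www.dmg.org/PMML-4_4">
  <Header description="Illustrative association rules model"/>
  <DataDictionary numberOfFields="2">
    <DataField name="transaction" optype="categorical" dataType="string"/>
    <DataField name="item" optype="categorical" dataType="string"/>
  </DataDictionary>
  <AssociationModel functionName="associationRules"
                    numberOfTransactions="4" minimumSupport="0.5"
                    minimumConfidence="0.5" numberOfItems="2"
                    numberOfItemsets="3" numberOfRules="1">
    <MiningSchema>
      <MiningField name="transaction" usageType="group"/>
      <MiningField name="item" usageType="active"/>
    </MiningSchema>
    <Item id="1" value="beer"/>
    <Item id="2" value="crisps"/>
    <Itemset id="1" support="0.75" numberOfItems="1">
      <ItemRef itemRef="1"/>
    </Itemset>
    <Itemset id="2" support="0.75" numberOfItems="1">
      <ItemRef itemRef="2"/>
    </Itemset>
    <Itemset id="3" support="0.5" numberOfItems="2">
      <ItemRef itemRef="1"/>
      <ItemRef itemRef="2"/>
    </Itemset>
    <AssociationRule support="0.5" confidence="0.67"
                     antecedent="1" consequent="2"/>
  </AssociationModel>
</PMML>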
The elements used above are defined in the PMML association model XML
schema, available from the Data Mining Group (www.dmg.org).
The components of a PMML document are as follows (the first two and the
last being used in the example model above):

Data dictionary. Defines a model's input attributes with their types and
value ranges.

Mining schema. Defines the attributes and roles specific to a particular
model.

Transformation dictionary. Defines the following mappings: normalization
(continuous or discrete values to numbers), discretization (continuous to
discrete values), value mapping (discrete to discrete values), aggregation
(grouping values as in SQL).

Model Statistics. Statistics about individual attributes.

Models. Includes regression models, cluster models, association rules,
neural networks, Bayesian models, sequence models.

PMML is used within the standards CWM, SQL/MM Part 6: Data Mining, JDM,
and MS Analysis Services (OLE DB for Data Mining), providing a degree of
compatibility between them all.
COMMON WAREHOUSE METAMODEL CWM

CWM supports the interchange of warehouse and business intelligence
metadata between warehouse tools, warehouse platforms and warehouse
metadata repositories in distributed heterogeneous environments.

http://www.omg.org/technology/documents/modeling_spec_catalog.htm


SQL/MM DATA MINING

The SQL Multimedia and Applications Package Standard (SQL/MM) Part 6
specifies an SQL interface to data mining applications and services through
SQL:1999 user-defined types as follows.

User-defined types for four data mining functions: association rules,
clustering, classification and regression.

Routines to manipulate these user-defined types to allow:

Setting parameters for mining activities.

Training of mining models, in which a particular mining technique is
chosen, parameters for that technique are set, and the mining model is built
with training data sets.

Testing of mining models, applicable only to regression and classification
models, in which the trained model is evaluated by comparing its
predictions with results for known data.

Application of mining models, in which the model is applied to new data to
cluster, predict or classify as appropriate. This phase is not applicable to
rule models, in which rules are determined during the training phase.

User-defined types for data structures common across these data mining
models.

Functions to capture metadata for data mining input.

For example, for the association rule model type DM_RuleModel the following
methods are supported:

DM_impRuleModel(CHARACTER LARGE OBJECT(DM_MaxContentLength))
Import a rule model expressed as PMML; returns a DM_RuleModel value.

DM_expRuleModel()
Export the rule model as PMML.

DM_getNORules()
Return the number of rules in the model.

DM_getRuleTask()
Return the data mining task value (data mining settings etc.).

JAVA DATA MINING JDM

Java Data Mining (http://www.jcp.org/en/jsr/detail?id=73) is a Java API
developed under the Java Community Process supporting common data
mining operations as well as the metadata supporting mining activities.

JDM 1.0 supports the following mining functions: classification, regression,
attribute importance (ranking), clustering and association rules.

JDM 1.0 supports the following tasks: model building, testing, application
and model import/export.

JDM does not support tasks such as data transformation, visualization and
mining unstructured data.

JDM has been designed so that its metadata maps closely to PMML, to provide
support for the generation of XML for mining models. Likewise, its metadata
maps closely to CWM to support generation of XML for mining tasks.

The JDM API maps closely to SQL/MM Data Mining to support an
implementation of JDM on top of SQL/MM.

OLE DB FOR DATA MINING & DMX SQL SERVER ANALYSIS SERVICES

OLE DB for Data Mining, developed by Microsoft and incorporated in SQL
Server Analysis Services, specifies a structure for holding information defining a
mining model and a language for creating and working with these mining
models.

The approach has been to adopt an SQL-like framework for creating, training
and using a mining model: a mining model is treated as though it were a special
kind of table. The DMX language, which is SQL-like, is used to create and work
with models.

CREATE MINING MODEL [AGE PREDICTION]
( [CUSTOMER ID] LONG KEY,
[GENDER] TEXT DISCRETE,
[AGE] DOUBLE DISCRETIZED() PREDICT,
[ITEM PURCHASES] TABLE
([ITEM NAME] TEXT KEY,
[ITEM QUANTITY] DOUBLE NORMAL CONTINUOUS,
[ITEM TYPE] TEXT RELATED TO [ITEM NAME]
)
)
USING [MS DECISION TREE]


The column to be predicted, AGE, is identified, together with the keyword
DISCRETIZED() indicating that a discretization into ranges of values is to take
place.

ITEM QUANTITY is identified as having a normal distribution, which may be
exploited by some mining algorithms.

ITEM TYPE is identified as being related to ITEM NAME. This reflects a
1:many constraint: each item has one type.

It can be seen from the column specification that a nested table
representation is used, with ITEM PURCHASES itself being a table nested
within AGE PREDICTION. A conventional table representation would result
in duplicate data in a single non-normalized table, or in data spread
over multiple normalized tables.

The USING clause specifies the algorithm that will be used to construct the
model.

Once a model has been created, it may be populated with a caseset of training
data using an INSERT statement.

Predictions are obtained by executing a prediction join to match the trained
model with the caseset to be mined. This process can be thought of as
matching each case in the data to be mined with the possible cases in the
trained model, to find a predicted value for each case that matches a case in
the model.
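As an illustration, a hedged DMX sketch of such a prediction join against the
model created above; the data source name (MyDataSource), the source table
(NewCustomers) and the join column are invented for this example:

SELECT t.[CUSTOMER ID], [AGE PREDICTION].[AGE]
FROM [AGE PREDICTION]
PREDICTION JOIN
    OPENQUERY(MyDataSource, 'SELECT * FROM NewCustomers') AS t
ON [AGE PREDICTION].[GENDER] = t.[GENDER]

The ON clause maps columns of the new cases to columns of the model; the
SELECT list returns the predicted AGE for each new case.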
SQL Server Analysis Services supports data mining algorithms for use with:

conventional relational tables

OLAP cube data

Mining techniques supported include:

classification - decision trees

clustering - k-means

association rule mining

Predictive Model Markup Language (PMML) is supported.

SQL Server Analysis Services Data Mining Tutorials
DATA MINING PRODUCTS: OPEN SOURCE

A number of open-source packages and tools support data mining capabilities,
including R, Weka, RapidMiner and Mahout.


R is both a language for statistical computing and visualisation of results, and a
wider environment consisting of packages and other tools for the development
of statistical applications.

Data mining functionality is supported through a number of packages, including
classification with decision trees using the rpart package, clustering with
k-means using the kmeans function in the base stats package, and association
rule mining with Apriori using the arules package.

http://www.r-project.org


Weka is a collection of data mining algorithms written in Java, including those
for classification, clustering and association rule mining, as well as for
visualisation.

http://www.cs.waikato.ac.nz/ml/weka/
RapidMiner consists of both tools for developing standalone data mining
applications and an environment for using RapidMiner functions from other
programming languages.

Weka and R algorithms may be integrated within RapidMiner.

An XML-based interchange format is used to enable interchange of data
between data mining algorithms.

http://www.rapidminer.com


Mahout is an Apache project to develop data mining algorithms for the Hadoop
platform.

Core MapReduce algorithms for clustering and classification are provided, but
the project also incorporates algorithms designed to run on single-node
architectures and on non-Hadoop cluster architectures.

http://mahout.apache.org
DATA MINING PRODUCTS: ORACLE

Oracle supports data mining algorithms for use with conventional relational
tables.

Mining techniques supported include:

classification - decision trees, support vector machines...

clustering - k-means...

association rule mining - Apriori

Predictive Model Markup Language (PMML) support is included.

In addition to SQL and PL/SQL interfaces, until Oracle 11 a Java API was
supported to allow applications to be developed which mine data. This was
Oracle's implementation of JDM 1.0, introduced above.
From Oracle 12, the Java API is no longer supported. Instead, support for R has
been introduced with the Oracle R Enterprise component.

Oracle R Enterprise allows R to be used to perform analysis on Oracle
database tables.

A collection of packages supports mapping of R data types to Oracle database
objects and the transparent rewriting of R expressions to SQL expressions on
those corresponding objects.

A related product is Oracle R Connector for Hadoop. This is an R package
which provides an interface between a local R environment and file system and
Hadoop, enabling R functions to be executed on data in memory, on the local
file system and in HDFS.


DATA MINING PRODUCTS: SPSS MODELER

SPSS Modeler (formerly Clementine and PASW Modeler) is a data mining tool
from IBM. Using it you can:

Obtain data from a variety of sources

Select and transform data

Visualise the data using a variety of plots and graphs

Model the data with data mining methods including

classification - decision trees, support vector machines...

clustering - k-means...

association rule mining - Apriori

Output the results in a variety of forms.

CRISP-DM and PMML support is included.

SPSS Modeler also supports integration with the data mining tools available
from database vendors including Oracle Data Miner, IBM DB2 InfoSphere
Warehouse, and Microsoft Analysis Services.


SPSS Modeler has a data stream approach, enabling data to be processed by
a series of nodes, each performing operations on the data.
In this data stream approach, data flows from:

data source nodes representing, for example, data files or database tables

via

operation nodes representing, for example, selection, sampling or
aggregation operations

and

modelling nodes representing data mining methods, for example
classification, clustering and association rule mining methods

to

graph nodes presenting the results as charts and plots, output nodes
producing results for further analysis, and export nodes producing data
files for import by external applications.

It supports a language, CLEM, for specifying the operations for analyzing and
manipulating the data within nodes in the stream.



Two tutorials are referenced below.

The first tutorial is based on data recording the response of patients with the
same illness to various drugs used to treat that illness.

A stream is built which is used to analyse and visualise the data to identify the
relationship between the drugs administered and the levels of sodium (Na) and
potassium (K) measured in patients.

The second tutorial is based on market basket data for supermarket
transactions.

A stream is built to enable association rule mining (based on Apriori) to identify
links between items purchased.

A rule induction method, C5.0, often used in the construction of decision
trees, is then applied to profile the purchasers of the product groups
identified by the rule mining.

READING

P Bradley et al., Scaling Mining Algorithms to Large Databases, CACM, 45(8),
38-43, 2002.

Y Chen et al., Practical Lessons of Data Mining at Yahoo!, Proc. CIKM '09,
1047-1055, 2009.

J Lin & D Ryaboy, Scaling Big Data Mining Infrastructure: The Twitter
Experience, ACM SIGKDD Explorations, 14(2), 6-19, 2012.

R Sumbaly et al., The Big Data Ecosystem at LinkedIn, Proc. SIGMOD '13,
1125-1134, 2013.

SPSS Modeler Tutorial 1
SPSS Modeler Tutorial 2

FOR REFERENCE

IBM SPSS Modeler 15 User's Guide
