1. Data mining algorithms have traditionally assumed data fits in memory, but techniques have been developed to efficiently handle large datasets not in memory, such as by partitioning data. Standards have also been created for sharing and exchanging data mining models.
2. These include the CRISP-DM process standard and the PMML format standard, which defines an XML schema for representing various data mining models like association rules to enable model sharing.
3. The SQL/MM standard specifies SQL interfaces and routines for manipulating common data mining functions like association rules, clustering, classification and regression to integrate them with databases.
The basic data mining algorithms introduced may be enhanced in a number of ways.
Basic data mining algorithms have traditionally assumed data is memory resident, as for example in the case of the Apriori algorithm for association rule mining or basic clustering algorithms.
Hence, techniques have been developed to keep data mining algorithms efficient with large data sets.
Another problem is that once a data mining model has been developed, there has traditionally been no mechanism for that model to be reused programmatically by other applications on other data sets.
Hence, standards for data mining model exchange have been developed.
This trend has been accelerated as interoperability issues become of increasing importance to enable the deployment of cloud computing data mining applications.
Finally, even for data which is held in a database or data warehouse, data mining has traditionally been performed by dumping data from the database to an external file, which is then transformed and mined.
This results in a series of files for each data mining application, with the resulting problems of data redundancy, inconsistency and data dependence, which database technology was designed to overcome.
Hence, techniques and standards for tighter integration of database and data mining technology have been developed.
DATA MINING OF LARGE DATA SETS
Algorithms for classification, clustering, and association rule mining are considered.
CLASSIFYING LARGE DATA SETS: SUPPORT VECTOR MACHINES
To reduce the computational cost of solving the SVM optimization problem with large training sets, chunking is used. This partitions the training set into chunks each of which fits into memory. The support vector parameters are computed iteratively for each chunk. However, multiple passes of the data are required to obtain an optimal solution.
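Chunking proper repeatedly solves the SVM dual optimization over in-memory working sets. As a simpler illustration of the same chunk-at-a-time access pattern, the sketch below trains a linear SVM by Pegasos-style hinge-loss subgradient descent, reading one memory-sized chunk at a time; all names are illustrative and this is not the working-set algorithm itself.

```python
def train_linear_svm_chunked(chunks, dim, lam=0.01, epochs=5):
    """Linear SVM via Pegasos-style subgradient descent, reading the
    training data one memory-sized chunk at a time (multiple passes)."""
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):              # multiple passes over the data
        for chunk in chunks():           # chunks() re-yields the chunks each pass
            for x, y in chunk:           # y is -1 or +1
                t += 1
                eta = 1.0 / (lam * t)    # decaying step size
                margin = y * sum(wi * xi for wi, xi in zip(w, x))
                w = [(1 - eta * lam) * wi for wi in w]  # regularization shrink
                if margin < 1:           # hinge loss active: step toward x
                    w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w
```

As with chunking, several passes over the data are generally needed before the solution stabilises.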
Another approach is to use squashing. In this, the SVM is trained over clusters derived from the original training set, with the clusters reflecting the distribution of the original training records.
A further approach reformulates the optimization problem to allow solution by efficient iterative algorithms.
CLUSTERING LARGE DATA SETS: K-MEANS
Unless there is sufficient main memory to hold the data being clustered, the data scan at each iteration of the K-means algorithm will be very costly.
An approach for large databases would:
Perform at most one scan of the database.
Work with limited memory.
Approaches include the following:
Identify three kinds of data objects: those which are discardable because membership of a cluster has been established; those which are compressible, which, while not discardable, belong to a well-defined subcluster that can be characterized in a compact structure; and those which are neither discardable nor compressible, which must be retained in main memory.
Alternatively, first group data objects into microclusters and then perform k-means clustering on those microclusters.
An approach developed at Microsoft applies such techniques as follows:
1. Read a sample subset of data from the database.
2. Cluster that data with the existing model as usual to produce an updated model.
3. On the basis of the updated model, decide for each data item from the sample whether it should be: retained in memory; discarded, with summary information being updated; or retained in a compressed form as summary information.
4. Repeat from step 1 until a termination condition is met.
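The loop above can be sketched as follows. This simplified version keeps only per-cluster sums and counts as its summary information (every point is "discarded" after the summaries are updated) rather than the full discard/compress/retain split; all names are illustrative.

```python
import random

def one_scan_kmeans(chunks, k, seed=0):
    """One scan over the data: cluster each chunk against the current model,
    then discard the raw points, keeping only per-cluster summary information
    (sums and counts) from which centroids are recomputed."""
    rng = random.Random(seed)
    centroids, sums, counts = None, None, None
    for chunk in chunks:                      # step 1: read a subset of the data
        if centroids is None:                 # initialise from the first chunk
            centroids = [list(p) for p in rng.sample(chunk, k)]
            dim = len(centroids[0])
            sums = [[0.0] * dim for _ in range(k)]
            counts = [0] * k
        for p in chunk:                       # step 2: assign to nearest centroid
            j = min(range(k),
                    key=lambda c: sum((pi - ci) ** 2
                                      for pi, ci in zip(p, centroids[c])))
            counts[j] += 1                    # step 3: keep summary, discard point
            for d, pi in enumerate(p):
                sums[j][d] += pi
        for j in range(k):                    # refresh model from the summaries
            if counts[j]:
                centroids[j] = [s / counts[j] for s in sums[j]]
    return centroids
```

Each chunk is seen exactly once, and memory use is bounded by the chunk size plus the k summaries.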
ASSOCIATION MINING OF LARGE DATA SETS: APRIORI
With one database scan for each itemset size tested, the cost of scans would be prohibitive for the Apriori algorithm unless the database is resident in memory.
Approaches to enhance the efficiency of Apriori include the following.
While generating 1-itemsets for each transaction, generate 2-itemsets at the same time, hashing the 2-itemsets to a hash table structure.
All buckets whose final count of itemsets is less than the minimum support threshold can be ignored subsequently since any itemset therein will itself not have the minimum required support.
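A sketch of this hash-based pruning (in the spirit of the DHP algorithm; the bucket count and all names are illustrative):

```python
from itertools import combinations

def hash_filtered_candidates(transactions, min_support, n_buckets=7):
    """While counting 1-itemsets, hash every 2-itemset of each transaction
    into a small table of bucket counts.  A bucket whose total count is below
    min_support cannot contain any frequent pair, so pairs hashing to it are
    pruned from the candidate 2-itemsets."""
    item_counts = {}
    buckets = [0] * n_buckets
    for t in transactions:
        for item in set(t):
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(set(t)), 2):
            buckets[hash(pair) % n_buckets] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}
    # candidate pairs: both members frequent AND the pair's bucket is heavy
    candidates = {pair for pair in combinations(sorted(frequent_items), 2)
                  if buckets[hash(pair) % n_buckets] >= min_support}
    return frequent_items, candidates
```

Because bucket counts over-count (several pairs may share a bucket), the filter can only discard pairs that are certainly infrequent; every truly frequent pair survives.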
Overlap the testing of k-itemsets and (k+1)-itemsets by counting (k+1)-itemsets in parallel with counting k-itemsets.
Unlike conventional Apriori, in which candidate (k+1)-itemsets are generated only after the k-itemset database scan has completed, in this approach the database scan is divided into blocks, and candidate (k+1)-itemsets can begin to be generated and counted at block boundaries during the k-itemset scan.
Only two database scans are needed if a partitioning approach is adopted, under which transactions are divided into n partitions, each of which can be held in memory.
In the first scan frequent itemsets for each partition are generated. These are combined to create a candidate frequent itemsets list for the database as a whole. In the second scan, the actual support for members of the candidate frequent itemsets list is checked.
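A sketch of this two-scan partitioning approach, restricted to 2-itemsets for brevity (all names are illustrative). It relies on the fact that any globally frequent itemset must be locally frequent in at least one partition:

```python
from itertools import combinations

def partitioned_frequent_pairs(partitions, min_support_ratio):
    """Scan 1 finds pairs that are locally frequent in at least one in-memory
    partition; scan 2 counts that candidate list over the whole database."""
    # Scan 1: locally frequent pairs per partition
    candidates = set()
    for part in partitions:
        local_min = min_support_ratio * len(part)
        counts = {}
        for t in part:
            for pair in combinations(sorted(set(t)), 2):
                counts[pair] = counts.get(pair, 0) + 1
        candidates |= {p for p, c in counts.items() if c >= local_min}
    # Scan 2: exact global support for the candidates
    total = sum(len(part) for part in partitions)
    global_counts = {p: 0 for p in candidates}
    for part in partitions:
        for t in part:
            for pair in combinations(sorted(set(t)), 2):
                if pair in global_counts:
                    global_counts[pair] += 1
    return {p for p, c in global_counts.items()
            if c >= min_support_ratio * total}
```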
Pick a random sample of transactions which will fit in memory and search for frequent itemsets in that sample.
This may result in some global frequent itemsets being missed. The chance of this happening can be lessened by adopting a lower minimum support threshold for the sample, with the remaining database then being used to check actual support for the candidate itemsets.
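A sketch of this sampling approach, again restricted to 2-itemsets: the sample is mined with a lowered support threshold, and the full database is then used to check actual support (function and parameter names are illustrative).

```python
import random
from itertools import combinations

def sample_then_verify_pairs(transactions, min_support_ratio,
                             sample_frac=0.3, seed=0):
    """Mine a random in-memory sample with a lowered support threshold,
    then verify the candidate pairs with a scan of the full database."""
    rng = random.Random(seed)
    n = len(transactions)
    sample = [t for t in transactions if rng.random() < sample_frac]
    lowered = min_support_ratio / 2           # lowered threshold for the sample
    counts = {}
    for t in sample:
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] = counts.get(pair, 0) + 1
    candidates = {p for p, c in counts.items() if c >= lowered * len(sample)}
    # full scan: check the actual support of each candidate
    exact = {p: 0 for p in candidates}
    for t in transactions:
        for pair in combinations(sorted(set(t)), 2):
            if pair in exact:
                exact[pair] += 1
    return {p for p, c in exact.items() if c >= min_support_ratio * n}
```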
A second database scan may be needed to ensure no frequent itemsets have been missed.
DATA MINING STANDARDS
Data mining standards and related standards for data grids, web services and the semantic web enable the easier deployment of data mining applications across platforms.
Standards cover:
The overall KDD process.
Metadata interchange with data warehousing applications.
The representation of data cleaning, data reduction and transformation processes.
The representation of data mining models.
APIs for performing data mining processes from other languages including SQL and Java.
CRISP-DM
CRISP-DM (CRoss Industry Standard Process for Data Mining) specifies a process model covering the following 6 phases of the KDD process:
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
www.the-modeling-agency.com/crisp-dm.pdf
PREDICTIVE MODEL MARKUP LANGUAGE (PMML)
PMML is an XML-based standard developed by the Data Mining Group (www.dmg.org) which is a consortium of data mining product vendors.
PMML represents data mining models as well as operations for cleaning and transforming data prior to modeling.
The aim is to enable an application to produce a data mining model in a standard form (PMML XML) which another data mining application can read and apply.
As an example, an association rules model derived from transaction data can be represented in PMML, conforming to the PMML association model XML schema.
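For illustration, a minimal association rules model in PMML might look like the sketch below (the version, field names, items and statistics are all hypothetical, loosely following the shape of the PMML AssociationModel schema):

```xml
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="Illustrative association rules model"/>
  <DataDictionary numberOfFields="2">
    <DataField name="transaction" optype="categorical" dataType="string"/>
    <DataField name="item" optype="categorical" dataType="string"/>
  </DataDictionary>
  <AssociationModel functionName="associationRules"
      numberOfTransactions="4" minimumSupport="0.5" minimumConfidence="0.5"
      numberOfItems="2" numberOfItemsets="3" numberOfRules="1">
    <MiningSchema>
      <MiningField name="transaction" usageType="group"/>
      <MiningField name="item" usageType="active"/>
    </MiningSchema>
    <Item id="1" value="beer"/>
    <Item id="2" value="crisps"/>
    <Itemset id="1" support="0.75"><ItemRef itemRef="1"/></Itemset>
    <Itemset id="2" support="0.75"><ItemRef itemRef="2"/></Itemset>
    <Itemset id="3" support="0.5">
      <ItemRef itemRef="1"/><ItemRef itemRef="2"/>
    </Itemset>
    <AssociationRule support="0.5" confidence="0.67"
        antecedent="1" consequent="2"/>
  </AssociationModel>
</PMML>
```

Here the single rule reads beer =&gt; crisps, with the antecedent and consequent attributes referring to itemset ids.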
The components of a PMML document consist of the following (the first two and the last being used in the example model above):
Data dictionary. Defines the model's input attributes with type and value range.
Mining schema. Defines the attributes and roles specific to a particular model.
Transformation dictionary. Defines the following mappings: normalization (continuous or discrete values to numbers), discretization (continuous to discrete values), value mapping (discrete to discrete values), aggregation (grouping values as in SQL).
Model Statistics. Statistics about individual attributes.
Models. Includes regression models, cluster models, association rules, neural networks, Bayesian models, sequence models.
PMML is used within the standards CWM, SQL/MM Part 6 Data Mining, JDM, and MS Analysis Services (OLE DB for Data Mining), providing a degree of compatibility between them all.
COMMON WAREHOUSE METAMODEL (CWM)
CWM supports the interchange of warehouse and business intelligence metadata between warehouse tools, warehouse platforms and warehouse metadata repositories in distributed heterogeneous environments.
SQL/MM PART 6 DATA MINING
The SQL Multimedia and Application Packages standard (SQL/MM) Part 6 specifies an SQL interface to data mining applications and services through SQL:1999 user-defined types, as follows.
User-defined types for four data mining functions: association rules, clustering, classification and regression.
Routines to manipulate these user-defined types to allow:
Setting parameters for mining activities.
Training of mining models, in which a particular mining technique is chosen, parameters for that technique are set, and the mining model is built with training data sets.
Testing of mining models (applicable only to regression and classification models), in which the trained model is evaluated by comparing its output with results for known data.
Application of mining models, in which the model is applied to new data to cluster, predict or classify as appropriate. This phase is not applicable to rule models, in which rules are determined during the training phase.
User-defined types for data structures common across these data mining models.
Functions to capture metadata for data mining input.
For example, for the association rule model type DM_RuleModel the following methods are supported:
DM_impRuleModel(CHARACTER LARGE OBJECT(DM_MaxContentLength)) - imports a rule model expressed as PMML; returns a DM_RuleModel
DM_expRuleModel() - exports the rule model as PMML
DM_getNORules() - returns the number of rules
DM_getRuleTask() - returns the data mining task value, data mining settings, etc.
JAVA DATA MINING (JDM)
Java Data Mining (http://www.jcp.org/en/jsr/detail?id=73) is a Java API developed under the Java Community Process supporting common data mining operations as well as the metadata supporting mining activities.
JDM 1.0 supports the following mining functions: classification, regression, attribute importance (ranking), clustering and association rules.
JDM 1.0 supports the following tasks: model building, testing, application and model import/export.
JDM does not support tasks such as data transformation, visualization and mining of unstructured data.
JDM has been designed so that metadata maps closely to PMML to provide support for the generation of XML for mining models. Likewise, metadata maps closely to CWM to support generation of XML for mining tasks.
The JDM API maps closely to SQL/MM Data Mining to support an implementation of JDM on top of SQL/MM.
OLE DB FOR DATA MINING & DMX (SQL SERVER ANALYSIS SERVICES)
OLE DB for Data Mining, developed by Microsoft and incorporated in SQL Server Analysis Services, specifies a structure for holding information defining a mining model and a language for creating and working with these mining models.
The approach has been to adopt an SQL-like framework for creating, training and using a mining model: a mining model is treated as though it were a special kind of table. The DMX language, which is SQL-like, is used to create and work with models.
CREATE MINING MODEL [AGE PREDICTION] (
    [CUSTOMER ID] LONG KEY,
    [GENDER] TEXT DISCRETE,
    [AGE] DOUBLE DISCRETIZED() PREDICT,
    [ITEM PURCHASES] TABLE (
        [ITEM NAME] TEXT KEY,
        [ITEM QUANTITY] DOUBLE NORMAL CONTINUOUS,
        [ITEM TYPE] TEXT RELATED TO [ITEM NAME]
    )
) USING [MS DECISION TREE]
The column to be predicted, AGE, is identified, together with the keyword DISCRETIZED() indicating that a discretization into ranges of values is to take place.
ITEM QUANTITY is identified as having a normal distribution, which may be exploited by some mining algorithms.
ITEM TYPE is identified as being related to ITEM NAME. This reflects a 1-many constraint: each item has one type.
It can be seen from the column specification that a nested table representation is used, with ITEM PURCHASES itself being a table nested within AGE PREDICTION. A conventional table representation would result either in duplicate data in a single non-normalized table or in data spread across multiple normalized tables.
The USING clause specifies the algorithm that will be used to construct the model.
Having created a model, it may be populated with a caseset of training data using an INSERT statement.
Predictions are obtained by executing a prediction join to match the trained model with the caseset to be mined. This process can be thought of as matching each case in the data to be mined with every possible case in the trained model, to find a predicted value for each case which matches a case in the model.
SQL Server Analysis Services supports data mining algorithms for use with:
conventional relational tables
OLAP cube data
Mining techniques supported include:
classification - decision trees
clustering - k-means
association rule mining
Predictive Model Markup Language (PMML) is supported.
SQL Server Analysis Services Data Mining Tutorials
DATA MINING PRODUCTS OPEN SOURCE
A number of open-source packages and tools support data mining capabilities, including R, Weka, RapidMiner and Mahout.
R is both a language for statistical computing and visualisation of results, and a wider environment consisting of packages and other tools for the development of statistical applications.
Data mining functionality is supported through a number of packages, including classification with decision trees using the rpart package, clustering with k-means using the kmeans function in the stats package, and association rule mining with Apriori using the arules package.
http://www.r-project.org
Weka is a collection of data mining algorithms written in Java, including those for classification, clustering and association rule mining, as well as for visualisation.
http://www.cs.waikato.ac.nz/ml/weka/
RapidMiner consists of both tools for developing standalone data mining applications and an environment for using RapidMiner functions from other programming languages.
Weka and R algorithms may be integrated within RapidMiner.
An XML-based interchange format is used to enable interchange of data between data mining algorithms.
http://www.rapidminer.com
Mahout is an Apache project to develop data mining algorithms for the Hadoop platform.
Core MapReduce algorithms for clustering and classification are provided, but the project also incorporates algorithms designed to run on single-node architectures and on non-Hadoop cluster architectures.
http://mahout.apache.org
DATA MINING PRODUCTS ORACLE
Oracle supports data mining algorithms for use with conventional relational tables.
Mining techniques supported include:
classification - decision trees, support vector machines...
clustering - k-means...
association rule mining - Apriori
Predictive Model Markup Language (PMML) support is included
In addition to SQL and PL/SQL interfaces, until Oracle 11 a Java API was supported to allow applications to be developed which mine data. This was Oracle's implementation of JDM 1.0, introduced above.
From Oracle 12, the Java API is no longer supported. Instead, support for R has been introduced with the Oracle R Enterprise component.
Oracle R Enterprise allows R to be used to perform analysis on Oracle database tables.
A collection of packages supports mapping of R data types to Oracle database objects and the transparent rewriting of R expressions to SQL expressions on those corresponding objects.
A related product is Oracle R Connector for Hadoop. This is an R package which provides an interface between a local R environment and file system and Hadoop, enabling R functions to be executed on data in memory, on the local file system and in HDFS.
DATA MINING PRODUCTS SPSS MODELER
SPSS Modeler (formerly Clementine and PASW Modeler) is a data mining tool from IBM. Using it you can:
Obtain data from a variety of sources
Select and transform data
Visualise the data using a variety of plots and graphs
Model the data with data mining methods including
classification - decision trees, support vector machines...
clustering - k-means...
association rule mining - Apriori
Output the results in a variety of forms.
CRISP-DM and PMML support is included.
SPSS Modeler also supports integration with the data mining tools available from database vendors including Oracle Data Miner, IBM DB2 InfoSphere Warehouse, and Microsoft Analysis Services.
SPSS Modeler has a data stream approach, enabling data to be processed by a series of nodes performing operations on the data.
In this data stream approach, data flows from:
data source nodes representing, for example, data files or database tables
via
operation nodes representing, for example selection, sampling or aggregation operations
and
modelling nodes representing data mining methods, for example classification, clustering and association rule mining methods
to
graph nodes presenting the results in a variety of formats (for example, data files, charts or plots), output nodes for further analysis, or export nodes for import by external applications.
SPSS Modeler supports a language, CLEM, for specifying the operations for analysing and manipulating the data within nodes in the stream.
Two tutorials are referenced below.
The first tutorial is based on data recording the response of patients with the same illness to various drugs used to treat that illness.
A stream is built which is used to analyse and visualise the data to identify the relationship between the drugs administered and the levels of sodium (Na) and potassium (K) measured in patients.
The second tutorial is based on market basket data for supermarket transactions.
A stream is built to enable association rule mining (based on Apriori) to identify links between items purchased.
A rule induction method, C5.0, often used in the construction of decision trees, is then applied to profile the purchasers of the product groups identified by the rule mining.
READING
P Bradley et al., Scaling Mining Algorithms to Large Databases, CACM, 45(8), 38-43, 2002.
Y Chen et al., Practical Lessons of Data Mining at Yahoo!, Proc. CIKM '09, 1047-1055, 2009.
J Lin & D Ryaboy, Scaling Big Data Mining Infrastructure: The Twitter Experience, ACM SIGKDD Explorations, 14(2), 6-19, 2012.
R Sumbaly et al., The Big Data Ecosystem at LinkedIn, Proc. SIGMOD '13, 1125-1134, 2013.