
Clustering Large Datasets

Kankanala Laxmi
Supervisor : Prof. M. Narasimha Murthy
Computer Science and Automation
Indian Institute of Science, Bangalore

Abstract

Applications in various domains often lead to very large and frequently high-dimensional data; the dimension of the data is in the hundreds or thousands, for example in text/web mining and bioinformatics. In addition to the high dimensionality, these data sets are also often sparse. Clustering such large and high-dimensional data sets is a contemporary challenge. Successful algorithms must avoid the curse of dimensionality but at the same time be computationally efficient. Finding useful patterns in large datasets has attracted considerable interest recently. On the basis of quantitative measurements on a set of objects, the task is to devise a method that assigns the objects to meaningful subclasses. The primary goal of the project is to implement an efficient new tree based clustering method, and to combine the clustering methods with classification. The implementation of the algorithm involves many challenging issues, such as achieving good accuracy in less time. The algorithm incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints). We will evaluate the time and space efficiency, data input order sensitivity, and clustering quality through several experiments.

1 Introduction

1.1 Background

Clustering is the unsupervised classification of patterns (observations, data items or feature vectors) into groups (clusters)[1]. Data clustering identifies the sparse and the crowded places, and hence discovers the overall distribution patterns of the dataset. Besides, the derived clusters can be visualized more efficiently and effectively than the original dataset. Generally, there are two types of attributes involved in the data to be clustered: metric and nonmetric. Metric variables are ones that take on values across a dimensional range. Nonmetric variables are those that change from one categorical state to another.

A pattern classifier has two phases: a design phase, where abstractions are created, and a classification phase, where test patterns are classified using these abstractions. Corresponding to these two phases, we have design time and classification time. Classification based on neural networks[2] and genetic algorithms[3] needs more design time, because such classifiers typically access the training data a large number of times. On the other hand, classification based on neighbourhood classifiers has no design phase, and hence zero design time; however, the classification phase can be computationally expensive.

1.2 Motivation

With an increasing number of new database applications dealing with very large high dimensional data sets, data mining on such data sets has emerged as an important research area. These applications include multimedia content-based retrieval, geographic and molecular biology data analysis, text mining, bioinformatics, medical applications, and time-series matching. Clustering of very large high dimensional data sets is an important problem. There are a number of different clustering algorithms that are applicable to very large data sets, and a few that address large high dimensional data.

1.3 Contributions

The goals of this project and the work done towards achieving them are described in this section. Our first goal is to come up with a new tree based clustering algorithm, which is a combination of the CF tree and the KD tree. Towards this direction, we have completed the analysis of the CF-tree and the KD-tree. Our second goal is to combine the clustering algorithm with the classification problem. We are also looking at incremental mining and Divide and Conquer approaches.

1.4 Outline of the Report

This report is organized as follows. Section 2 provides a review of relevant work in this area, i.e., two tree based clustering algorithms, namely the CF tree and the KD tree, and the Divide and Conquer approach. Section 3 describes different classification approaches. Section 4 gives a description of incremental mining. Section 5 gives the details of the different data sets which have been used for clustering. Section 6 presents experimental results of our preliminary implementation of the two clustering algorithms. We conclude this report with a road map for future work.

1.5 Review of Literature

We give a brief review of the related work done in this area. There are two main approaches to clustering: hierarchical clustering and partitioning clustering. The partitioning clustering technique partitions the database into a predefined number of clusters. Partitioning clustering algorithms include the k-means and k-medoid algorithms. The hierarchical clustering technique produces a sequence of partitions, in which each partition is nested into the next partition in the sequence. It creates a hierarchy of clusters from small to big or big to small. Numerous algorithms have been developed for clustering large data sets. Here we give a brief description of some of these algorithms.

• K-means algorithm[4]
  The algorithm is composed of the following steps (a small sketch appears after this list):
  1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.
  2. Assign each object to the group that has the closest centroid.
  3. When all objects have been assigned, recalculate the positions of the K centroids.
  4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.

• CLARANS
  Clustering Large Applications based on Randomized Search[5]
  CLARANS is a medoid-based method, which is more efficient, but it suffers from two major drawbacks: it assumes that all objects fit into main memory, and the result is very sensitive to the input order.

• DBSCAN
  Density Based Spatial Clustering of Applications with Noise[6]
  DBSCAN uses a density based notion of clusters to discover clusters of arbitrary shapes. The key idea is that, for each object of a cluster, the neighbourhood of a given radius has to contain at least a minimum number of data objects. In other words, the density of the neighbourhood must exceed a threshold.
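The k-means steps above translate directly into a short program. The following is a minimal sketch, our own illustration rather than the implementation used in this project; it assumes the objects are rows of a NumPy array and that the initial centroids are drawn at random from the data.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: place K points into the space (here: K randomly chosen data points).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate the positions of the K centroids.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 4: repeat until the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```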
The classification problem is concerned with generating a description or a model for each class from the given data set. Using the training set, the classification process attempts to generate descriptions of the classes, and those descriptions help to classify unknown records. It is possible to use frequent itemset mining algorithms to generate the clusters and their descriptions efficiently. Two popular algorithms are the following (a sketch of the first appears after this list):

• Apriori algorithm[7]
  It is the most popular algorithm for finding all the frequent itemsets. The first pass of the algorithm simply counts item occurrences to determine the frequent itemsets. A subsequent pass, say pass k, consists of two phases. First, the frequent itemsets found in pass k-1 are used to generate the candidate itemsets. Next, the database is scanned and the support of the candidates is counted.

• FP-tree
  A frequent pattern tree is a tree structure consisting of an item-prefix-tree and a frequent-item-header table. The item-prefix-tree consists of a root node labelled null. Each non-root node consists of three fields: item name, support count and node link. The frequent-item-header table consists of the item name and the head of the node link, which points to the first node in the FP-tree carrying that item name.
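As a hedged illustration of the two phases of an Apriori pass, the sketch below uses a naive candidate join and omits the usual subset-pruning step; the names `transactions` and `minsup` are our own and the toy data in the comment is illustrative only.

```python
def apriori(transactions, minsup):
    """Sketch of Apriori: transactions is a list of sets, minsup an absolute support count."""
    # First pass: count item occurrences to determine the frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s for s, c in counts.items() if c >= minsup}
    all_frequent = set(frequent)
    k = 2
    while frequent:
        # Pass k, phase 1: generate candidate k-itemsets from the frequent (k-1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Pass k, phase 2: scan the database and count the support of the candidates.
        frequent = {c for c in candidates
                    if sum(1 for t in transactions if c <= t) >= minsup}
        all_frequent |= frequent
        k += 1
    return all_frequent

# Illustrative usage with a toy transaction database:
# apriori([{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}], minsup=2)
```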
2 Clustering

Clustering is a method of grouping data into different groups, so that the data in each group share similar trends and patterns. Clustering constitutes a major class of data mining algorithms. The algorithm attempts to automatically partition the data space into a set of regions or clusters, to which the examples in the tables are assigned, either deterministically or probability-wise. The goal of the process is to identify all sets of similar examples in the data, in some optimal fashion.

The objectives of clustering are:

• to uncover natural groupings

• to initiate hypotheses about the data

• to find consistent and valid organization of the data

In this section we briefly discuss two tree based clustering algorithms, namely the CF tree and the KD tree, and we give a brief overview of the Divide and Conquer approach.

2.1 CF Tree

The CF tree[8, 9] is based on the principle of agglomerative clustering, that is, at any given stage there are smaller subclusters and the decision at the current stage is to merge subclusters based on some criteria. It maintains a set of cluster features for each subcluster. The Cluster Features of the different subclusters are maintained in a tree (in a B+ tree fashion)[10]; this tree is called the CF tree.

CF vector = (n, ls, ss), where n is the number of data objects in the CF, ls is the linear sum of the data objects, and ss is the square sum of the data objects in the CF.

A CF tree is a height-balanced tree with two parameters: branching factor B and threshold T. Each nonleaf node contains at most B entries of the form [CF_i, child_i], where i ∈ {1, 2, ..., B}, child_i is a pointer to its i-th child node, and CF_i is the CF of the subcluster represented by this child. So a nonleaf node represents a cluster made up of all the subclusters represented by its entries. A leaf node contains at most L entries, each of the form [CF_i], where i = 1, 2, ..., L. In addition, each leaf node has two pointers, "prev" and "next", which are used to chain all leaf nodes together for efficient scans. A leaf node also represents a cluster made up of all the subclusters represented by its entries. But all entries in a leaf node must satisfy a threshold requirement with respect to a threshold value T: the diameter has to be less than T.
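A minimal sketch of a CF entry is given below. It is our own illustration: the additivity of CF vectors under merging and the average pairwise distance (diameter) formula are the standard BIRCH-style choices, with ss taken as the scalar sum of squared norms; names such as `CFEntry` and `can_absorb` are not from the report.

```python
import numpy as np

class CFEntry:
    """Sketch of a cluster feature: n, the linear sum ls, and the square sum ss (a scalar)."""
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.n, self.ls, self.ss = 1, p.copy(), float(p @ p)

    def merge(self, other):
        # CF vectors are additive: merging two subclusters just adds their components.
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def diameter(self):
        # Average pairwise distance of the points summarised by this CF.
        if self.n < 2:
            return 0.0
        d2 = (2 * self.n * self.ss - 2 * float(self.ls @ self.ls)) / (self.n * (self.n - 1))
        return float(np.sqrt(max(d2, 0.0)))

def can_absorb(entry, point, threshold):
    """Threshold test for a leaf entry: absorb a point only if the merged diameter stays below T."""
    merged = CFEntry(point)
    merged.merge(entry)
    return merged.diameter() < threshold
```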
2.2 KD Tree

A KD-tree[11] is a data structure for storing a finite set of points from a k-dimensional space. It was examined in detail by Bentley, Friedman et al., 1977. A KD tree is a binary tree, designed to handle spatial data in a simple way. At each step, we choose one of the coordinates as a basis for dividing the rest of the points. For example, at the root, choose x as the basis. As in binary search trees, all items to the left of the root have an x-coordinate less than that of the root, and all items to the right of the root have an x-coordinate greater than (or equal to) that of the root.

Example: the KD tree for the points (3,7), (5,10), (8,3) and (6,12).

[Figure: a 2d-tree of the four elements, with (3,7) at the root, (8,3) and (5,10) as its children, and (6,12) below (5,10). The (3,7) node splits along the Y=7 plane and the (5,10) node splits along the X=5 plane. A second figure shows how these nodes partition the plane.]
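The construction can be sketched with a simple insertion routine. This is only an illustration: it cycles the splitting coordinate as x, y, x, ... by depth (the worked example above happens to split on y at its root, which is an equally valid convention), and the names are our own.

```python
class KDNode:
    """Sketch of a KD-tree node for k-dimensional points."""
    def __init__(self, point):
        self.point, self.left, self.right = point, None, None

def insert(root, point, depth=0):
    """Insert a point, splitting on one coordinate per level (x at depth 0, then y, ...)."""
    if root is None:
        return KDNode(point)
    axis = depth % len(point)
    if point[axis] < root.point[axis]:
        root.left = insert(root.left, point, depth + 1)
    else:                                   # greater-or-equal goes to the right subtree
        root.right = insert(root.right, point, depth + 1)
    return root

# The four example points from the text.
root = None
for p in [(3, 7), (5, 10), (8, 3), (6, 12)]:
    root = insert(root, p)
```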
2.3 Divide-and-Conquer

We study clustering under the data stream model of computation, where, given a sequence of points, the objective is to maintain a consistently good clustering of the sequence observed so far, using a small amount of memory and time. One of the first requisites of clustering a data stream is that the computation be carried out in small space.

Algorithm Small-Space(S):[12]
1. Divide S into l disjoint pieces X_1, X_2, ..., X_l.
2. For each i, find O(k) centers in X_i. Assign each point in X_i to its closest center.
3. Let X' be the O(lk) centers obtained in (2), where each center c is weighted by the number of points assigned to it.
4. Cluster X' to find k centers.
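A sketch of Small-Space is shown below. It is illustrative only: k-means from scikit-learn stands in for the O(k)-center subroutine of steps 2 and 4, the weighting of step 3 is passed to the final clustering as sample weights, and the helper name `small_space` is ours.

```python
import numpy as np
from sklearn.cluster import KMeans   # stands in for the O(k)-center subroutine

def small_space(S, k, l):
    """Sketch of Small-Space(S): S is an (n, d) array of points, with n >= k * l."""
    pieces = np.array_split(S, l)                        # 1. divide S into l disjoint pieces
    centers, weights = [], []
    for X_i in pieces:
        km = KMeans(n_clusters=k, n_init=10).fit(X_i)    # 2. find O(k) centers in each piece
        centers.append(km.cluster_centers_)
        weights.append(np.bincount(km.labels_, minlength=k))  # points assigned to each center
    X_prime = np.vstack(centers)                         # 3. X': the O(lk) weighted centers
    w = np.concatenate(weights)
    final = KMeans(n_clusters=k, n_init=10)
    final.fit(X_prime, sample_weight=w)                  # 4. cluster X' to find k centers
    return final.cluster_centers_
```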

3 Classification

Classification involves finding rules that partition the data into disjoint groups. The input for classification is the training data set, whose class labels are already known. Classification analyzes the training data set and constructs a model based on the class label, and aims to assign a class label to future unlabelled records. Since the class field is known, this type of classification is known as supervised learning. A set of classification rules is generated by such a classification process, which can be used to classify future data and to develop a better understanding of each class in the database[13].

After generating clusters, we can use any of the following methods to classify them.
3.1 KNNC

K-Nearest Neighbor (KNN)[14] classification is a very simple, yet powerful classification method. The key idea behind KNN classification is that similar observations belong to similar classes. Thus, one simply has to look for the class designators of a certain number of the nearest neighbors and weigh their class numbers to assign a class number to the unknown. The weighing scheme of the class numbers is often a majority rule, but other schemes are conceivable. The number of nearest neighbors, k, should be odd in order to avoid ties, and it should be kept small, since a large k tends to create misclassifications unless the individual classes are well-separated. It can be shown that the performance of a KNN classifier is always at least half that of the best possible classifier for a given problem. One of the major drawbacks of KNN classifiers is that the classifier needs all available data. This may lead to considerable overhead if the training data set is large. So, to reduce that overhead, we will use the KNNC for the clusters which have been generated by some clustering method.

In some clusters, all points belong to the same class. So if a pattern belongs to such a cluster, we do not have to find the KNN, because all points are in the same class. We will find KNNs only for those patterns which belong to border clusters, i.e., the clusters which contain data points of multiple classes. This will reduce the complexity of classification. For each test pattern which belongs to a border cluster, we will find the K nearest training points with respect to Euclidean distance. The classification label of that test pattern will be the majority class among those K neighbours.
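The intended scheme can be sketched as follows. This is an assumption-laden illustration: `cluster_idx` holds the indices of the training points falling in the test pattern's cluster, and the neighbour search is done over the full training set (restricting it to the border cluster would be an equally reasonable reading of the report).

```python
import numpy as np
from collections import Counter

def classify(test_point, cluster_idx, train_X, train_y, k=5):
    """train_X, train_y are NumPy arrays; cluster_idx indexes the test point's cluster."""
    labels_in_cluster = {train_y[i] for i in cluster_idx}
    if len(labels_in_cluster) == 1:
        # Pure (single-class) cluster: no nearest-neighbour search is needed.
        return labels_in_cluster.pop()
    # Border cluster: take the k nearest training points by Euclidean distance
    # and return the majority class among them.
    dists = np.linalg.norm(train_X - np.asarray(test_point), axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]
```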
3.2 Decision trees

A decision tree classifier is a predictive model which is represented as a tree. A decision tree takes as input an object described by a set of properties, and outputs a decision. Internal nodes of this tree are patterns and leaf nodes are categories. At each level of the tree, we select a feature on which we take the decision. Partitioning variables are selected automatically according to a given quality measurement criterion. Based on the decision, we split the set of patterns. The quality of predictive models is measured in terms of accuracy on unseen data. In decision trees, there are two factors that we can measure: accuracy and the number of nodes.

In decision tree classification, we can use the following two methods (a sketch of the second appears after this list):

• Single decision tree: In this we construct only one decision tree for all training patterns. So the tree construction time will not be reduced. But while classifying, we use those rules only for border cluster points. So the classification time can be reduced.

• Separate decision tree for each border cluster: In this we construct a separate decision tree for each border cluster. When we see a test point belonging to one of those clusters, we use that cluster's decision tree for identifying the class.
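A sketch of the second method is given below. It is illustrative only: scikit-learn's DecisionTreeClassifier stands in for the report's decision tree, `clusters` maps a cluster id to the indices of its training points, and `train_X`, `train_y` are assumed to be NumPy arrays.

```python
from sklearn.tree import DecisionTreeClassifier

def train_border_trees(clusters, train_X, train_y):
    """Build a separate decision tree for each border (multi-class) cluster."""
    trees = {}
    for cid, idx in clusters.items():
        if len(set(train_y[idx])) > 1:            # only border clusters get their own tree
            trees[cid] = DecisionTreeClassifier().fit(train_X[idx], train_y[idx])
    return trees

def classify(test_point, cid, clusters, train_y, trees):
    if cid in trees:                              # border cluster: use that cluster's tree
        return trees[cid].predict([test_point])[0]
    return train_y[clusters[cid][0]]              # pure cluster: every point shares one label
```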
3.3 CPAR

CPAR: Classification based on Predictive Association Rules[15]. It combines the advantages of both associative classification and traditional rule-based classification. CPAR inherits the basic idea of FOIL[16]. Instead of generating a large number of candidate rules as in associative classification, CPAR adopts a greedy algorithm to generate rules directly from the training data. Moreover, CPAR generates and tests more rules than traditional rule-based classifiers to avoid missing important rules. To avoid overfitting, CPAR uses expected accuracy to evaluate each rule and uses the best k rules in prediction. We calculate the Laplace accuracy for each rule and use it to find the best k rules. After finding the best k rules for each class among the satisfied rules, we find the average Laplace accuracy for each class, and we assign the label of the class for which we get the maximum average Laplace accuracy.

In this algorithm, we consider all positive and negative examples for each class. If we have clusters of a single class, all patterns will be positive only, so it will not be useful for those clusters. So we take the border clusters and generate rules for those clusters only. This will reduce the number of rules to be generated. While classifying a pattern, we check only the rules of the particular cluster to which the test pattern belongs.
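As a hedged illustration of the prediction step: in the CPAR paper the Laplace accuracy of a rule is given as (n_c + 1) / (n_tot + m), where n_tot is the number of examples satisfying the rule body, n_c the number of those belonging to the rule's class, and m the number of classes. The sketch below (the function names and the input layout are ours) averages the best k such accuracies per class and picks the class with the highest average, as described above.

```python
def laplace_accuracy(n_c, n_tot, num_classes):
    """Laplace (expected) accuracy of a rule: n_c of the n_tot examples satisfying
    the rule body belong to the rule's class; num_classes is the number of classes."""
    return (n_c + 1) / (n_tot + num_classes)

def predict(satisfied_rule_accuracies, k=5):
    """satisfied_rule_accuracies maps each class label to the Laplace accuracies
    of the rules of that class which the test pattern satisfies."""
    best_class, best_avg = None, -1.0
    for label, accs in satisfied_rule_accuracies.items():
        if not accs:
            continue
        top_k = sorted(accs, reverse=True)[:k]      # best k rules of this class
        avg = sum(top_k) / len(top_k)               # their average Laplace accuracy
        if avg > best_avg:
            best_class, best_avg = label, avg
    return best_class
```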
4 Incremental Mining

Historically, inductive machine learning has focused on non-incremental learning tasks, i.e., where the training set can be constructed a priori and learning stops once this set has been duly processed. There are, however, a number of areas, such as agents, where learning tasks are incremental. Efficient and scalable approaches to data-driven knowledge acquisition from distributed, dynamic data sources call for algorithms that can modify knowledge structures (e.g., pattern classifiers) in an incremental fashion without having to revisit previously processed data (examples). In this section we focus on CanTree, a tree structure for efficient incremental mining of frequent patterns. The construction of a CanTree requires only one database scan.
4.1 CanTree

In a CanTree[17], items are arranged according to some canonical order, which can be determined by the user prior to the mining process or at runtime during the mining process. Specifically, items can be consistently arranged in lexicographic or alphabetical order. Alternatively, items can be arranged according to some specific order depending on the item properties (e.g., their price values, their validity with respect to some constraints). The ordering of items is unaffected by the changes in frequency caused by incremental updates. The frequency of a node in the CanTree is at least as high as the sum of the frequencies of its children. The construction of the CanTree is independent of the threshold values; thus, it does not require user thresholds such as preMinsup. Since items are consistently ordered in a CanTree, any insertions, deletions, and/or modifications of transactions have no effect on the ordering of items in the tree. As a result, the swapping of tree nodes, which may lead to merging and splitting of tree nodes, is not needed.

A CanTree may take a large amount of memory. On the other hand, CanTrees significantly reduce computation time, because they easily find mergeable paths and require only upward path traversals. As a result, CanTrees provide users with efficient incremental mining.
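A minimal sketch of CanTree insertion is shown below (our own illustration; lexicographic order stands in for the canonical order, and the toy transactions are illustrative). Because the item order is fixed in advance, incremental insertions never require swapping, merging or splitting of nodes.

```python
class CanTreeNode:
    """Sketch of a CanTree node: item name, frequency count, and children."""
    def __init__(self, item=None):
        self.item, self.count, self.children = item, 0, {}

def insert_transaction(root, transaction):
    """Insert one transaction with items in canonical (here: lexicographic) order."""
    node = root
    for item in sorted(transaction):            # canonical order, independent of frequency
        child = node.children.setdefault(item, CanTreeNode(item))
        child.count += 1                        # a node's count >= sum of its children's counts
        node = child

# One database scan builds the tree; transactions arriving later are inserted the same way.
root = CanTreeNode()
for t in [{"b", "a", "c"}, {"a", "c"}, {"d", "b"}]:
    insert_transaction(root, t)
```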
4.2 FP Tree

Frequent Pattern Tree[9]. A fundamental problem in mining association rules is to mine the frequent itemsets (FIs). In a transaction database, if we know the support of all frequent itemsets, the association rule generation is straightforward. However, when a transaction database contains a large number of large frequent itemsets, mining all frequent itemsets might not be a good idea. An FP-tree (frequent pattern tree) is a variation of the trie data structure, which is a prefix-tree structure for storing crucial and compressed information about frequent patterns. It consists of one root labelled NULL, a set of item prefix sub-trees as the children of the root, and a frequent-item header table. Each node in an item prefix sub-tree consists of three fields: item-name, count, and node-link, where item-name indicates which item this node represents, count indicates the number of transactions containing the items in the portion of the path reaching this node, and node-link links to the next node in the FP-tree carrying the same item-name, or null if there is none. Each entry in the frequent-item header table consists of two fields, item name and head of node-link; the latter points to the first node in the FP-tree carrying that item name.

A CanTree takes a large amount of memory, so we are thinking of combining the CanTree with the FP tree.
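The node fields and the header table described above can be sketched as follows. This is illustrative only: the class and function names are ours, and the input item lists are assumed to be already ordered by frequency, as in FP-tree construction.

```python
class FPNode:
    """Sketch of an FP-tree node: item-name, count, node-link, plus children."""
    def __init__(self, item=None):
        self.item, self.count, self.node_link, self.children = item, 0, None, {}

def insert_path(root, items, header):
    """Insert one frequency-ordered item list; maintain the header table's node-links."""
    node = root
    for item in items:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item)
            node.children[item] = child
            if item not in header:
                header[item] = child          # head of node-link: first node carrying the item
            else:
                tail = header[item]
                while tail.node_link is not None:
                    tail = tail.node_link
                tail.node_link = child        # later nodes are chained via node-link
        child.count += 1
        node = child

header = {}            # frequent-item header table: item name -> head of node-link
root = FPNode()        # root labelled null
insert_path(root, ["f", "c", "a", "m"], header)
insert_path(root, ["f", "c", "a", "b"], header)
```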
5 Datasets

This section gives the details of the data sets which have been used for clustering.

5.1 KDDcup99

Network intrusion detection data.

1. KDD10 data
This is the KDDCUP'99 10% data.
Training patterns: 494021
Test patterns: 311029
Number of attributes: 36
Number of classes: 36
But we convert that into a 5 class problem.

2. KDD100 data
This is the KDDCUP'99 100% testing data, but the training data is the 10% data only. The number of classes is the same as for the 10% data.
Training patterns: 494021
Test patterns: 4898431

5.2 Covtype

Number of patterns: 581012
Number of attributes: 54
Number of classes: 8
Attribute breakdown: 54 columns of data - 10 quantitative variables, 4 binary wilderness areas and 40 binary soil type variables
Missing attribute values: none
Variable information: given are the variable name, variable type, the measurement unit and a brief description. The forest cover type is the classification problem. The order of this listing corresponds to the order of numerals along the rows of the database.

5.3 SONAR

This is the data set used by Gorman and Sejnowski in their study of the classification of sonar signals using a neural network. The task is to train a network to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock.
Number of attributes: 60
Number of training patterns: 104
Number of test patterns: 104
Number of classes: 2

6 Results

1. KDD10
• CF-Tree: accuracy 91.028; time for tree construction 4 sec; time for testing 3 sec
• KD-Tree: accuracy 96.93; time for tree construction 19 sec; time for testing 4605 sec

2. KDD100
• CF-Tree: accuracy 98.018; time for tree construction 4 sec; time for testing 26 sec
• KD-Tree: accuracy 99.93; time for tree construction 19 sec; time for testing 46143 sec

3. SONAR
• CF-Tree: accuracy 87.0182; time for tree construction 0.006 sec; time for testing 0.004 sec
• KD-Tree: accuracy 98; time for tree construction 0.01 sec; time for testing 0.02 sec

4. Covtype
• CF-Tree: accuracy 73.437; time for tree construction 8 sec; time for testing 7 sec
• KD-Tree: accuracy 96.93; time for tree construction 60.01 sec; time for testing 47.13 sec

From the above results, it is seen that the CF tree gives lower accuracy than the KD tree, but the time taken by the KD tree is very high, while the CF tree takes less space than the KD tree. So the idea is to construct a new tree type data structure for clustering which achieves better accuracy than the CF tree and takes less time than the KD tree.
7 Conclusions and Future Work

Although there are many clustering algorithms, current clustering techniques do not address all the requirements adequately and concurrently; dealing with a large number of dimensions and a large number of data items can be problematic because of time complexity. So, as future work, we are planning to implement a new clustering algorithm which will perform better than existing algorithms, by using tree based clustering and combining classification with clustering. We are looking at divide and conquer approaches to work with large datasets. Also, we are looking at incremental mining for online training data sets.
References

[1] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Comput. Surv., 31(3):264–323, 1999.

[2] M. Prakash and M. N. Murty. Growing subspace pattern-recognition methods and their neural-network models. IEEE Trans. Neural Networks, 8(1):161–168, January 1997.

[3] L. I. Kuncheva and L. C. Jain. Nearest neighbor classifier: Simultaneous editing and feature selection. Pattern Recognition Letters, 20(11-13):1149–1156, November 1999.

[4] J. MacQueen. Some methods for classification and analysis of multivariate observations. In L. M. Le Cam and J. Neyman, editors, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297, Berkeley, California, 1967. University of California Press.

[5] Raymond T. Ng and Jiawei Han. CLARANS: A method for clustering objects for spatial data mining. IEEE Trans. Knowl. Data Eng., 14(5):1003–1016, 2002.

[6] Martin Ester, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Evangelos Simoudis, Jiawei Han, and Usama Fayyad, editors, Second International Conference on Knowledge Discovery and Data Mining, pages 226–231, Portland, Oregon, 1996. AAAI Press.

[7] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. Research Report RJ 9839, IBM Almaden Research Center, San Jose, California 95120, U.S.A., June 1994.

[8] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: An efficient data clustering method for very large databases. In H. V. Jagadish and Inderpal Singh Mumick, editors, Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 103–114, Montreal, Quebec, Canada, 4–6 June 1996.

[9] Arun K. Pujari. Data Mining Techniques. Universities Press, 2001.

[10] Ramez Elmasri and Shamkant B. Navathe. Fundamentals of Database Systems. Benjamin/Cummings, Redwood City, 2nd edition, 1994.

[11] Andrew W. Moore. An introductory tutorial on kd-trees, October 08 1997.

[12] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In Danielle C. Young, editor, Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pages 359–366, Los Alamitos, California, November 12–14 2000. IEEE Computer Society.

[13] Bing Liu, Wynne Hsu, and Yiming Ma. Integrating classification and association rule mining. In KDD, pages 80–86, 1998.

[14] B. Zhang and S. N. Srihari. Fast K-nearest neighbor classification using cluster-based trees. IEEE Trans. Pattern Analysis and Machine Intelligence, 26(4):525–528, April 2004.

[15] Xiaoxin Yin and Jiawei Han. CPAR: Classification based on predictive association rules. In Daniel Barbará and Chandrika Kamath, editors, Proceedings of the Third SIAM International Conference on Data Mining, San Francisco, CA, USA, May 1-3, 2003. SIAM, 2003.

[16] J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. In P. B. Brazdil, editor, Machine Learning (ECML-93) European Conference on Machine Learning Proceedings, Vienna, Austria, pages 3–20, Berlin, Germany, 1993. Springer-Verlag.

[17] Carson Kai-Sang Leung, Quamrul I. Khan, and Tariqul Hoque. CanTree: A tree structure for efficient incremental mining of frequent patterns. In ICDM, pages 274–281. IEEE Computer Society, 2005.
