ABSTRACT
Clustering plays a vital role in data mining research. Clustering is the process of
partitioning a set of data into meaningful subclasses called clusters. It helps users understand
the natural grouping in a data set. It is a form of unsupervised classification, meaning there are
no predefined classes. Data are grouped into clusters in such a way that data in the same
group are similar and data in different groups are dissimilar. Clustering aims to maximize intra-cluster
similarity while minimizing inter-cluster similarity. Clustering is useful for discovering interesting
patterns and structures in large data sets. It can be applied in many areas, such as
marketing studies, DNA analysis, city planning, text mining, and web document classification.
Large data sets with many attributes make the task of clustering complex. Many methods have
been developed to deal with these problems, and a number of techniques have been proposed by
researchers to analyze the performance of clustering algorithms in data mining. Not all of these
techniques yield good results for every data set or for every algorithm.
The choice of clustering algorithm depends both on the type of data available and on
the particular purpose and application. Clustering analysis is one of the main analytical methods
in data mining, and each clustering algorithm suits certain kinds of input data. In this
paper, three well-known partitioning-based methods, K-means, K-medoid, and enhanced K-medoid,
are compared. The study given here explores the behavior of these three methods. Our
experimental results show that enhanced K-medoid performed better than K-means and K-medoid in terms of cluster quality and elapsed time.
Index Terms: Clustering, Classification, Partition Clustering, K-means, K-medoid, Enhanced K-medoid.
I. INTRODUCTION
This study aims to give a comparative review of three partitioning-based clustering
methods. Clustering is the division of data objects into groups of similar objects. Such groups are called
clusters. Objects in the same cluster tend to be similar, while dissimilar objects belong to
different clusters. These clusters represent groups of data and provide simplification by representing
many data objects with fewer clusters. This makes it possible to model the data by its clusters.
Clustering is similar to classification in that data are grouped. A cluster is therefore a collection of
objects which are similar to one another and dissimilar to the objects belonging to other clusters.
There exists a large number of clustering algorithms in the literature. The choice of clustering algorithm
depends both on the type of data available and on the particular purpose and application. Data
clustering arises in a wide variety of areas, such as pattern recognition and bioinformatics.
Clustering is a data description method in data mining which groups the most similar data. The purpose is to
organize a collection of data items into clusters such that items within a cluster are more similar to each
other than items in different clusters.
7 | 2015, IJAFRC All Rights Reserved
www.ijafrc.org
Given the number of partitions to be constructed (k), this type of clustering method first creates an
initial partition. It then moves objects from one group to another using an iterative relocation
technique, in search of an optimal partition. A good partitioning is one in which the distance between
objects in the same partition is small (they are related to each other) whereas the distance between
objects in different partitions is large (they are very different from each other).
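The quality criterion above can be made concrete with a short sketch. The function below is illustrative only (it is not from the paper): it compares the mean pairwise distance within clusters against the mean distance across clusters, which a good partitioning should keep small and large respectively.

```python
import numpy as np

def partition_quality(points, labels):
    """Score a partition: mean distance between objects in the same
    cluster (should be small) vs. mean distance between objects in
    different clusters (should be large)."""
    intra, inter = [], []
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(points[i] - points[j])
            (intra if labels[i] == labels[j] else inter).append(d)
    return np.mean(intra), np.mean(inter)

# Two obvious groups: a good partition keeps intra small and inter large.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
intra, inter = partition_quality(pts, [0, 0, 1, 1])
print(intra < inter)  # True for a good partitioning
```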
A. K-Means Clustering
This section describes the original K-Means clustering algorithm. The idea is to classify a given set of
data into k disjoint clusters, where the value of k is fixed in advance. The algorithm consists of
two separate phases: the first phase is to define k centroids, one for each cluster. The next phase is to take
each point belonging to the given data set and associate it with the nearest centroid. Euclidean distance is
generally used to determine the distance between data points and the centroids. When all the
points have been assigned to some cluster, the first step is complete and an early grouping is done. At this
point we need to recalculate the centroids, as the assignment of points may change the
cluster centroids. Once we find k new centroids, a new binding is created between the same data
points and the nearest new centroid, forming a loop. As a result of this loop, the k centroids may
change their positions step by step. Eventually, a situation is reached where the
centroids do not move anymore. This is the convergence criterion for clustering.
The process, which is called K-Means, appears to give partitions which are reasonably efficient in the
sense of within-class variance, corroborated to some extent by mathematical analysis and practical
experience. Also, the K-Means procedure is easily programmed and computationally economical, so
that it is feasible to process very large samples on a digital computer. K-Means is one of the first
algorithms a data analyst will use to investigate a new data set, because it is algorithmically simple,
relatively robust, and gives good enough answers over a wide variety of data sets.
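The two alternating phases described above can be sketched in a few lines of NumPy. This is a minimal illustration of the standard algorithm, not the authors' implementation; initialization details (here, random data points as seeds) are an assumption.

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Minimal K-Means: pick k initial centroids from the data, then
    alternate assignment and centroid recomputation until the centroids
    stop moving (the convergence criterion described above)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Phase 2: assign each point to its nearest centroid (Euclidean).
        dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Phase 1 (repeated): recompute each centroid as the mean of its points.
        new = np.array([points[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new, centroids):  # centroids no longer move
            break
        centroids = new
    return centroids, labels
```

On the toy data used earlier, `kmeans(pts, 2)` separates the two tight groups after a handful of iterations.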
B. K-Medoid Clustering
K-medoid is a partitioning clustering algorithm which selects k cluster centers (medoids) from among
the data objects themselves and establishes an initial partition by assigning every other object to its
nearest medoid, before iteratively swapping medoids with non-medoid objects to improve the clustering.
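A PAM-style sketch of this idea follows. It is a simplified illustration under stated assumptions (first k objects as initial medoids, greedy swap search), not the paper's enhanced variant; the key difference from K-Means is that a medoid is always an actual data object.

```python
import numpy as np

def kmedoids(points, k, iters=100):
    """PAM-style K-Medoid: medoids are actual data objects. Assign every
    object to its nearest medoid, then greedily swap a medoid with a
    non-medoid whenever the swap lowers the total assignment cost."""
    n = len(points)
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    medoids = list(range(k))                    # assumed initialization
    cost = lambda meds: dist[:, meds].min(axis=1).sum()
    best = cost(medoids)
    for _ in range(iters):
        improved = False
        for i in range(k):
            for j in range(n):
                if j in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = j                    # try swapping medoid i for object j
                c = cost(trial)
                if c < best:
                    medoids, best, improved = trial, c, True
        if not improved:                        # no swap helps: converged
            break
    labels = dist[:, medoids].argmin(axis=1)
    return [points[m] for m in medoids], labels
```

Because medoids are data objects rather than means, the method is less sensitive to outliers than K-Means, at the cost of the more expensive swap search.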