
Interval ckMeans: An Algorithm for Clustering Symbolic Data

ABSTRACT:
Clustering is the process of organizing a collection of patterns into groups based on their similarities. Fuzzy clustering techniques aim to find groups to which every object in the database belongs with some membership degree. This paper presents a new algorithm for clustering symbolic data based on the ckMeans algorithm. The new algorithm allows both the input data and the membership degrees to be intervals. To validate the proposal, it is compared with two other algorithms on the same database.
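To make the idea of a membership degree concrete, the sketch below computes the degrees of a single point using the standard fuzzy c-means formula. This is an illustration only: it is not the ckMeans update and it does not handle intervals, and the function name and the fuzzifier value m = 2 are assumptions.

```python
import math

def fuzzy_memberships(point, centers, m=2.0):
    """Standard fuzzy c-means membership degrees of one point.

    m > 1 is the fuzzifier (2.0 is an assumed default). Returns one
    degree per center; the degrees sum to 1.
    """
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]
```

A point that is closer to one center receives a higher degree for that center, but it still keeps a non-zero degree for the others.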

EXISTING SYSTEM:
Dynamic clustering methods applied to large collections, such as web pages, can yield better clusters, but they need additional computation, which increases time complexity. When dynamic document clustering is adopted for real-world applications, it may not always yield the desired output. In addition, a dynamic algorithm behaves like a static algorithm during the initial clustering.

PROPOSED SYSTEM:
Our objective is an approach for dynamic document clustering based on the structured MARDL technique. First, the documents are clustered statically using the Bisecting K-means algorithm. Before clustering, all documents must be preprocessed. The preprocessing stage consists of stop word removal and stemming. In stop word removal, words that have a negative influence on clustering, such as adverbs and conjunctions, are removed. In stemming, the root of each word is found by removing its prefixes and suffixes.

After preprocessing, the documents are grouped into the desired number of clusters using the Bisecting K-means method. In this method, each document is weighted by term frequency and inverse document frequency (TF-IDF), and documents are compared using the cosine similarity measure. Once the weights are assigned, the documents are first separated into clusters using the K-means method. The largest cluster is then split into two sub-clusters, and this step is repeated until the clusters formed have high similarity. The overall process is explained in the diagram below.
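The weighting and comparison steps described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names are hypothetical, and the particular TF-IDF variant (relative term frequency multiplied by the natural log of N / df) is an assumption, since the text does not say which variant is used.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed variant: relative term frequency times log(N / df).
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

With this variant, a term that occurs in every document gets weight zero, so it contributes nothing to the similarity between documents.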

HARDWARE REQUIREMENTS

SYSTEM    : Pentium IV 2.4 GHz
HARD DISK : 40 GB
MONITOR   : 15 VGA colour
MOUSE     : Logitech
RAM       : 256 MB
KEYBOARD  : 110 keys enhanced

SOFTWARE REQUIREMENTS

Operating system : Windows XP Professional
Front End        : JAVA
Tool             : NETBEANS IDE

MODULES
Preprocessing
Bisecting K-means
Proposed Dynamic Algorithm

MODULE DESCRIPTION
Preprocessing

Preprocessing consists of stop word removal and stemming. Stop words are usually supplied as a word list. Most of them are conjunctions or adverbs, which contribute nothing to the clustering process and can even influence it negatively. Words with very high frequency, which can be identified from a word frequency dictionary, appear in most documents, so they are not helpful for clustering either. Such words are removed in the stop word removal step, and the output is passed to the stemming step. In stemming, the root of each word is found by removing its affixes. We use the Porter stemming algorithm, which works by stripping suffixes.
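The two preprocessing steps can be sketched as below. The stop list is a tiny illustrative sample, and `crude_stem` is a deliberately simple suffix stripper standing in for the Porter stemmer; it is not the Porter algorithm, and all names here are hypothetical.

```python
import re

# Tiny illustrative stop list (assumption); a real system loads a fuller word list.
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "of", "in", "to", "is", "are", "very"}

# Suffixes tried in order by the crude stemmer below.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip one common suffix. A stand-in for Porter stemming, NOT the real algorithm."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```

For example, `preprocess("The clusters are merging")` drops "the" and "are" and reduces the remaining words to "cluster" and "merg".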

Bisecting K-means

The Bisecting K-means clustering algorithm takes the number of clusters k and the documents as input, so the clustering depends on the number of clusters requested. If an input document does not match any domain, i.e. it is not similar to the existing groups, it is placed in a separate group of similar documents. According to the number of clusters given as input, the clusters are formed as Cluster 0, Cluster 1, up to Cluster k.

Each document belongs to one of several domain categories, such as computer, medicine, mathematics or thermodynamics. These clusters then act as data points for the next level of the hierarchy, and the process iterates until one global cluster is achieved. It is observed that the quality of the clusters increases as the number of clusters increases.
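The splitting procedure can be sketched as follows, under several assumptions that the text does not fix: documents are already dense numeric vectors, distance is Euclidean rather than cosine, the two seeds are chosen deterministically, and splitting stops once k clusters exist. The function names are hypothetical.

```python
import math

def mean(points):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def two_means(points, iterations=10):
    """Split points into two groups with plain 2-means (Euclidean distance)."""
    # Deterministic seeding (assumption): the first point and the point farthest from it.
    c0 = points[0]
    c1 = max(points, key=lambda p: math.dist(p, c0))
    for _ in range(iterations):
        left = [p for p in points if math.dist(p, c0) <= math.dist(p, c1)]
        right = [p for p in points if math.dist(p, c0) > math.dist(p, c1)]
        if not left or not right:
            break
        c0, c1 = mean(left), mean(right)
    return left, right

def bisecting_kmeans(points, k):
    """Repeatedly split the largest cluster in two until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        largest = max(clusters, key=len)
        if len(largest) < 2:
            break
        left, right = two_means(largest)
        if not left or not right:
            break  # the cluster could not be split, e.g. all points identical
        clusters.remove(largest)
        clusters.extend([left, right])
    return clusters
```

Choosing the largest cluster as the one to split follows the description in this document; other variants of the method split the cluster with the lowest internal similarity instead.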

Proposed Dynamic Algorithm

After the given documents have been clustered into the desired number of clusters, the dynamic document clustering algorithm is used to assign a new document to the cluster with the highest frequency value. The algorithm takes a sample from each cluster, compares the new document with each sample, and calculates the frequency weight of the new document with respect to each cluster. It then assigns the new document to the cluster with the highest frequency value, provided that this value is within the threshold. The frequency value is calculated using the sentence importance calculation of the newly arrived document docj against each sample, and the document is placed in the cluster for which this value is highest.
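The assignment step can be sketched as below. The sentence importance calculation is not specified in this document, so cosine similarity between term-weight vectors is used here as a stand-in for the frequency value. The threshold default, the reading of "within the threshold" as "at least the threshold", and returning None for an unassigned document are all assumptions, as are the function names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def assign_document(new_vec, cluster_samples, threshold=0.2):
    """Pick the cluster whose sample scores highest against the new document.

    cluster_samples maps a cluster id to one sample vector from that cluster.
    Returns (cluster_id, score), or (None, score) if the best score is
    below the threshold (assumed behaviour).
    """
    best_id, best_score = None, 0.0
    for cluster_id, sample in cluster_samples.items():
        score = cosine(new_vec, sample)  # stand-in for the frequency value
        if score > best_score:
            best_id, best_score = cluster_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score
```

Because only one sample per cluster is compared, the cost of placing a new document grows with the number of clusters rather than with the number of documents already clustered.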

REFERENCE:
Rogerio R. de Vargas and Benjamin R. C. Bedregal, "Interval ckMeans: An Algorithm for Clustering Symbolic Data," IEEE Conference, 2011. IEEE Ref.: 978-1-61284-968-3/11.
