Visual Cluster Analysis in Data Mining

A thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy by

Ke-Bing Zhang
October 2007

Master (Honours) of Science, Macquarie University, 2002
Bachelor of Engineering, Tianjin University of Technology, 1987

Department of Computing
Division of Information and Communication Sciences
Macquarie University, NSW 2109, Australia

© 2007 Ke-Bing Zhang

To my parents

Zhang Lu-de and Zhang Xiao-lin

ACKNOWLEDGMENT
First of all, I am deeply indebted to my supervisor, Associate Professor Mehmet A. Orgun, for providing me with supervision, motivation and encouragement throughout the course of this work. His insight, breadth of knowledge and enthusiasm have been invaluable to my training as a researcher. He pointed me in the right direction at every stage of the research. Without his care, supervision and friendship, I would not have been able to complete this work.

I am also grateful to my co-supervisor, Professor Kang Zhang, for his suggestions and guidance in my research. His help has been integral to the success of this work.

I am forever indebted to my parents, Zhang Lu De and Zhang Xiao Lin, for their love, affection, patience and constant encouragement. I deeply thank my wife, Liu Yu, for her love and understanding, and especially for comforting me in times of sorrow and loneliness.

I would like to express my appreciation to my brother, Professor Kewei Zhang, for his comments on the mathematics in this work.

Finally, my thanks also go to other faculty and staff members of the Department of Computing, and to my fellow graduate students, for providing a friendly and enjoyable environment during my time here.

DECLARATION

I hereby certify that the work embodied in this thesis is the result of original research. This work has not been submitted for a higher degree to any other university or institution.

Signed:

____________________

Date :

____________________


ABSTRACT
Cluster analysis is a widely applied technique in data mining. However, most of the existing clustering algorithms are not efficient in dealing with arbitrarily shaped data distributions in extremely large and high-dimensional datasets. On the other hand, statistics-based cluster validation methods incur a very high computational cost in cluster analysis, which prevents clustering algorithms from being used effectively in practice. Visualization techniques have been introduced into cluster analysis; however, most visualization techniques employed in cluster analysis are mainly used as tools for information rendering, rather than for investigating how data behavior changes with variations of the parameters of the algorithms. In addition, the impreciseness of visualization limits its usability in contrasting the grouping information of data. This thesis proposes a visual approach called HOV3 (Hypothesis Oriented Verification and Validation by Visualization) to assist data miners in cluster analysis. HOV3 employs quantified domain knowledge, statistical measures and explorative observations as predictions to project high-dimensional data onto 2D space, revealing the gaps between the data distribution and the predictions. Based on the quantified measurement capability of HOV3, this thesis also proposes a visual external cluster validation method that verifies the stability of clustering results by comparing the data distributions of a clustered subset and non-clustered subsets projected by HOV3. With this method, data miners can perform an intuitive visual assessment and obtain a precise evaluation of the consistency of the cluster structure. This thesis also introduces a visual approach called M-HOV3 to enhance the visual separation of clusters based on the projection technique of HOV3. With enhanced separation of clusters, data miners can explore cluster distributions intuitively as well as deal with cluster validation effectively in HOV3. As a consequence, with the advantage of the quantified measurement feature of HOV3, data miners can efficiently identify the cluster number in the pre-processing stage of clustering, and effectively verify the membership formation of clusters in the post-processing stage of clustering in data mining.


Table of Contents

LIST OF FIGURES
CHAPTER 1. INTRODUCTION
CHAPTER 2. CONTRIBUTIONS
CHAPTER 3. CLUSTER ANALYSIS
  3.1 Clustering and Clustering Algorithms
    3.1.1 Partitioning methods
    3.1.2 Hierarchical methods
    3.1.3 Density-based methods
    3.1.4 Grid-based methods
    3.1.5 Model-based clustering methods
  3.2 Cluster Validation
    3.2.1 Internal criteria
    3.2.2 Relative criteria
    3.2.3 External criteria
  3.3 The issues of cluster analysis
CHAPTER 4. VISUAL CLUSTER ANALYSIS
  4.1 Multidimensional Data Visualization
    4.1.1 Icon-based techniques
    4.1.2 Pixel-oriented Techniques
    4.1.3 Geometric Techniques
  4.2 Visual Cluster Analysis
    4.2.1 MDS and PCA
    4.2.2 HD-Eye
    4.2.3 Grand Tour
    4.2.4 Hierarchical BLOB
    4.2.5 SOM
    4.2.6 FastMap
    4.2.7 OPTICS
    4.2.8 Star Coordinates and VISTA
  4.3 Major Challenges
    4.3.1 Requirements of Visualization in Cluster Analysis
    4.3.2 Motivation
  4.4 Our Approach
    4.4.1 HOV3 Model
    4.4.2 External Cluster Validation by HOV3
    4.4.3 Enhanced the Separation of Clusters By HOV3
    4.4.4 Prediction-based Cluster Analysis by HOV3
CHAPTER 5. CONCLUSION AND FUTURE WORK
  5.1 Conclusion
  5.2 Future Work
    5.2.1 Three Dimensional HOV3
    5.2.2 Dynamic Visual Cluster Analysis
    5.2.3 Quasi-Cluster Data Points Collection
    5.2.4 Combination of Fuzzy Logical approaches and HOV3
APPENDIX
BIBLIOGRAPHY

List of Figures
Figure 3-1 An example of the clustering procedure of K-means [HaK01]
Figure 3-2 Hierarchical Clustering Process [HaK01]
Figure 3-3 The grid-cell structure of grid-based clustering methods
Figure 3-4 External criteria based validation [ZOZ07a]
Figure 4-1 An example of Chernoff Faces
Figure 4-2 Stick Figure Visualization Technique
Figure 4-3 Stick Figure Visualization of the Census Data
Figure 4-4 Displaying attribute windows for data with six attributes
Figure 4-5 Illustration of the Recursive Pattern Technique
Figure 4-6 The Recursive Pattern Technique in VisDB [KeK94]
Figure 4-7 Scatterplot-Matrices [Cle93]
Figure 4-8 15,000 coloured data items in Parallel Coordinates
Figure 4-9 Star plots of data items [SGF71]
Figure 4-10 Clustering of 1352 genes in MDS by [Bes]
Figure 4-12 The framework of the HD-Eye system and its different visualization projections
Figure 4-13 The 3D data structures in HD-Eye and their intersection trails on the planes
Figure 4-14 The Grand Tour Technique and its 3D example
Figure 4-15 Cluster hierarchies shown for 1, 5, 10 and 20 clusters [SBG00]
Figure 4-16 Model matching with SOM by [KSP01]
Figure 4-17 Data structure mapped in Gaussian bumps by OPTICS [ABK+99]
Figure 4-18 Clustering structure of 30,000 16-dimensional data items visualized by OPTICS
Figure 4-19 Positioning a point by an 8-attribute vector in Star Coordinates [Kan01]
Figure 4-20 Axis scaling, angle rotation and footprint functions of Star Coordinates [Kan01]



CHAPTER 1 INTRODUCTION
Cluster analysis, also called segmentation analysis or taxonomy analysis [MaW], aims to identify homogeneous objects and organize them into a set of groups, named clusters, by given criteria. Clustering is a very important technique of knowledge discovery for human beings. It has a long history and can be traced back to the times of Aristotle [HaJ97]. These days, cluster analysis is mainly conducted on computers to deal with very large-scale and complex datasets. With the development of computer-based techniques, clustering has been widely used in data mining, ranging from web mining, image processing, machine learning, artificial intelligence, pattern recognition, social network analysis, bioinformatics, geography, geology, biology, psychology, sociology, customer behavior analysis and marketing to e-business and other fields [JMF99] [Har75].

Cluster analysis includes two major aspects: clustering and cluster validation. Clustering aims to distinguish objects into groups according to certain criteria. The grouped objects are called clusters, where the similarity of objects is high within clusters and low between clusters. To achieve different application purposes, a large number of clustering algorithms have been developed [JMF99, Ber06]. However, there is no general-purpose clustering algorithm that fits all kinds of applications; thus the evaluation of the quality of clustering results plays a critical role in cluster analysis. This is the task of cluster validation, which aims to assess the quality of clustering results and find a cluster scheme that fits a specific application.

However, in practice, it may not always be possible to cluster huge datasets successfully using clustering algorithms, due to the weakness of most existing automated clustering algorithms in dealing with arbitrarily shaped data distributions. As Abul et al. pointed out, "in high dimensional space, traditional clustering algorithms tend to break down in terms of efficiency as well as accuracy because data do not cluster well anymore" [AAP+03]. In addition, the very high computational cost of statistics-based cluster validation methods directly impacts the efficiency of cluster validation [HCN01].

The clustering of large datasets in data mining is an iterative process involving humans [JMF99]. Thus, the user's initial estimation of the cluster number is important for choosing the parameters of clustering algorithms in the pre-processing stage of clustering. Also, the user's clear understanding of the cluster distribution is helpful for assessing the quality of clustering results in the post-processing stage of clustering. All of these heavily rely on the user's visual perception of the data distribution. Clearly, visualization is a crucial aspect of cluster exploration and verification in cluster analysis. Visual presentations can be very powerful in revealing trends, highlighting outliers, showing clusters, and exposing gaps in data [Shn01]. Therefore, introducing visualization techniques to explore and understand high-dimensional datasets is becoming an efficient way to combine human intelligence with the immense brute-force computation power available nowadays [PGW03].

Visualization used in cluster analysis maps high-dimensional data to a 2D or 3D space and gives users an intuitive and easily understood graph/image that reveals the grouping relationships among the data. As an indispensable revealing technique, visualization is involved in almost every step of data mining [Chi00, AnK01]. Visual cluster analysis is a combination of visualization and cluster analysis. The data sets that clustering algorithms deal
with are normally of high dimensionality (>3D). Thus, choosing a suitable technique to visualize clusters of high-dimensional data is the first task of visual cluster analysis. There has been much work on multidimensional data visualization [WoB94], but those earlier techniques are not suitable for visualizing cluster structures in very high-dimensional and very large datasets. With the increasing application of clustering in data mining, more and more visualization techniques have been developed in the last decade to study the structure of datasets in cluster analysis applications [OlL03, Shn05]. Several approaches have been proposed for visual cluster analysis [ABK+99, ChL04, HCN01, HuL00, Kan01, KSP01, HWK99, SBG00], but their arbitrary exploration of grouping information makes them inefficient and time consuming in the cluster exploration stage. On the other hand, the impreciseness of visualization limits its utilisation in the quantitative verification and validation of clustering results. Thus, the motivation of this research is to develop a visualization technique featuring purposeful cluster detection and precise contrast between clustering results.

To mitigate the above-mentioned problems, we propose, based on hypothesis testing, a visual projection technique called Hypothesis Oriented Verification and Validation by Visualization (HOV3) [ZOZ06a, ZOZ06b]. HOV3 generalizes the random adjustments of Star Coordinates based techniques as measure vectors. Thus, compared with the Star Coordinates based techniques, HOV3 has several advantages. First, data miners can summarize their prior knowledge of the studied data as measure vectors, i.e., hypotheses about the data. Based on hypothesis testing, data miners can then quantitatively analyze the data distribution projected by HOV3 with hypotheses. Second, HOV3 avoids the arbitrariness and randomness of most existing
visual techniques in cluster exploration, for example Star Coordinates [Kan01] and its implementations, such as VISTA/iVIBRATE [ChL04, ChL06]. As a consequence, HOV3 provides data miners with a purposeful and effective visual method for cluster analysis [ZOZ06b]. Based on the quantified measurement feature of HOV3, we propose a visual external cluster validation model to verify the consistency of cluster structures. Compared with statistics-based external cluster validation methods, we show that the HOV3-based external cluster validation model is more intuitive and effective [ZOZ07a]. We also introduce a visual approach called M-HOV3/M-mapping to enhance the visual separation of clusters [ZOZ07b]. With the above features of HOV3, a prediction-based visual approach is proposed to explore and verify clusters [ZOZ07c, ZOZ07d]. The next chapter presents the contributions of this thesis in more detail.

This thesis is structured as follows: Chapter 2 summarizes the contributions of this thesis. Chapter 3 gives an introduction to clustering, clustering algorithms and cluster validation. Chapter 4 reviews related work on high-dimensional data visualization and the visual techniques that have been used in cluster analysis. Finally, Chapter 5 summarises the work in the thesis and discusses future work.

CHAPTER 2 CONTRIBUTIONS
This is a publication-based thesis. Its main contributions have been published in the proceedings of five international conferences. A follow-up report has also been submitted to an established journal for publication. Below, we summarise the contributions of the thesis in the chronological order of those publications:

1. Hypothesis Oriented Verification and Validation by Visualization (HOV3) model:


To fill the gap between imprecise cluster detection by visualization and the unintuitive results often obtained by clustering algorithms, a novel visual projection technique called Hypothesis Oriented Verification and Validation by Visualization (HOV3) is proposed [ZOZ+06, ZOZ06]. The aim of interactive visualization in cluster exploration and rendering is to help data miners obtain visually separated groups or a fully separated clustering result. For example, Star Coordinates and its extensions provide such interaction by tuning the weight value of each axis (axis scaling in Star Coordinates [Kan01], α-adjustment in VISTA/iVIBRATE [ChL04, ChL06]), but their arbitrary and random adjustments limit their applicability. HOV3 generalizes these adjustments as a coefficient/measure vector [ZOZ06], as sketched below. Compared with the Star Coordinates model and its implementations, such as VISTA/iVIBRATE, it is observed that HOV3 has better performance in cluster detection. This is because HOV3 provides data miners with a mechanism to quantify their knowledge or hypotheses as measure vectors for precisely exploring grouping information.
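To make the generalized measure-vector adjustment concrete, here is a minimal sketch (ours, not the thesis's implementation) of a Star Coordinates style projection driven by a measure vector; the function name hov3_project and the assumption that the attributes are normalized are ours:

```python
import numpy as np

def hov3_project(data, measure):
    """Project an (n x d) dataset onto the complex plane.

    Each dimension k is an axis at angle 2*pi*k/d (the Star Coordinates
    layout); measure[k] weights that axis, so a single measure vector
    generalizes the per-axis scaling adjustments mentioned above.
    Assumes each attribute has been normalized, e.g. to [0, 1].
    """
    d = data.shape[1]
    axes = np.exp(2j * np.pi * np.arange(d) / d)  # unit vector of each axis
    return (data * measure) @ axes                # one complex point per row

rng = np.random.default_rng(0)
X = rng.random((100, 6))
plain = hov3_project(X, np.ones(6))               # unweighted Star Coordinates
tuned = hov3_project(X, np.array([1.0, 0.2, 1.0, 0.5, 1.0, 0.8]))  # a hypothesis
```

A uniform measure vector reproduces the plain Star Coordinates layout, while any quantified hypothesis about the relative importance of the attributes can be plugged in as the measure vector.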

As a consequence, HOV3 provides a bridge between qualitative analysis and quantitative analysis. Based on the idea of obtaining grouping clues by contrasting a dataset against quantified measures, HOV3 synthesizes the feedback from exploratory discovery and the user's domain knowledge to produce quantified measures, and then projects the test dataset against those measures. Geometrically, HOV3 reveals the data distribution against the measures in visual form. This approach not only inherits the intuitive and easily understood features of visualization, but also avoids the randomness and arbitrary exploration of the existing visual methods employed in data mining.

[ZOZ+06] K-B. Zhang, M. A. Orgun, K. Zhang and Y. Zhang, "Hypothesis Oriented Cluster Analysis in Data Mining by Visualization", Proceedings of the Working Conference on Advanced Visual Interfaces (AVI 2006), May 23-26, 2006, Venezia, Italy, ACM Press, pp. 254-257 (2006)

[ZOZ06] K-B. Zhang, M. A. Orgun and K. Zhang, "HOV3: An Approach for Visual Cluster Analysis", Proceedings of the 2nd International Conference on Advanced Data Mining and Applications (ADMA 2006), Xi'an, China, August 14-16, 2006, Lecture Notes in Computer Science, Volume 4093, Springer, pp. 316-327 (2006)

2. An Algorithm for External Cluster Validation based on Data Distribution Matching:


This part of the work starts with the assumption that if two same-sized data sets have a similar cluster structure, then after applying the same linear transformation to both data sets, the similarity of the newly produced distributions of the two sets will still be high. With the quantified measurement feature of HOV3, an external cluster validation method based on distribution matching is proposed to verify the consistency of cluster structures between a clustered subset and the non-clustered subsets of a large dataset [ZOZ07a]. In this approach, a clustered subset of the dataset is chosen as a visual model, and the similarity of cluster structures between the model and other same-sized non-clustered subsets of the dataset is verified by projecting them together in HOV3. As a consequence, the user
can utilize the well-separated clusters produced by scaling the axes in HOV3 as a model to pick out their corresponding quasi-clusters, i.e., the groups of points in a non-clustered subset that overlap the clusters. In addition, instead of using statistical methods to assess the similarity between the two subsets, this approach simply computes the overlapping rate between the clusters and their quasi-clusters to show their consistency. The experiments show that when the HOV3-based external cluster validation method is introduced into cluster analysis, it can produce more effective cluster validation results than those obtained from pure clustering algorithms, for example K-means, with statistics-based validation methods [ZOZ07a]. A schematic sketch of the overlap computation is given below.

[ZOZ07a] K-B. Zhang, M. A. Orgun and K. Zhang, "A Visual Approach for External Cluster Validation", Proceedings of the First IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007), Honolulu, Hawaii, USA, April 1-5, 2007, IEEE Press, pp. 576-582 (2007)
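The following sketch of ours is only a schematic reading of the method: a nearest-centroid assignment in the projected 2D plane stands in for picking out the points that fall onto a cluster's visual footprint, and all names are hypothetical.

```python
import numpy as np

def quasi_cluster_overlap(model_pts, model_labels, new_pts):
    """model_pts: projected 2D points of the clustered (model) subset;
    model_labels: their cluster labels; new_pts: projected 2D points of a
    same-sized non-clustered subset. Each new point is assigned to the
    nearest model-cluster centroid, and the ratio of each quasi-cluster's
    size to its model cluster's size is returned; ratios near 1 across all
    clusters suggest a consistent cluster structure."""
    ids = np.unique(model_labels)
    centroids = np.array([model_pts[model_labels == c].mean(axis=0) for c in ids])
    dist = np.linalg.norm(new_pts[:, None, :] - centroids[None, :, :], axis=2)
    assigned = ids[dist.argmin(axis=1)]
    return {int(c): (assigned == c).sum() / (model_labels == c).sum() for c in ids}
```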

3. M-HOV3/M-Mapping, Enhancing the Separation of Clusters:


To visually separate overlapping clusters, an approach called M-HOV3/M-mapping is introduced to enhance the separation of clusters by HOV3 [ZOZ07b, ZOZ07c]. Technically, if it is observed that several groups of data points can be roughly separated (with ambiguous points existing between groups) by projecting a measure vector in HOV3 onto a data set, then applying M-HOV3/M-mapping with the same measure vector to the data set leads to groups that are more contracted and better separated. These features of M-HOV3/M-mapping are significant for identifying the membership formation of clusters in the processes of cluster exploration and cluster verification. This is because the contracting feature of M-HOV3/M-mapping keeps the data points within a cluster relatively closer, i.e., grouping information is preserved, while the enhanced separation feature pushes distant data points relatively further apart. With the advantage of the enhanced separation and contraction features
of M-HOV3/M-mapping, the user can efficiently identify the cluster number in the pre-processing stage of clustering, and also effectively verify the membership formation of data points among the clusters in the post-processing stage of clustering.

[ZOZ07b] K-B. Zhang, M. A. Orgun and K. Zhang, "Enhanced Visual Separation of Clusters by M-mapping to Facilitate Cluster Analysis", Proceedings of the 9th International Conference on Visual Information Systems (VISUAL 2007), June 28-29, 2007, Shanghai, China, Lecture Notes in Computer Science, Volume 4781, Springer, pp. 285-297 (2007)

4. Prediction-based Cluster Detection by HOV3:


With the quantified measurement of HOV3 and the enhanced separation features of M-HOV3, the user can not only summarise their previously explored knowledge about datasets as predictions, but also directly introduce the abundant statistical measurements of the studied data as predictions to investigate cluster clues or refine clustering results purposefully and effectively [ZOZ07c, ZOZ07d]. In fact, prediction-based cluster detection by statistical measurements in HOV3 leads to more purposeful cluster exploration, and it gives an easier geometrical interpretation of the data distribution. In addition, with the statistical predictions in HOV3 the user may even be able to expose cluster clues that are not easily found by random cluster exploration.

[ZOZ07c] K-B. Zhang, M. A. Orgun and K. Zhang, "A Prediction-based Visual Approach for Cluster Exploration and Cluster Validation by HOV3", Proceedings of the 18th European Conference on Machine Learning / 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2007), Warsaw, Poland, September 17-21, 2007, Lecture Notes in Computer Science (LNAI), Volume 4702, Springer, pp. 336-349 (2007)

5. Prediction-based Cluster Validation by HOV3:


When mapping high-dimensional data into two-dimensional space, there may be overlapping data points and ambiguities in the visual form; therefore, separating clusters out of many overlapping data points is an aim of this thesis. Based on work such as M-HOV3/M-mapping and HOV3 with statistical measurements, the measures that result in fully separated clusters can be treated as predictions and introduced into external cluster validation based on data distribution matching by HOV3. In principle, any linear transformation, even a complex one, can be employed in HOV3 if it separates clusters well. With well-separated clusters, we can improve the efficiency of external cluster validation by HOV3 [ZOZ07c, ZOZ07d].

[ZOZ07d] K-B. Zhang, M. A. Orgun and K. Zhang, "Predictive Hypothesis Oriented Cluster Analysis by Visualization", Journal of Data Mining and Knowledge Discovery (submitted)


CHAPTER 3 CLUSTER ANALYSIS


Cluster analysis is an exploratory discovery process. It can be used to discover structures in data without providing an explanation or interpretation [JaD88]. Cluster analysis includes two major aspects: clustering and cluster validation. Clustering aims at partitioning objects into groups according to given criteria. To achieve different application purposes, a large number of clustering algorithms have been developed [JaD88, KaR90, JMF99, Ber06]. However, since there is no general-purpose clustering algorithm that fits all kinds of applications, an evaluation mechanism is required to assess the quality of the clustering results produced by different clustering algorithms, or by one clustering algorithm with different parameters, so that the user may find a cluster scheme that fits a specific application. This quality assessment process is regarded as cluster validation. Cluster analysis is an iterative process of clustering and cluster verification by the user, facilitated by clustering algorithms, cluster validation methods, visualization and domain knowledge of the databases.

In this chapter, we give a review of cluster analysis as the background of this thesis. First we introduce clustering, clustering algorithms and their features, as well as the drawbacks of these algorithms. This is followed by an introduction to cluster validation, the existing cluster validation methods, and the problems with these approaches.

3.1 Clustering and Clustering Algorithms


Clustering is considered an unsupervised classification process [JMF99]. The clustering problem is to partition a dataset into groups (clusters) so that the data elements within a
cluster are more similar to each other than to data elements in different clusters, by given criteria. A large number of clustering algorithms have been developed for different purposes [JaD88, KaR90, JMF99, XuW05, Ber06]. Based on the strategy of how data objects are distinguished, clustering techniques can be broadly divided into two classes: hierarchical clustering techniques and partitioning clustering techniques [Ber02]. However, there is no clear boundary between these two classes, and some efforts have been made on combining different clustering methods to deal with specific applications. Beyond the two traditional hierarchical and partitioning classes, there are several clustering techniques that are categorized into independent classes, for example density-based methods, grid-based methods and model-based clustering methods [HaK01, Ber06, Pei]. A short review of these methods is given below.

3.1.1 Partitioning methods


Partitioning clustering algorithms, such as K-means [Mac67], the K-medoids method PAM [KaR87], CLARA [KaR90] and CLARANS [NgH94], assign objects into k (a predefined cluster number) clusters, and iteratively reallocate objects to improve the quality of the clustering results. K-means is the most popular and easy-to-understand clustering algorithm [Mac67]. The main idea of K-means is summarised in the following steps:

1. Arbitrarily choose k objects as the initial cluster centers (centroids);
2. Assign each object to the cluster associated with the closest centroid;
3. Compute the new position of each centroid from the mean value of the objects in its cluster; and
4. Repeat steps 2 and 3 until the centroids no longer change.

Figure 3-1 presents an example of the process of the K-means clustering algorithm, and a minimal implementation sketch follows.
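The four steps translate directly into code. Below is a minimal NumPy sketch of ours (not an implementation from the thesis), which for simplicity does not handle the empty-cluster case:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal K-means following the four steps above."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # step 1
    for _ in range(max_iter):
        # step 2: assign each object to the closest centroid
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        # step 3: recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                 # step 4: fixed point
            break
        centroids = new_centroids
    return labels, centroids
```

Running this sketch with several different seeds illustrates the sensitivity to the initial centroids discussed next.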


Figure 3-1 An example of the clustering procedure of K-means [HaK01]

However, the K-means algorithm is very sensitive to the selection of the initial centroids; in other words, different initial centroids may produce significantly different clustering results. Another drawback of K-means is that there is no general theoretical solution for finding the optimal number of clusters for any given data set. A simple solution is to compare the results of multiple runs with different values of k and choose the best one according to a given criterion, but when the data size is large, multiple runs of K-means and the comparison of the clustering results after each run are very time consuming.

Instead of using the mean value of the data objects in a cluster as the center of the cluster, a variation of K-means, K-medoids, uses the medoid of the objects in each cluster. The process of the K-medoids algorithm is quite similar to that of K-means. K-means, however, is very sensitive to outliers, which can seriously influence the clustering results, since means are easily distorted by extreme values. To address this problem, some efforts have been made based on K-medoids; for example, PAM (Partitioning Around Medoids) was proposed by Kaufman and Rousseeuw [KaR87]. PAM
inherits the features of the K-medoids clustering algorithm and is equipped with a medoid-swap mechanism to produce better clustering results. PAM is more robust than K-means in terms of handling noise and outliers, since the medoids in PAM are less influenced by outliers. With a computational cost of O(k(n-k)²) for each iteration of the swap (where k is the number of clusters and n the number of items in the data set), it is clear that PAM performs well only on small datasets and does not scale well to large datasets. In practice, PAM is embedded in statistical analysis systems, such as SAS, R, S+, etc. To deal with applications on large datasets, CLARA (Clustering LARge Applications) was proposed [KaR90]. By applying PAM to multiple sampled subsets of a dataset, CLARA can produce better clustering results than PAM on larger datasets. But the efficiency of CLARA depends on the sample size; moreover, a local optimum clustering of samples may not be the global optimum of the whole data set.

Ng and Han [NgH94] abstract the medoid search in PAM and CLARA as a search for k subgraphs in a graph of n points, and based on this understanding they propose a PAM-like clustering algorithm called CLARANS (Clustering Large Applications based upon RANdomized Search). While PAM searches the whole graph and CLARA searches some random sub-graphs, CLARANS randomly samples a set and selects the k medoids by hill-climbing over sub-graphs. CLARANS selects the neighboring objects of medoids as candidates for new medoids, and samples subsets multiple times to verify the medoids and avoid bad samples. Obviously, this repeated sampling and verification is time consuming, which prevents CLARANS from clustering very large datasets in an acceptable time period.

3.1.2 Hierarchical methods



Hierarchical clustering algorithms arrange objects in tree-structured clusters, i.e., a cluster can contain data points or representatives of lower-level clusters [HaK01]. Hierarchical clustering algorithms can be classified into two categories according to their clustering process: agglomerative and divisive. The processes of agglomerative and divisive clustering are exhibited in Figure 3-2.

Figure 3-2 Hierarchical Clustering Process [HaK01]

Agglomerative: one starts with each of the units in a separate cluster and ends up with a single cluster that contains all units.

Divisive: one starts with a single cluster of all units and then forms new clusters by dividing those determined at previous stages, until one ends up with clusters containing individual units.

AGNES (Agglomerative Nesting) adopts the agglomerative strategy to merge clusters [KaR90]. AGNES arranges each object as a cluster at the beginning, then merges clusters into upper-level clusters step-by-step by the given agglomerative criteria until all objects form one cluster, as shown in Figure 3-2. The similarity between two clusters is measured by the similarity function of the closest pair of data points in the two clusters, i.e., single link. DIANA (Divisive ANAlysis)
adopts the opposite, divisive strategy: it initially puts all objects in one cluster, then splits them into lower-level clusters until each cluster contains only one object [KaR90]. The merging/splitting decisions are critical in AGNES and DIANA, and with O(n²) computational cost, their application is not scalable to very large datasets. A naive sketch of single-link agglomerative merging is given at the end of this subsection.

Zhang et al. [ZRL96] proposed an effective hierarchical clustering method, BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), to deal with the above problems. BIRCH summarizes an entire dataset into a multi-level compression structure, the CF-tree, and then runs a hierarchical clustering algorithm on it to obtain the clustering result. It has good linear scalability, producing a clustering with a single scan whose quality can be further improved by a few additional scans, and it is an efficient clustering method for arbitrarily shaped clusters. But BIRCH is sensitive to the input order of the data objects and can only deal with numeric data, which limits its stability and scalability in real-world applications.

CURE uses a set of representative points to describe the boundary of a cluster in its hierarchical algorithm [GRS98]. But as the complexity of cluster shapes increases, the number of representative points increases dramatically in order to maintain precision. CHAMELEON [KHK99] employs a multilevel graph-partitioning algorithm on the k-nearest-neighbour graph, which may produce better results than CURE on complex cluster shapes for spatial datasets. But the high complexity of the algorithm prevents its application to higher-dimensional datasets.
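As an illustration of the agglomerative process with the single-link criterion described above, here is a naive sketch of ours; real systems use far more efficient structures (e.g. BIRCH's CF-tree) than this O(n³)-per-step loop:

```python
import numpy as np

def single_link_agnes(X, target=1):
    """Naive AGNES: repeatedly merge the two clusters whose closest pair of
    points (single link) is nearest, until `target` clusters remain."""
    D = np.linalg.norm(X[:, None] - X[None], axis=2)   # pairwise distances
    clusters = [[i] for i in range(len(X))]
    dendrogram = []                                     # record of the merges
    while len(clusters) > target:
        d, i, j = min((D[np.ix_(a, b)].min(), i, j)     # single-link distance
                      for i, a in enumerate(clusters)
                      for j, b in enumerate(clusters) if j > i)
        dendrogram.append((clusters[i][:], clusters[j][:], d))
        clusters[i] += clusters[j]                      # merge the two clusters
        del clusters[j]
    return clusters, dendrogram
```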

3.1.3 Density-based methods


The primary idea of density-based methods is that, for each point of a cluster, the neighborhood within a given radius must contain at least a minimum number of points, i.e., the
density in the neighborhood should reach some threshold [EKS+96]. However, this basic idea alone assumes that clusters have spherical or regular shapes. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) was proposed, adopting density-reachability and density-connectivity to handle arbitrarily shaped clusters and noise [EKS+96]. But DBSCAN is very sensitive to the parameters Eps (the neighborhood radius) and MinPts (the density threshold), because the user is expected to estimate Eps and MinPts before exploring the clusters; a minimal sketch of the algorithm is given at the end of this subsection. DENCLUE (DENsity-based CLUstEring) is a distribution-based algorithm [HiK98] that performs well on clustering large datasets with high noise; it is also significantly faster than the existing density-based algorithms, but it needs a large number of parameters. OPTICS is good at investigating arbitrarily shaped clusters, but its non-linear complexity often makes it applicable only to small or medium datasets [ABK+99].
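The roles of Eps and MinPts can be made concrete with a minimal sketch of ours; the published algorithm uses spatial indexes rather than the full distance matrix computed here:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Grow clusters from core points (>= min_pts neighbours within eps);
    border points are claimed but not expanded; leftovers stay noise (-1)."""
    D = np.linalg.norm(X[:, None] - X[None], axis=2)
    neighbours = [np.flatnonzero(D[i] <= eps) for i in range(len(X))]
    labels = np.full(len(X), -1)        # -1 = unvisited or noise
    cluster = 0
    for i in range(len(X)):
        if labels[i] != -1 or len(neighbours[i]) < min_pts:
            continue                    # skip visited points and non-cores
        labels[i] = cluster
        frontier = list(neighbours[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbours[j]) >= min_pts:    # j is also a core point
                    frontier.extend(neighbours[j])   # density-reachable expansion
        cluster += 1
    return labels
```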

3.1.4 Grid-based methods


The idea of grid-based clustering methods is based on clustering-oriented query answering in multilevel grid structures. The upper level stores a summary of the information in its next level; thus the grids form cells between the connected levels, as illustrated in Figure 3-3.

Figure 3-3 The grid-cell structure of grid-based clustering methods



Many grid-based methods have been proposed, such as STING (Statistical Information Grid approach) [WYM97], CLIQUE [AGG+98], and the combined grid-density based technique WaveCluster [SCZ98]. Grid-based methods are efficient, clustering data with O(N) complexity. However, the primary issue of grid-based techniques is how to decide the size of the grid cells, which depends largely on the user's experience; the toy sketch below makes this parameter explicit.
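At its simplest, the grid idea is binning. The following toy sketch of ours (not any of the cited systems) keeps only the cells whose point count reaches a density threshold; cells_per_dim is exactly the grid-size decision discussed above:

```python
import numpy as np

def dense_grid_cells(X, cells_per_dim, threshold):
    """Bin points into an equal-width grid and return the dense cells.
    Merging adjacent dense cells into clusters is omitted for brevity.
    Assumes every attribute has a non-degenerate value range."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    width = (maxs - mins) / cells_per_dim
    cell_idx = np.clip(((X - mins) // width).astype(int), 0, cells_per_dim - 1)
    cells, counts = np.unique(cell_idx, axis=0, return_counts=True)
    return cells[counts >= threshold]
```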

3.1.5 Model-based clustering methods


Model-based clustering methods are based on the assumption that data are generated by a mixture of underlying probability distributions, and they optimize the fit between the data and some mathematical model, using for example statistical approaches, neural network approaches and other AI approaches. Typical techniques in this category are AutoClass [CKS+88], DENCLUE [HiK98] and COBWEB [Fis87]. When facing an unknown data distribution, choosing a suitable model from the candidates is still a major challenge. On the other hand, clustering based on probability suffers from high computational cost, especially when the scale of the data is very large.

Based on the above review, we can conclude that the application of clustering algorithms to detect grouping information in real-world data mining applications is still a challenge, primarily due to the inefficiency of most existing clustering algorithms in coping with arbitrarily shaped distributions of data in extremely large and high-dimensional datasets. Extensive survey papers on clustering techniques can be found in the literature [JaD88, KaR90, JMF99, EML01, XuW05, Ber06].


3.2 Cluster Validation


A large number of clustering algorithms have been developed to deal with specific applications [JMF99]. Several questions arise: Which clustering algorithm is best suited for the application at hand? How many clusters are there in the studied data? Is there a better cluster scheme? These questions are related to evaluating the quality of clustering results, that is, cluster validation. Cluster validation is a procedure for assessing the quality of clustering results and finding a fit cluster strategy for a specific application. It aims at finding the optimal cluster scheme and interpreting the cluster patterns [HBV02].

Cluster validation is an indispensable process of cluster analysis, because no clustering algorithm can guarantee the discovery of genuine clusters from real datasets, and different clustering algorithms often impose different cluster structures on a data set even if there is no cluster structure present in it [Gor98] [Mil96]. Cluster validation is needed in data mining to solve the following problems [HCN01]:

1. To measure a partition of a real data set generated by a clustering algorithm.
2. To identify the genuine clusters from the partition.
3. To interpret the clusters.

Generally speaking, cluster validation approaches are classified into the following three categories: internal approaches, relative approaches and external approaches [ALA+03]. We give a short introduction to the cluster validation methods below.

3.2.1 Internal criteria



Internal cluster validation is a method of evaluating the quality of clusters when "statistics are devised to capture the quality of the induced clusters using the available data objects only" [VSA05]. In other words, internal cluster validation excludes any information beyond the clustered data, and focuses on assessing the quality of clusters based on the data themselves. Statistical quality measures are employed as internal criteria; for example, the root-mean-square standard deviation (RMSSTD) is used for the compactness of clusters [Sha96], R-squared (RS) for the dissimilarity between clusters, and S_Dbw for a compound evaluation of compactness and dissimilarity [HaV01]. The formulas of RMSSTD, RS and S_Dbw are shown below.

$$\mathrm{RMSSTD}=\left(\frac{\sum_{i=1}^{n_c}\sum_{j=1}^{d}\sum_{k=1}^{n_{ij}}\left(x_k-\bar{x}_j\right)^2}{\sum_{i=1}^{n_c}\sum_{j=1}^{d}\left(n_{ij}-1\right)}\right)^{1/2} \tag{3-1}$$

where $\bar{x}_j$ is the expected value in the jth dimension; $n_{ij}$ is the number of elements in the ith cluster, jth dimension; $n_j$ is the number of elements in the jth dimension in the whole data set; $d$ is the number of dimensions; and $n_c$ is the number of clusters.

$$RS=\frac{SS_t-SS_w}{SS_t} \tag{3-2}$$

where $SS_w$ is the sum of squares within the clusters and $SS_t$ is the total sum of squares of the whole data set:

$$SS_t=\sum_{j=1}^{d}\sum_{k=1}^{n_j}\left(x_k-\bar{x}_j\right)^2 \tag{3-3}$$

The formula of S_Dbw is given as:

$$S\_Dbw(c)=Scat(c)+Dens\_bw(c) \tag{3-4}$$

where $Scat(c)$ is the average scattering within the $c$ clusters, defined as:

$$Scat(c)=\frac{1}{c}\sum_{i=1}^{c}\frac{\left\lVert\sigma(v_i)\right\rVert}{\left\lVert\sigma(S)\right\rVert} \tag{3-5}$$

The value of $Scat(c)$ is the degree to which the data points are scattered within clusters; it reflects the compactness of clusters. The term $\sigma(S)$ is the variance of the data set, and the term $\sigma(v_i)$ is the variance of cluster $c_i$. $Dens\_bw(c)$ indicates the average number of points between the $c$ clusters (i.e., an indication of inter-cluster density) in relation to the density within the clusters. The formula of $Dens\_bw$ is given as:

$$Dens\_bw(c)=\frac{1}{c(c-1)}\sum_{i=1}^{c}\sum_{\substack{j=1\\ j\neq i}}^{c}\frac{density(u_{ij})}{\max\left\{density(v_i),\,density(v_j)\right\}} \tag{3-6}$$

where $u_{ij}$ is the middle point of the line segment between the centres of clusters $v_i$ and $v_j$. The density function of a point is defined as the number of points around that point within a given radius.
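As an illustration of formulas (3-1) and (3-2), here is a small sketch of ours computing the two simpler indices, assuming numeric data in a NumPy array X and integer cluster labels:

```python
import numpy as np

def rmsstd(X, labels):
    """RMSSTD (3-1): pooled within-cluster standard deviation over all
    attributes; lower values indicate more compact clusters."""
    ss, dof = 0.0, 0
    for c in np.unique(labels):
        Xc = X[labels == c]
        ss += ((Xc - Xc.mean(axis=0)) ** 2).sum()
        dof += (len(Xc) - 1) * X.shape[1]        # sum of (n_ij - 1) over dims
    return np.sqrt(ss / dof)

def rs(X, labels):
    """RS (3-2): share of the total sum of squares explained by the
    partition; values near 1 indicate high between-cluster dissimilarity."""
    ss_t = ((X - X.mean(axis=0)) ** 2).sum()
    ss_w = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))
    return (ss_t - ss_w) / ss_t
```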

3.2.2 Relative criteria


Relative assessment compares two structures and measures their relative merit. The idea is "to run the clustering algorithm for a possible number of parameters (e.g., for each possible number of clusters) and identify the clustering scheme that best fits the dataset" [ALA+03]; that is, the clustering results are assessed by applying an algorithm with different parameters to a data set and finding the optimal solution. In practice, relative criteria methods also use RMSSTD, RS and S_Dbw to find the best cluster scheme, in terms of compactness and dissimilarity, among all the clustering results. Relative cluster validity is also called cluster
stability; recent work on relative cluster validity is presented in [KeC00, LeD01, BEG02, RBL+02, BeG03]. The parameter sweep at the heart of relative validation is sketched below.
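In code, relative validation is just a parameter sweep. The sketch below (ours) reuses the k_means and rmsstd sketches from the previous sections; since RMSSTD tends to decrease monotonically with k, one looks for a "knee" in the returned curve rather than its raw minimum:

```python
def sweep_k(X, k_range, cluster_fn, index_fn):
    """Run one clustering algorithm over a range of parameter values and
    return the validity index for each, e.g.:
        curve = sweep_k(X, range(2, 11), k_means, rmsstd)
    """
    curve = {}
    for k in k_range:
        labels, _ = cluster_fn(X, k)
        curve[k] = index_fn(X, labels)
    return curve
```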

3.2.3 External criteria


In external criteria, "the results of a clustering algorithm are evaluated based on a pre-specified structure, which reflects the user's intuition about the clustering structure of the data set" [HKK05]. As a necessary post-processing step, external cluster validation is a procedure of hypothesis testing: given a set of class labels produced by a cluster scheme, it is compared with the clustering results obtained by applying the same cluster scheme to other partitions of the database, as shown in Figure 3-4.

Figure 3-4 External criteria based validation [ZOZ07a]

External cluster validation is based on the assumption that an understanding of the output of the clustering algorithm can be achieved by finding a resemblance of the clusters with existing classes [Dom01], [KDN+96], [Ran71], [FoM83], [MSS83]. Statistical methods for quality assessment are employed in external cluster validation, such as the Rand statistic [Ran71], the Jaccard coefficient [Jac08], the Folkes and Mallows index [MSS83], Hubert's Γ statistic and the normalized Γ statistic [Thk99], and the Monte Carlo method
[Mil81], to measure the similarity between the a priori modelled partitions and the clustering results of a dataset. Extensive surveys on cluster validation can be found in the literature [JaD88, Mil96, Gor98, JMF99, Thk99, HBV01, HBV02, HKK05]. As an illustration, the simplest of these measures, the Rand statistic, is sketched below.
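The Rand statistic counts the point pairs on which two partitions agree; here is a direct O(n²) sketch of ours:

```python
from itertools import combinations

def rand_statistic(labels_a, labels_b):
    """Fraction of point pairs placed consistently by both partitions:
    either together in both or apart in both."""
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b      # True counts as 1
        total += 1
    return agree / total
```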

3.3 The issues of cluster analysis


From the survey of cluster analysis above, it is clear that two major drawbacks influence the feasibility of cluster analysis in real-world data mining applications. The first is the weakness of most existing automated clustering algorithms in dealing with arbitrarily shaped data distributions. The second is that the evaluation of the quality of clustering results by statistics-based methods is time consuming when the database is large, primarily due to the very high computational cost of statistics-based methods for assessing the consistency of cluster structures between sampling subsets. The implementation of statistics-based cluster validation methods does not scale well to very large datasets.

On the other hand, arbitrarily shaped clusters also make the traditional statistical cluster validity indices ineffective, which makes it difficult to determine the optimal cluster structure [HBV02]. In addition, the inefficiency of clustering algorithms in handling arbitrarily shaped clusters in extremely large datasets directly impacts the effectiveness of cluster validation, because cluster validation is based on the analysis of the clustering results produced by clustering algorithms.

Moreover, most of the existing clustering algorithms tend to deal with the entire clustering process automatically, i.e., once the user sets the parameters of the algorithms, the clustering
result is produced with no interruption, excluding the user until the end. As a result, it is very hard to incorporate user domain knowledge into the clustering process. Cluster analysis is an iterative process involving multiple runs; without user domain knowledge, it would be inefficient and unintuitive to satisfy the specific requirements of application tasks in clustering. Visualization techniques have proven to be of high value in exploratory data analysis and data mining [Shn01]. Therefore, introducing domain experts' knowledge supported by visualization techniques is a good remedy for these problems. A detailed review of the visualization techniques used in cluster analysis is presented in the next chapter.


CHAPTER 4 VISUAL CLUSTER ANALYSIS


As described in the last chapter, most of the existing automated clustering algorithms suffer in terms of efficiency and effectiveness when dealing with arbitrarily shaped cluster distributions in extremely large and multidimensional datasets [AAP+03]. Baumgartner et al. [BPR+04] concluded that "in high dimensional space, traditional clustering algorithms tend to break down in terms of efficiency as well as accuracy because data do not cluster well anymore". Another obstacle to the application of cluster analysis in data mining is the high computational cost of statistics-based cluster validation methods [HCN01]. These drawbacks limit the usability of clustering algorithms in real-world data mining applications.

To mitigate the above problems, visualization has been introduced into cluster analysis. As Card et al. [CMS99] described, visualization is "the use of computer-supported, interactive, visual representations of abstract data to amplify cognition". Visualization is considered one of the most intuitive methods for cluster detection and validation, and it performs especially well in representing irregularly shaped clusters. Visual data mining is the use of visualization techniques to allow data miners and analysts to evaluate, monitor and guide the inputs, products and process of data mining [GHK+96].

As a branch of visual data mining, visual cluster analysis is a combination of information visualization and cluster analysis techniques. In the cluster analysis process, visualization provides analysts with intuitive feedback on data distribution and supports decision-making activities. As a consequence, visual presentations can be very powerful in revealing trends,
highlighting outliers, showing clusters, and exposing gaps in data [Shn01]. Introducing visualization techniques to explore and understand high-dimensional datasets is becoming an efficient way to combine human intelligence with the immense brute-force computation power available nowadays [PGW03]. A large number of visualization techniques have been developed to map multidimensional datasets to two- or three-dimensional space [WTP+95, AhW95, HKW99, RSE99, RKJ+99, ABK+99, AEK00, SGB00, MRC02, FGW02, Shn01, PGW03, LKS+04]. In practice, most of them simply treat information visualization as a layout problem and are therefore not suitable for visualizing the clusters of very large datasets.

4.1 Multidimensional Data Visualization


Many efforts have been made on multidimensional (d>3) data visualization [OlL03]. However, most of these visual approaches have difficulty in dealing with high-dimensional and very large datasets. We give a more detailed discussion of them below.

4.1.1 Icon-based techniques


Icon-based presentations are among the older techniques for visual data mining. The idea of icon-based techniques is to map each multidimensional data item to an icon, for example [Pic70, Che73, Bed90, Lev91, KeK94, Hea95]. We explain several popular techniques below.

Chernoff Faces

A well-known iconic approach is Chernoff Faces [Che73]. The Chernoff Face technique uses two dimensions of the multidimensional data to locate a face in the two display dimensions. The remaining dimensions are mapped to the properties of the face icon, i.e., the shapes of the nose, mouth and eyes, and the shape of the face itself, as shown in Figure 4-1.


Chernoff face visualization capitalizes on the human sensitivity to faces and facial features. However, the number of data items that can be visualized using the Chernoff face technique is quite limited.

Figure 4-1. An example of Chernoff Faces

Stick Figures

Another famous icon-based technique uses stick figures to visualize larger amounts of data, so that an adequate number of data items can be presented for data mining purposes [Pic70][PiG88]. The stick figure technique uses two dimensions as the display dimensions, while the other dimensions are mapped to the angles and lengths of the limbs of the stick figure icon, as illustrated in Figure 4-2a. Different stick figure icons with variable dimensionality may be used, as shown in Figure 4-2b.

a. Stick Figure Icon

b. A Family of Stick Figures

Figure 4-2. Stick Figure Visualization Technique


Figure 4-3 shows the 1980 United States census data, which have five dimensions, visualized by the stick figure technique. In Figure 4-3, income and age are used as the display dimensions, and the other attributes (occupation, education level, marital status and sex) are visualized by the stick figures. However, it can be observed in Figure 4-3 that the user cannot easily understand and interpret the graph of stick figures without good training in advance.

Figure 4-3. Stick Figure Visualization of the Census Data

Many other icon-based systems have also been proposed, such as Shape Coding [Bed90], Color Icons [Lev91, KeK94], and TileBars [Hea95]. Icon-based techniques can display the multidimensional properties of data; however, as the amount of data increases, the user can hardly make sense of most properties of the data intuitively, because the user cannot focus on the details of each icon when the data scale is very large.

4.1.2 Pixel-oriented Techniques


Pixel-oriented visualization techniques map each attribute value of the data to a single colored pixel, displaying the most information possible at a time [KeK94, KKA95, Kei97,
Ank01]. With this technique, each data value is mapped to a colored pixel, and the data values belonging to one attribute are presented in separate windows, as displayed in Figure 4-4. Pixel-oriented techniques use various colour-mapping approaches, such as linear variation of brightness, maximum variation of hue (colour) and constant maximum saturation, to map each data value to a colored pixel and arrange the pixels adequately in the limited space. Pixel-oriented techniques are powerful in providing an overview of large amounts of data while preserving the perception of small regions of interest. This feature makes them suitable for a variety of data mining tasks on extremely large databases. A toy sketch of the value-to-pixel mapping follows.
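The sketch below (ours) uses a simple row-major arrangement; real systems use space-filling layouts such as the recursive pattern discussed later in this subsection:

```python
import numpy as np

def attribute_window(values, width):
    """Map one attribute to a pixel window: each value becomes one pixel
    whose grey level is the value normalized to [0, 1], laid out row by row."""
    v = (values - values.min()) / (values.max() - values.min())
    height = int(np.ceil(len(v) / width))
    window = np.zeros(height * width)
    window[:len(v)] = v
    return window.reshape(height, width)   # render one such window per attribute
```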

Figure 4-4. Displaying attribute windows for data with six attributes

Keim [KeK94] presented the first pixel-oriented technique in the VisDB system, which has the capability to represent large amounts of multidimensional data with respect to a given query. As a result, users are able to refine their queries based on the knowledge gathered from the visual representation of the data. Other pixel-oriented techniques have been developed, for example the Recursive Pattern technique [KKA95], the Circle Segments technique [AKK96], Spiral [KeK94], Axes [Kei97], PBC [AEK00] and OPTICS [ABK+99]. They have been successfully applied in data exploration of high-dimensional databases.


Figure 4-5. Illustration of the Recursive Pattern Technique

To present more data in a limited area, the recursive pattern pixel-oriented technique was proposed, based on a generic recursive schema [KeK94]. With the changeable parameters of the recursive schema, the user can control the semantically meaningful substructures that determine the arrangement of the attribute values, as presented in Figure 4-5. A use of VisDB for visualizing financial information is illustrated in Figure 4-6.

Figure 4-6. The Recursive Pattern Technique in VisDB [KeK94]

Pixel-oriented techniques, in particular, aim at representing datasets in their input (time) order according to one attribute, whereas clustering arranges data items with similar values close together based on distance/density functions according to similarity/dissimilarity measures.


As a result, close data items are coloured similarly by the pixel-oriented techniques but remain distributed in time-series order, which cannot reveal the internal structure of clusters very well. Therefore, pixel-oriented techniques are not well suited as visual representation methods in cluster analysis.

4.1.3 Geometric Techniques


The basic idea of geometric techniques is to use geometric transformations and projections of the data to produce useful and insightful visualizations. Geometric projection techniques aim at finding interesting projections of multidimensional data sets [Hub85, FrT74]. Typical systems using geometric techniques are Scatterplot-Matrices [And72, Che73], Parallel Coordinates [Ins85, InD90], Star Plots [Fie79], Landscapes [WTP+95], Projection Pursuit Techniques [Hub85], Prosection Views [FuB94, SDT+95] and Hyperslice [WiL93]. Several of them are introduced below.

Scatterplot-Matrices

Plot-based data visualization approaches such as Scatterplot-Matrices [Cle93] and similar techniques [AlC91] visualize data in rows and columns of cells containing simple graphical depictions.

Figure 4-7. Scatterplot-Matrices [Cle93]


This category of techniques provides bi-attribute visual information. An example of Scatterplot-Matrices is shown in Figure 4-7. The user can clearly observe the data distribution of each attribute pair. However, plot-based techniques do not provide a good overview of the whole dataset; as a result, they cannot present the clusters of datasets very well. Moreover, plot-based visual techniques do not perform well in presenting databases with a large number of dimensions, due to the physical size limitation of computer monitors.

Parallel Coordinates

A well-known multidimensional visualization technique, Parallel Coordinates, utilizes equidistant parallel axes to visualize each attribute of a given dataset, projecting multiple dimensions onto a two-dimensional surface [Ins97]. The axes correspond to the dimensions and are linearly scaled from the minimum to the maximum value of the corresponding dimension. Each data item is presented as a polygonal line, intersecting each of the axes at the point corresponding to its value in that dimension, as presented in Figure 4-8.
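A minimal sketch using the pandas parallel-coordinates helper, with the Iris data as a stand-in dataset; the min-max normalization puts all axes on the common [0, 1] scale described above.

```python
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_iris

# One polygonal line per record; each vertical axis is one attribute,
# min-max normalized so all axes share the [0, 1] range.
iris = load_iris(as_frame=True)
df = iris.frame
features = df.columns[:-1]
df[features] = (df[features] - df[features].min()) / (df[features].max() - df[features].min())
df["target"] = iris.target_names[iris.target]

parallel_coordinates(df, class_column="target", alpha=0.4)
plt.show()
```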

Figure 4-8. 15,000 coloured data items in Parallel Coordinates


Star Plots

Star Plots arranges the coordinate axes radially, with equal angles between neighbouring axes emanating from the centre of a circle, and links the data points on each axis by lines to form a star [SGF71]. An example of the Star Plots technique is presented in Figure 4-9. In principle, these two techniques can provide visual presentations of any number of attributes. However, neither Parallel Coordinates nor Star Plots can give the user a clear overall insight into the data distribution when the dataset is huge, primarily due to the unavoidably high overlap between data points. Another drawback of these two techniques is that, although they provide an intuitive visual relationship between neighbouring axes, the visual presentation of non-neighbouring axes may confuse the user's perception. These obstacles make them unsuitable for visualizing cluster structure in very large and high-dimensional databases.

Figure 4-9. Star plots of data items [SGF71]

An integrated multidimensional data visualization system, XmdvTool, has been proposed, which links alternative views, such as scatterplot matrices, parallel coordinates, star plots, and dimensional stacking, by brushing [War95]. Many other techniques have been introduced for multidimensional data visualization [OlL03]. However, most of them either suffer from weaknesses in visualizing large numbers of data items and higher dimensional data, or hardly provide a clear perception of clusters in visual form to the user. The multidimensional data visualization techniques developed earlier can be found in the literature [BeR78, Fie79, HoG01, Che07, FrD07]. In the last decade, many efforts have been made to use visualization techniques to assist data miners in finding cluster patterns in data. A survey of these visualization techniques is presented below.

4.2 Visual Cluster Analysis


Visual cluster analysis, as the term implies, combines information visualization and cluster analysis techniques. With the wide application of cluster analysis in data mining, many visualization techniques have been employed to study the structure of datasets in cluster analysis applications. Reviews of these works can be found in the literature [AnK01, HoG01, Kei02, OlL03, MiG04]. Several representative visualization techniques that are especially important in cluster analysis are discussed below.

4.2.1 MDS and PCA


Multidimensional scaling (MDS) maps multidimensional data as points in a 2D Euclidean space, where the distances between data points reflect their similarity/dissimilarity [KrW78], as illustrated in Figure 4-10. However, the relatively high computational cost of MDS (quadratic time, O(N²)) limits its usability on very large datasets.


Figure 4-10. Clustering of 1352 genes in MDS by [Bes]

Principal Component Analysis (PCA) is a mathematical procedure that transforms a number of (possibly) correlated variables of a higher dimensional space into a smaller number of uncorrelated variables called principal components [Jol02]. PCA first has to find the correlated variables in order to reduce the dimensionality, which restricts its performance in the exploration of unknown data.
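For illustration, both projections are available off the shelf; the sketch below maps the Iris data (a stand-in for the gene data of Figure 4-10) to 2D with scikit-learn's MDS and PCA.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import MDS

X, y = load_iris(return_X_y=True)

# MDS: place points in 2D so pairwise distances approximate the original
# dissimilarities (the O(N^2) distance matrix is what makes it costly).
mds_xy = MDS(n_components=2, random_state=0).fit_transform(X)

# PCA: linear projection onto the two directions of maximal variance.
pca_xy = PCA(n_components=2).fit_transform(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(mds_xy[:, 0], mds_xy[:, 1], c=y); ax1.set_title("MDS")
ax2.scatter(pca_xy[:, 0], pca_xy[:, 1], c=y); ax2.set_title("PCA")
plt.show()
```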

4.2.2 HD-Eye
HD-Eye is an interactive visual clustering system based on density plots of any two interesting dimensions [HKW03]. It projects any two dimensions of the multidimensional data as density plots to investigate interesting grouping clues. HD-Eye employs icons to represent the clusters and the relationships between them, as shown in Figure 4-12.


Figure 4-12. The framework of the HD-Eye system and its different visualization projections

HD-Eye provides the user with a rough structure of high-dimensional data in visual representations. However, the user can hardly synthesize all of the interesting 2D projections to find the general pattern of the clusters.

Figure 4-13. The 3D data structures in HD-Eye and their intersection trails on the planes


HD-Eye has also been extended to use 3D techniques to visualize data as mountain-like structures, using planes that intersect the 3D graphs to present the trails of the graphs at different levels in 2D form [HKW99], as illustrated in Figure 4-13. But the kernel-density-based formation of the 3D graphs in HD-Eye limits its use in the interactive cluster detection of large datasets.

4.2.3 Grand Tour


The Grand Tour technique uses a series of projections to map multidimensional data onto 2D spaces spanned by pairs of orthogonal vectors, in order to obtain different perspectives of the data [Asi85]. To reduce the huge search space effectively, Projection Pursuit has been introduced to help the user investigate the interesting projections [CBC+95]. The projection of Grand Tour with Projection Pursuit is illustrated in Figure 4-14a. However, because Grand Tour systems perform many projections and complicated computation, their visualization models are not intuitive to users. Several extensions based on the Grand Tour technique have been proposed. For example, Yang implemented a 3D version of the Grand Tour technique to project data in animations [Yan03], but the complex 3D graph formation of this technique limits its use in large-scale data visualization. An example of Yang's Grand Tour based visualization is presented in Figure 4-14b. Dhillon et al. proposed a technique to visualize cluster structure [DMS98], but their technique visualizes only 3 clusters; it requires a more sophisticated Grand Tour technique to deal with more than 3 clusters.


a. The projections of the Grand Tour technique

b. Grand Tour based 3D animation by [Yan03]

Figure 4-14. The Grand Tour Technique and its 3D example

4.2.4 Hierarchical BLOB


Based on the hierarchical clustering and visualization algorithm H-BLOB, Sprenger et al. presented a technique for visualizing hierarchical clusters as nested blobs [SBG00], as shown in Figure 4-15.

Figure 4-15. Cluster hierarchies are shown for 1, 5, 10 and 20 clusters [SBG00]

The most significant feature of their technique is that H-BLOB not only provides an overview of the whole dataset as blobs, but also gives a detailed visual representation of the lower-level clusters. Exhibiting clusters in the form of blobs makes H-BLOB a very intuitive and easily understood visual presentation. However, the high visual complexity of the two-stage formation of the blob graphs makes it unsuitable for cluster visualization of very large datasets.

4.2.5 SOM
Kaski et al. employ the Self-Organizing Map (SOM) technique [Koh97] to project multidimensional data sets onto 2D space for matching visual models [KSP01]. Technically, in their method, a data sample is mapped onto a bar graph, and the graph is then compared with all existing vector models (also shown as bar graphs) to find the best match, as shown in Figure 4-16, where the bar graphs in the rectangular region are the existing models and Xk is the sample bar graph.

Figure 4-16. Model matching with SOM by [KSP01]

However, the traversal matching process is time-consuming. Moreover, the SOM technique is based on a single projection strategy, which is not powerful enough to discover all the interesting features of the original data. Another drawback of this technique is that, with an increasing number of dimensions, the bar graphs become wider; as a result, the user cannot easily observe the matched cluster model intuitively.

4.2.6 FastMap
Huang et al. proposed approaches based on FastMap [FaL95] to assist users in identifying and verifying the validity of clusters in visual form [HCN01, HuL00]. Their techniques work well in cluster identification, but cannot evaluate cluster quality very well. Moreover, these techniques visualize clusters statically and do not always present the genuine cluster structure. As a consequence, they do not provide enough information for either clustering or cluster validation.

4.2.7 OPTICS
OPTICS uses a density-based technique to detect cluster structure and visualizes it as Gaussian bumps [ABK+99]. It is an intuitive method for helping the user observe cluster structures, but its non-linear time complexity makes it unsuitable both for dealing with very large data sets and for providing contrasts between clustering results, as shown in Figure 4-17.

Figure 4-17. Data structure mapped to Gaussian bumps by OPTICS [ABK+99]

OPTICS also visualizes clusters in a 1D manner [ABK+99]. It works well in finding basic arbitrarily shaped clusters, as presented in Figure 4-18. However, it lacks the ability to help the user understand inter-cluster relationships.


Figure 4-18. Clustering structure of 30,000 16-dimensional data items visualized by OPTICS [ABK+99]

4.2.8 Star Coordinates and VISTA


The approach most relevant to this thesis is the Star Coordinates technique [Kan01]. The idea of Star Coordinates is intuitive: it extends the perspective of the traditional orthogonal 2D X-Y and 3D X-Y-Z coordinate techniques to higher dimensional spaces. Star Coordinates divides a 2D plane into n equal sectors with n coordinate axes, where each axis represents a dimension and all axes share their initial points at the centre of a circle on the 2D space [Kan01]. First, the data in each dimension are normalized into the [0, 1] interval. Then the values on all axes are mapped to orthogonal X-Y coordinates that share the initial point with the Star Coordinates on the 2D space. Thus, an n-dimensional data item is expressed as a point on the X-Y 2D plane. Figure 4-19 illustrates the mapping from 8 Star Coordinates axes to X-Y coordinates.


Figure 4-19. Positioning a point by an 8-attribute vector in Star Coordinates [Kan01]

Star Coordinates provides users with the ability to apply various transformations dynamically, integrate and separate dimensions of interest, analyze correlations of multiple dimensions, and view clusters, trends, and outliers in the distribution of data. Formula (4-1) gives the mathematical description of Star Coordinates:
$$p_j(x, y) = \left( \sum_{i=1}^{n} \vec{u}_{x_i}\,(d_{ji} - \min_i),\; \sum_{i=1}^{n} \vec{u}_{y_i}\,(d_{ji} - \min_i) \right) \qquad (4\text{-}1)$$

where $p_j(x, y)$ is the normalized location of $D_j = (d_{j1}, d_{j2}, \ldots, d_{jn})$; $d_{ji}$ is the value of the $j$th record of the data set on the $i$th coordinate $C_i$ in Star Coordinates space; $\vec{u}_{x_i}(d_{ji} - \min_i)$ and $\vec{u}_{y_i}(d_{ji} - \min_i)$ are the unit-vector components of $d_{ji}$ mapped to the X and Y directions; $\min_i = \min(d_{ji}, 0 \leq j < m)$ and $\max_i = \max(d_{ji}, 0 \leq j < m)$ are the minimum and maximum values of the $i$th dimension respectively; and $m$ is the number of records in the data set. As formula (4-1) shows, the computational complexity of the Star Coordinates projection is linear. Therefore, Star Coordinates based techniques are well suited to the interactive visualization and analysis of clusters.
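To make the projection concrete, the following is a minimal numpy sketch of formula (4-1), with the per-dimension [0, 1] normalization described above; the example data are hypothetical and the function name star_coordinates is ours, not from [Kan01].

```python
import numpy as np

def star_coordinates(data):
    """Map an (m x n) data matrix to 2D points by formula (4-1): n unit
    vectors at equal angles, each record summed along them after
    per-dimension min-max normalization into [0, 1]."""
    data = np.asarray(data, dtype=float)
    n = data.shape[1]
    angles = 2 * np.pi * np.arange(n) / n
    u = np.column_stack([np.cos(angles), np.sin(angles)])  # n axis unit vectors
    mins, maxs = data.min(axis=0), data.max(axis=0)
    normalized = (data - mins) / np.where(maxs > mins, maxs - mins, 1)
    return normalized @ u  # (m x 2): one 2D point per record

# Example: positioning 5 records with 8 attributes, as in Figure 4-19.
points = star_coordinates(np.random.default_rng(1).random((5, 8)))
print(points)
```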

Interactive Functions of Star Coordinates

Star Coordinates provides various interactive functions to stimulate visual thinking in the early stages of the knowledge discovery process. Those functions include scaling axes (see Figure 4-20-a); rotating the angle between axes (see Figure 4-20-b); marking the data points in a certain area by colouring; selecting data value ranges on one or more axes and marking the corresponding data points in the visualization; presenting histograms of selected clusters; and footprints for tracing the trails of data points (see Figure 4-20-c) [Kan01].

a. Axis scaling of the name attribute of auto-mpg data

b. Angle rotation of the attributes of auto-mpg data

c. Footprints of axis scaling of the weight and mpg attributes of auto-mpg data

Figure 4-20. Axis scaling, angle rotation and footprint functions of Star Coordinates [Kan01]

The Star Coordinates based techniques

Based on the idea of Star Coordinates, Chen and Liu proposed an approach named α-mapping which, instead of normalizing the data in each dimension into the [0, 1] interval as Star Coordinates does, normalizes the data into the [-1, 1] interval of each dimension, giving more expressive space for axis scaling in their VISTA and iVIBRATE systems [ChL04, ChL06]. Moreover, Chen and Liu also discussed using their approach to refine and verify clusters by VISTA/iVIBRATE [ChL06]. Shaik and Yeasin addressed a 3D version of Star Coordinates to provide users with a more intuitive 3D environment for observing cluster structure [ShY06]. Ma and Teoh employed the Star Coordinates technique in their StarClass system for data classification by visualization [MaT03]. A very similar technique to Star Coordinates, RadViz, was presented by [HGM97], but its non-linear mapping is an obstacle for RadViz to be employed as an interactive tool for cluster analysis of very large databases.

The issues of the existing Star Coordinates based visualization techniques

The existing Star Coordinates based visualization techniques employed in cluster analysis tend to be used as information rendering tools, and do not perform well on verifying the validity of clustering results. Moreover, the exploration-oriented characteristics of these techniques inevitably lead them to be random and imprecise in the process of cluster detection and validation. Chen and Liu combined clustering algorithms with their visualization technique VISTA/iVIBRATE to observe the cluster structures of datasets, refine the quality of clusters produced by clustering algorithms, and validate clusters [ChL04, ChL05]. However, the data observation based on the α-mapping (α-adjustment) of their approach is still a randomly exploratory process, which inevitably suffers from subjectivity and randomness. In addition, VISTA adopts landmark points as representatives of a clustered subset and re-samples them to deal with cluster validation [ChL04]. But its experience-based landmark point selection does not always handle the scalability of data well, because landmark points that are well representative in one subset may fail in other subsets of a database.


4.3 Major Challenges


Visualization is considered as a collection of transformations from the problem domain to the representation domain [GMH+94]. A more practical and effective approach to cluster visualization is to incorporate all the available clustering information, for example algorithmic clustering results and domain knowledge, into visual cluster exploration.

4.3.1 Requirements of Visualization in Cluster Analysis


From the above analysis, we can summarize that visualization techniques to be used in cluster analysis should be able to handle several important aspects of visual perception:

1. Visualizing large and multidimensional datasets;
2. Providing a clear overview and detailed insight of cluster structure;
3. Having linear time complexity for data mapping from higher dimensional space to lower dimensional space;
4. Supporting interactive cluster visual representation dynamically;
5. Involving the knowledge of domain experts in cluster exploration;
6. Giving data miners purposeful and precise guidance for cluster investigation and cluster validation, rather than simply random cluster exploration.

As discussed above, most existing cluster visualization techniques work well on visualizing multidimensional data sets. However, as the size and dimensionality of data sets increase, these techniques do not perform well on very large data visualization, can hardly deal with the visual representation of higher dimensional data, and cannot provide an intuitive overview of cluster structure. In short, they do not satisfy all of the above requirements.

4.3.2 Motivation


A question arises: which visualization technique can provide a genuine representation of the cluster structure of data? In practice, few visualization techniques can achieve the above requirements. As Seo and Shneiderman pointed out, "A large number of clustering algorithms have been developed, but only a small number of cluster visualization tools are available to facilitate researchers' understanding of the clustering results" [SeS05]. How to preserve the identity of the problem domain and the representation domain in visualization is the critical challenge of cluster visualization. Star Coordinates based techniques are a good choice for cluster visualization, because they meet almost all of the considerations above, except the last one. Simple static visualization is not sufficient for visualizing clusters [Kei01, Shn02], and it has been shown that clusters can hardly be satisfactorily preserved in a static visualization [CBC+95, DMS98]. With the feature of linear-time transformation/projection, Star Coordinates based techniques are powerful for large-scale data visualization, especially for interactive and dynamic cluster visualization. But the random and subjective characteristics of these techniques hinder their effectiveness and efficiency in real-world applications. The main motivation of this thesis is to provide effective and purposeful visual guidance to data miners in cluster analysis.

4.4 Our Approach


In this section, we briefly describe a novel approach called HOV3 to address the challenges presented above. As this is a publication-based thesis, detailed discussions of the work can be found in the cited papers.


4.4.1 HOV3 Model


Visualization is typically employed as an observational mechanism to assist users with intuitive comparisons and a better understanding of the studied data. Instead of precisely contrasting clustering results, most of the existing visualization techniques employed in cluster analysis focus on providing the user with an easy and intuitive understanding of the cluster structure, or explore clusters randomly. In general, it is not easy to visualize multidimensional data sets on 2D space and give a genuine visual interpretation, because mapping multidimensional data onto 2D space inevitably introduces overlapping and bias. To mitigate the problem, Star Coordinates based techniques provide some visual adjustment mechanisms [Kan01, ChL04, ChL06]. However, the stochastic adjustment of Star Coordinates and VISTA limits their usability in cluster analysis. To overcome the arbitrary and random adjustments of Star Coordinates and its extensions, Zhang et al. proposed a hypothesis-oriented visual approach (Hypothesis Oriented Verification and Validation by Visualization), HOV3 in short, to detect clusters [ZOZ+06, ZOZ06]. The idea of HOV3 is that, in analytical geometry, the difference between a data set (a matrix) $D_j$ and a measure vector $M$ with the same number of variables as $D_j$ can be represented by their inner product, $D_j \cdot M$. HOV3 uses a measure vector $M$ to represent the corresponding axes' weight values. Given a non-zero measure vector $M$ in $\mathbb{R}^n$ and a family of vectors $P_j$, the projection of $P_j$ against $M$ in the complex number system, i.e. the HOV3 model, is presented as:

$$P_j(z_0) = \sum_{k=1}^{n} \left[ \frac{d_{jk} - \min(d_k)}{\max(d_k) - \min(d_k)} \cdot z_0^{k} \cdot m_k \right] \qquad (4\text{-}2)$$


where $\min(d_k)$ and $\max(d_k)$ represent the minimal and maximal values of the $k$th dimension respectively, and $m_k$ is the $k$th attribute of the measure $M$.

The aim of the interactive adjustments of Star Coordinates and its extensions is to obtain some separated groups, or a fully separated clustering result, by tuning the weight value of each axis (axis scaling in Star Coordinates, α-adjustment in VISTA/iVIBRATE); but their arbitrary and random adjustments limit their applicability. As shown in formula (4-2), HOV3 summarizes these adjustments as a coefficient/measure vector. Comparing formulas (4-1) and (4-2), it can be observed that HOV3 subsumes the Star Coordinates model [ZOZ06]. Thus the HOV3 model provides the user with a mechanism to quantify a hypothesis/prediction about a data set as a measure vector of HOV3 for precisely exploring grouping information.
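The following is a minimal numpy sketch of formula (4-2). It assumes, consistently with the equal-angle Star Coordinates axes, that $z_0$ is the $n$th root of unity, so the $k$th axis direction is the complex number $z_0^k$; the function name hov3 and the example data are ours.

```python
import numpy as np

def hov3(data, measure):
    """Project an (m x n) data matrix to 2D by formula (4-2): each record is
    the sum of its min-max normalized attribute values, weighted by the
    measure vector, along n complex axis directions z0^k."""
    data = np.asarray(data, dtype=float)
    measure = np.asarray(measure, dtype=float)
    n = data.shape[1]
    z0 = np.exp(2j * np.pi / n)                  # n-th root of unity
    axes = z0 ** np.arange(1, n + 1)             # complex axis directions
    mins, maxs = data.min(axis=0), data.max(axis=0)
    normalized = (data - mins) / np.where(maxs > mins, maxs - mins, 1)
    p = normalized @ (axes * measure)            # one complex point per record
    return np.column_stack([p.real, p.imag])

# With an all-ones measure, HOV3 reduces to the plain Star Coordinates
# layout; a non-uniform measure encodes a prediction that reweights axes.
data = np.random.default_rng(2).random((100, 6))
xy_plain = hov3(data, np.ones(6))
xy_weighted = hov3(data, [1.0, 0.2, 1.0, 0.2, 1.0, 0.2])
```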

4.4.2 External Cluster Validation by HOV3


With the quantified measurement feature of HOV3, an external cluster validation method based on distribution matching is proposed to verify the consistency of cluster structures between a clustered subset and non-clustered subsets of a dataset [ZOZ07a]. The idea of this approach is based on the assumption that if two same-sized data sets have a similar cluster structure, then after applying the same linear transformation to both of them, the similarity of the newly produced distributions of the two sets will still be high. This approach employs a clustered subset of a database as a visual model (classifier) to verify the similarity of cluster structures between the model and other same-sized non-clustered subsets of the database, by projecting them together in HOV3. Technically, the user first separates each overlapped cluster individually by axis scaling or M-mapping in HOV3. Then the data points in the separated cluster and the data points they geometrically cover in the non-clustered subset, called a quasi-cluster, are picked up. Finally, instead of using statistical methods to assess the similarity between the two subsets, this approach simply computes the overlap rate between the clusters and their quasi-clusters to show their consistency. Compared with statistics-based validation methods, distribution-matching based external cluster validation is not only visually intuitive, but also more effective in real applications [ZOZ07a].
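A rough sketch of the overlap computation follows. It assumes that a point of the non-clustered subset is "covered" by a cluster if it lies within a small distance eps of some projected cluster member, and it reports the size ratio of the quasi-cluster to the cluster as a crude overlap rate; the exact geometric covering test used by the thesis is given in [ZOZ07a].

```python
import numpy as np
from scipy.spatial import cKDTree

def quasi_cluster_overlap(cluster_xy, subset_xy, eps=0.05):
    """Collect the quasi-cluster (points of the non-clustered subset within
    distance eps of some projected cluster member) and return a crude
    overlap rate: quasi-cluster size over cluster size."""
    dist, _ = cKDTree(cluster_xy).query(subset_xy, k=1)
    quasi = subset_xy[dist <= eps]
    return len(quasi) / len(cluster_xy), quasi

# cluster_xy and subset_xy would be HOV3 projections (with the same measure)
# of a separated cluster and of a same-sized non-clustered subset.
```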

4.4.3 Enhancing the Separation of Clusters by HOV3


To assist data miners in investigating cluster clues effectively, an approach called M-HOV3/M-mapping is introduced to enhance the separation of data groups in cluster analysis by HOV3 [ZOZ07b, ZOZ07c]. The paper [ZOZ07c] presents a mathematical proof of the following property: if it is observed that several groups of data points can be roughly separated (with ambiguous points existing between groups) by projecting a data set with a measure vector in HOV3, then applying M-HOV3/M-mapping with that measure vector to the data set leads to the groups becoming more contracted, in other words, to a better separation of the groups. This feature is significant for identifying the membership formation of clusters in the process of cluster exploration and cluster verification, because the contracting feature of M-HOV3/M-mapping keeps the data points within a cluster relatively closer, i.e., grouping information is preserved, while the enhanced separation feature of M-HOV3/M-mapping pushes far-apart data points relatively further away. With the advantage of the enhanced separation and contraction features of M-HOV3/M-mapping, the user can identify the cluster number efficiently in the pre-processing stage of clustering, and also verify the membership formation of data points among the clusters effectively in the post-processing stage of clustering.
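A small sketch of the M-mapping idea follows, under the assumption (our reading of [ZOZ07b, ZOZ07c]) that M-mapping applies the same measure vector s times, i.e. it projects with the element-wise power measure**s; the function name m_hov3 is ours.

```python
import numpy as np

def m_hov3(data, measure, s=2):
    """Project with the measure vector applied s times (m_k ** s); assuming
    this reading of M-mapping, rough groups contract internally while
    distant groups move relatively further apart."""
    data = np.asarray(data, dtype=float)
    n = data.shape[1]
    axes = np.exp(2j * np.pi / n) ** np.arange(1, n + 1)  # complex axis directions
    mins, maxs = data.min(axis=0), data.max(axis=0)
    norm = (data - mins) / np.where(maxs > mins, maxs - mins, 1)
    p = norm @ (axes * np.asarray(measure, dtype=float) ** s)
    return np.column_stack([p.real, p.imag])
```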

4.4.4 Prediction-based Cluster Analysis by HOV3


Having a precise overview of the data distribution in the early stages of data mining is important, because correct insights from the data overview help data miners decide on appropriate algorithms for the forthcoming analysis stages. Exploration discovery (qualitative analysis) is regarded as a pre-processing step for verification discovery (quantitative analysis), and is mainly used for building user predictions based on cluster detection or other techniques. It is an iterative process under the guidance of the user's domain knowledge, not an aimless and/or arbitrary process: in each iteration of exploration discovery, the user's feedback provides new insights and enriches their domain knowledge of the dataset they are dealing with.

Predictive exploration is a mathematical description of future behaviour based on the historical exploration of patterns. The goal of predictive visual exploration by HOV3 is that, by applying a prediction (measure vector) to a dataset, the user may identify the groups from the resulting visualization. Thus the key issue in applying HOV3 to detect grouping information is how to quantify historical patterns (or the user's domain knowledge) as a measure vector that achieves this goal. Formula (4-2) is a standard form of linear transformation of n variables, where $m_k$ is the coefficient of the $k$th variable of $P_j$. In principle, any measure vector, even one in complex number form, can be introduced into the linear transformation of HOV3 if it can separate a data set into groups or produce well-separated clusters visually. For example, grouping vectors obtained randomly by axis scaling in HOV3 or by M-mapping/M-HOV3, or statistical measurements that reflect the characteristics of the studied data set, can be introduced as predictions in the HOV3 projection.

With the quantified measurement of HOV3 and the enhanced separation feature of M-mapping/M-HOV3, the user not only can summarize their historically explored knowledge about datasets as predictions, but can also directly introduce the abundant statistical measurements of the studied data as predictions to investigate cluster clues or to refine clustering results effectively [ZOZ07c, ZOZ07d]. In fact, prediction-based cluster detection by statistical measurements in HOV3 is a more purposeful cluster exploration, and it gives an easier geometrical interpretation of the data distribution. In addition, with statistical predictions in HOV3 the user may even expose cluster clues that are not easy to find by random cluster exploration.

Separating clusters from a mass of overlapping data points is an aim of this thesis. Based on work such as M-HOV3/M-mapping and HOV3 with statistical measurements, any measure that results in fully separated clusters can be treated as a prediction to be introduced into external cluster validation based on data distribution matching by HOV3. In principle, any linear transformation, even a complex linear transformation, can be employed in HOV3 if it can separate clusters well. With well-separated clusters, the efficiency of external cluster validation by HOV3 may be improved [ZOZ07c, ZOZ07d].
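As a small usage sketch, reusing the hov3 function from the sketch in Section 4.4.1: a statistical measurement of the data, here the per-dimension standard deviation (one plausible choice of statistical prediction, not necessarily the one used in [ZOZ07c, ZOZ07d]), is plugged in directly as the measure vector.

```python
import numpy as np

# A statistical measurement of the studied data used as a prediction:
# the per-dimension standard deviation becomes the HOV3 measure vector.
data = np.random.default_rng(3).random((500, 6))
projected = hov3(data, data.std(axis=0))   # hov3 as sketched in Section 4.4.1
```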


CHAPTER 5
CONCLUSION AND FUTURE WORK
5.1 Conclusion
This thesis has proposed a novel visual approach called HOV3, Hypothesis Oriented Verification and Validation by Visualization, to assist data miners in the cluster analysis of high-dimensional datasets. HOV3 provides data miners with an effective mechanism to introduce their quantified domain knowledge as predictions in the cluster exploration process, revealing the gaps between the data distribution and the predictions. As a result, using HOV3 to investigate cluster clues in very large and high-dimensional datasets is more efficient and purposeful.

This thesis has also proposed a visual cluster validation approach based on distribution matching, supported by the projection mechanism of HOV3. This approach is based on the assumption that if a measure vector is used to project data sets with similar cluster structures, the similarity of the changes in their data distribution behaviour should be high. By comparing the data distributions of a clustered subset and non-clustered subsets projected by HOV3 with measures, data miners can intuitively make a visual assessment, and can also obtain a precise evaluation of the consistency of the cluster structure by performing geometrical computation on the data distributions. Compared with existing visual techniques for cluster validation, it has been observed that this approach is not only efficient in performance, but also effective in real-world applications.



Based on the projection technique of HOV3, a visual approach called M-HOV3/M-mapping has also been introduced to enhance the visual separation of clusters. The visual separability of clusters is significant for cluster analysis: the full geometrical separation of clusters is beneficial both for revealing the membership formation of clusters and for verifying the validity of clustering results. With M-HOV3/M-mapping, data miners can both explore cluster distribution intuitively and verify clustering results effectively by matching the geometrical distributions of clustered and non-clustered subsets produced by M-HOV3/M-mapping.

Experiments show that the HOV3 technique can improve the effectiveness of cluster analysis by visualization. HOV3 can be seen as a bridge between qualitative analysis and quantitative analysis. It not only supports the verification and validation of quantified domain knowledge, but also directly utilizes the abundant statistical measurements of the studied data as predictions, in order to give data miners effective guidance for obtaining more precise cluster information in data mining. As a consequence, with the advantage of the quantified measurement feature of HOV3, data miners can identify the cluster number efficiently in the pre-processing stage of clustering, and also verify and refine the membership of data points among the clusters effectively in the post-processing stage of clustering. We believe the application of HOV3 will be fruitful.

5.2 Future Work


This thesis has addressed the challenges of introducing visualization techniques into cluster analysis in data mining, and has proposed a visual technique called HOV3 to mitigate the problems in visual cluster analysis. However, there are still some open research issues worth future effort.

5.2.1 Three Dimensional HOV3


This thesis has introduced quantified measures as predictions with HOV3 to detect cluster clues and verify clustering results when clustering large datasets that automated clustering algorithms cannot handle effectively. So far, HOV3 projects high-dimensional data onto 2D space [ZOZ06]. 3D visualization can provide more intuition and also more information about the studied data [Rei95]. However, most of the existing 3D visual techniques involved in cluster analysis are density-based or metaphor-based [KOC+04]. They suffer from the high computational cost of composing 3D graphs of clusters, which prevents them from being applied to 3D cluster investigation in very large databases, especially 3D interactive cluster exploration [Yan03]. Recently, Shaik and Yeasin proposed a 3D visualization model based on the Star Coordinates technique [ShY06]; however, the relatively complex projection of their 3D formation is a drawback for the 3D visualization of large datasets.

In fact, with the two known orthogonal vectors in HOV3, composing a third dimensional vector is not hard; the 3D visual presentation of the data in HOV3 can then be produced by linear combinations of the three vectors. Owing to the linear time complexity of the HOV3 projection, the 3D HOV3 projection also runs in linear time. Thus data miners may grasp cluster clues from the studied datasets more effectively through interactive 3D HOV3 exploration.

5.2.2 Dynamic Visual Cluster Analysis


Dynamic clustering, also called stepwise clustering, is a kind of iterative clustering method based on distance [ZHW+03]. Dynamic clustering aims to study the changes in group behaviour and to revise the clusters dynamically along with the cluster exploration process, even revising the criteria of clustering. It treats data grouping as a cluster analysis over time series [BeC96, AbM98, CHS04]. In each clustering iteration, clustering algorithms sample at the time series points and revise the formation of clusters dynamically by the given criteria. However, the existing clustering algorithms do not perform well on datasets with arbitrarily shaped data distributions, and the very high computational cost of statistics-based cluster validation methods limits their usability in real-time applications.

Based on the HOV3 model, we have proposed a cluster validation method based on distribution matching in this thesis [ZOZ07a]. This approach can provide a solution to the above problem, because it only calculates the overlap rate between the classifier (a clustered subset of a dataset) and the data points it geometrically covers. It is much quicker than the existing statistics-based cluster validation methods [ZOZ07a]. For revising the clustering criteria, the new clustering criteria can be generated automatically from the density function of the data points in the overlapped area.

5.2.3 Quasi-Cluster Data Points Collection


Based on the quantified measurement feature of HOV3, an external cluster validation method based on distribution matching has been proposed in this thesis to verify the consistency of cluster structures between a clustered subset and non-clustered subsets of a dataset [ZOZ07a]. But, so far, the quasi-cluster points are picked up manually, by geometrical intuition. The Newton method of data analysis could be introduced into quasi-cluster point collection; the Newton method is an efficient approach for finding the neighbouring points of a given point [Smi86]. This would improve the accuracy and effectiveness of quasi-cluster point collection in HOV3.
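As one illustration of automating the collection, the sketch below uses a k-d tree radius query to gather the neighbouring points of a given projected point; this is a plain neighbour search standing in for the Newton-method variant cited from [Smi86], and all names and data in it are ours.

```python
import numpy as np
from scipy.spatial import cKDTree

# Gather all projected points within radius r of a chosen point; these
# candidates would seed the automatically collected quasi-cluster.
points = np.random.default_rng(4).random((1000, 2))   # stand-in projection
tree = cKDTree(points)
neighbour_idx = tree.query_ball_point(points[0], r=0.05)
neighbours = points[neighbour_idx]
```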

5.2.4 Combination of Fuzzy Logic Approaches and HOV3


Fuzzy clustering is an active branch of cluster analysis [OlP07]. Instead of each data point being assigned to exactly one cluster, in fuzzy (soft) clustering data points can belong to more than one cluster [Sim93]. Data points are associated with different grades for different clusters; the grade of a data point indicates the degree of nearness of its relationship to a cluster. However, when fuzzy clustering algorithms deal with dynamic clustering applications, the cost of recomputing the grades of membership associated with the clusters is very high [BLO+03]. In the fuzzy clustering proposed in [BaB99], each data point has a vector V(1...K) associated with the K clusters [Bez81]. In this thesis, the HOV3 model has been proposed to assist data miners in cluster investigation and verification [ZOZ+06, ZOZ06]; the HOV3 model is presented as formula (8) in [ZOZ06]. There, the measure coefficient m_k of the kth dimension can be combined with the associated grade of each data point. Thus, with a colour mapping function [Fai98], HOV3 could provide a very intuitive visual presentation of the membership of each data point, since the closest data points are coloured similarly. This approach would be very helpful for data miners in identifying the membership formation of clusters during interactive cluster exploration.
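A small sketch of such a membership-coloured presentation follows; it assumes a membership matrix from some fuzzy clustering run (here randomly generated for illustration) and blends per-cluster colours by the grades, so points shared between clusters take intermediate colours.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
xy = rng.random((300, 2))                     # stand-in for a HOV3 projection
memberships = rng.dirichlet(np.ones(3), 300)  # grades over K=3 clusters, rows sum to 1
cluster_colours = np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0]])

# Each point's colour is the membership-weighted blend of its cluster colours.
plt.scatter(xy[:, 0], xy[:, 1], c=memberships @ cluster_colours)
plt.show()
```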



APPENDIX
Publications Relevant to This Thesis:
1. K-B. Zhang, M. A. Orgun, K. Zhang and Y. Zhang, Hypothesis Oriented Cluster Analysis in Data Mining by Visualization, Proceedings of the Working Conference on Advanced Visual Interfaces 2006 (AVI06), May 23-26, 2006, Venezia, Italy, ACM Press, pp. 254-257 (2006)

2. K-B. Zhang, M. A. Orgun and K. Zhang, HOV3: An Approach for Visual Cluster Analysis, Proceedings of The 2nd International Conference on Advanced Data Mining and Applications (ADMA 2006), Xi'an, China, August 14-16, 2006, Lecture Notes in Computer Science, Volume 4093, Springer Press, pp. 316-327 (2006)

3. K-B. Zhang, M. A. Orgun and K. Zhang, A Visual Approach for External Cluster Validation, Proceedings of the First IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007), Honolulu, Hawaii, USA, April 1-5, 2007, IEEE Press, pp. 576-582 (2007)

4. K-B. Zhang, M. A. Orgun and K. Zhang, Enhanced Visual Separation of Clusters by M-mapping to Facilitate Cluster Analysis, Proceedings of the 9th International Conference Series on Visual Information Systems (VISUAL 2007), June 28-29, 2007, Shanghai, China, Lecture Notes in Computer Science, Volume 4781, Springer Press, pp. 285-297 (2007)

5. K-B. Zhang, M. A. Orgun and K. Zhang, A Prediction-based Visual Approach for Cluster Exploration and Cluster Validation by HOV3, Proceedings of the 18th European Conference on Machine Learning / 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2007), Warsaw, Poland, September 17-21, 2007, Lecture Notes in Computer Science, LNAI 4702, Springer Press, pp. 336-349 (2007)

6. K-B. Zhang, M. A. Orgun and K. Zhang, Predictive Hypothesis Oriented Cluster Analysis by Visualization, Journal of Data Mining and Knowledge Discovery (2007) (submitted)

Publications Not Relevant to This Thesis:

7. K-B. Zhang, M. A. Orgun and K. Zhang, "Compiled Visual Programs by VisPro", Pan-Sydney Area Workshop on Visual Information Processing, Sydney, Australia, December 2003, Australian Computer Society Press, Vol. 36, pp. 113-117 (2004)

8. K-B. Zhang, K. Zhang and M. A. Orgun, "Semantic Specifications in Reserved Graph Grammars", The Ninth International Conference on Distributed Multimedia Systems (DMS'2003), Florida International University, Miami, Florida, USA, September (2003)



BIBLIOGRAPHY
[AAP+03] A. L. Abul, R. Alhajj, F. Polat, and K. Barker, Cluster Validity Analysis Using Subsampling, Proceedings of IEEE International Conference on Systems, Man, and Cybernetics, IEEE Press, Vol. 2: pp. 1435-1440 (2003) [ABK+99] M. Ankerst, M. Breunig, H.-P. Kriegel, J. Sander OPTICS: Ordering Points To Identify the Clustering Structure, Proceedings of ACM SIGMOD 99, International Conference on Management of Data, Philadelphia, PA. pp. 49-60 (1999) [AbM98] A. J. Abrantesy , J. S. Marques, A Method for Dynamic Clustering of Data, Proceedings of the British Machine Vision Conference 1998, BMVC 1998, Southampton, UK, 1998. British Machine Vision Association, pp.154-163 (1988) [AEK00] M.Ankerst, M. Ester M, H. P. Kriegel, Towards an Effective Cooperation of the Computer and the User for Classification, Proceedings of. ACM SIGKDD International Conference On Knowledge Discovery & Data Mining (KDD 2000), Boston, MA, pp. 179-188 (2000) [AGG+98] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, Proceedings of the ACM SIGMOD Conference, Seattle, WA., pp.94-105 (1998) [AhW95] C. Ahlberg, E. Wistrand, IVEE: An Environment for Automatic Creation of Dynamic Queries Applications, Proceedings of Human Factors in Computing Systems CHI 95 Conference, Demo Program, Denver, CO (1995) [AKK96] M. Ankerst, D. A. Keim, H.-P. Kriegel, Circle Segments: A Technique for Visually Exploring Large Multidimensional Data Sets, Proceedings of Visualization 96, Hot Topic Session, San Francisco, CA, 1996. [ALA+03] O. Abult, A. Lo, R. Alhajjt, F. Polat, K. Barked, Cluster Validity Analysis Using Subsampling, Proceedings of IEEE International Conference on Systems, Man and Cybernetics (IEEE-SMC), Vol.2 pp.1435- 1440 (2003) [AlC91] B. Alpern, L. Carter. Hyperbox, Proceedings of Visualization 91, San Diego, CA, pp.133-139 (1991) [And72] D. F. Andrews, Plots of High-Dimensional Data, Biometrics, Vol. 29, pp. 125-136 (1972) [And73] M. Anderberg, Cluster Analysis for Applications. New York: Academic (1973) [AnK01] M. Ankerst, and D. Keim, Visual Data Mining and Exploration of Large Databases, Proceedings of 5th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'01), Freiburg, Germany, September 2001


[Asi85] D. Asimov, The grand tour: A tool for viewing multidimensional data, SIAM Journal on Scientific and Statistical Computing, Vol. 6(1), pp. 128-143 (1985)

[BaB99] A. Baraldi, P. Blonda, A Survey of Fuzzy Clustering Algorithms for Pattern Recognition, Part I, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 29(6), pp.778-785 (1999)

[BeC96] D. J. Berndt and J. Clifford, Finding patterns in time series: A dynamic programming approach, in Advances in Knowledge Discovery and Data Mining, Menlo Park, CA: AAAI/MIT Press, 1996, pp. 229-248

[Bed90] J. Beddow, Shape Coding of Multidimensional Data on a Microcomputer Display, Proceedings of Visualization 90, San Francisco, CA, 1990, pp. 238-246

[BeG03] A. Ben-Hur and I. Guyon, Detecting stable clusters using principal component analysis, Methods in Molecular Biology, M.J. Brownstein and A. Kohodursky (eds.), Humana Press, pp.159-182 (2003)

[BEG02] A. Ben-Hur, A. Elisseeff and I. Guyon, A stability based method for discovering structure in clustered data, Proceedings of the Pacific Symposium on Biocomputing (2002)

[Ber06] P. Berkhin, A Survey of Clustering Data Mining Techniques, in J. Kogan, C. Nicholas and M. Teboulle (eds.), Grouping Multidimensional Data, Springer Press, pp. 25-72 (2006)

[BeR78] J. R. Beniger and D. L. Robyn, Quantitative graphics in statistics: A brief history, The American Statistician, Vol 32(1), pp. 1-9 (1978)

[Bes] C. Best, http://www.computationalgroup.com/tigertiger/cb/index.html

[Bez81] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York (1981)

[BLO+03] M. Buerki, K.O. Lovblad, H. Oswald, A.C. Nirkko, P. Stein, C. Kiefer and G. Schroth, Multiresolution fuzzy clustering of functional MRI data, Neuroradiology, Vol.45, pp.691-699 (2003)

[BPR+04] C. Baumgartner, C. Plant, K. Kailing, H-P. Kriegel, P. Kroger, Subspace Selection for Clustering High-Dimensional Data, Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM04), pp.11-18 (2004)

[CBC+95] D. Cook, A. Buja, J. Cabrera, and C. Hurley, Grand tour and projection pursuit, Journal of Computational and Graphical Statistics, vol. 23, pp.155-172 (1995)

[Che07] C. Chen, A Brief History of Data Visualization, in W. Hardle and A. Unwin (eds.), Handbook of Computational Statistics: Data Visualization, Vol III, Springer (2007)

[Che73] H. Chernoff, The Use of Faces to Represent Points in k-Dimensional Space Graphically, Journal Amer. Statistical Association, Vol. 68, pp.361-368 (1973)


[Chi00] E. Chi. A taxonomy of visualization techniques using the data state reference model, Proceedings of the Symposium on Information Visualization (InfoVis2000), pp.69-75 (2000) [CHS04] W.-P. Chen, J. C. Hou, L. Sha, Dynamic Clustering for Acoustic Target Tracking in Wireless Sensor Networks, IEEE Transactions on Mobile Computing, Vol. 3 (3), JulySeptember 2004, pp. 358-371 [CKS+88]] P. Cheeseman, J. Kelly, M. Self, J. Stutz, W. Taylor, and D. Freeman, AutoClass: A bayesian classification system, Proceedings of 5th International Conference on Machine Learning, Morgan Kaufmann, pp. 54-64 (1988) [Cle93] W. S. Cleveland, Visualizing Data, AT&T Bell Laboratories, Murray Hill, NJ, Hobart Press, Summit NJ, (1993) [CMS99] S. K. Card, J. D. Mackinlay, and B. Shneiderman, editors. Readings in Information Visualization: Using Vision to Think. Morgan Kaufmann, San Francisco, 1999. [DMS98] I. S. Dhillon, D. S. Modha, and W. S. Spangler, Visualizing class structure of multidimensional data, the 30th Symposium on the Interface: Computing Science and Statistics,Vol. 30, pp.488493 (1998) [Dom01] B. Dom, "An information-theoretic external cluster-validity measure", Research Report, IBM T.J. Watson Research Center RJ 10219 (2001) [DuJ79] R. Dubes and A. K. Jain, "Validity studies in clustering methodologies", Pattern Recognition, Vol. 1(1), pp.235-254 (1979) [EKS+96] M. Ester, H-P Kriegel, J. Sander, X. Xu, A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), pp.226-231 (1996) [ELL01] B. Everitt, S. Landau , and M. Leese, Cluster Analysis. London: Arnold, 2001. [Fai98] Mark D. Fairchild, Color Appearance Models, Addison-Wesley, Reading, MA (1998) [FaL95] C. Faloutsos and K. Lin, Fastmap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia data sets Proceedings of ACM-SIGMOD95, pp.163-174. (1995) [FGW02] U. Fayyad, G. Grinstein and A. Wierse (eds.), Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann Publishers, 2002 [Fie79] S. E. Fienberg, Graphical methods in statistics, American Statisticians Vol.33 pp. 165-178 (1979) [Fis87] D. Fisher, Improving Inference through Conceptual Clustering, Proceedings of 1987 AAAI Conferences, Seattle Washington, pp.461-465 (1987)


[FoM83] E. Fowlkes and C. Mallows, A method for comparing two hierarchical clusterings, Journal of American Statistical Association,Vol. 78, pp.553569 (1983) [FrD01] J Fridlyand J. and Dudoit S., Applications of resampling methods to estimate the number of clusters and to improve the accuracy of a clustering method, University of California, Statistics Department Technical Report, No.600 (2001) [FrD07] M. Friendly, D. J. Denis, Milestones in the history of thematic cartography, statistical graphics, and data visualization, http://www.math.yorku.ca/SCS/Gallery /milestone/Visualization_Milestones.pdf, York University, Canada (2007) [FrT74] J. Friedman, J. Tukey, A Projection Pursuit Algorithm for Exploratory Data Analysis, IEEE Transactions on Computers, Vol. 23, pp. 881-890 (1974) [FuB94] G. W. Furnas, A. Buja, Prosections Views: Dimensional Inference through Sections and Projections, Journal of Computational and Graphical Statistics, Vol. 3(4), pp.323-353 (1994) [Fuk90] K. Fukunaga, "Introduction to Statistical Pattern Recognition, San Diego CA, Academic Press (1990) [GMH+94] G. Grinstein T. Mihalisin, H. Hinterberger A. Inselberg, Visualizing multidimensional (multivariate) data and relations, Proceedings of the conference on Visualization '94, IEEE Visualization, pp. 404-409 (1994) [Gor98] A. D. Gordon, "Cluster validation, Data Science, Classification, and Related Methods, C. Hayashi, N. Ohsumi, K. Yajima, Y. Tanaka, H-H. Bock and Y. Baba Edited Springer, Tokyo, pp 22-39(1998) [GRS98] S. Guha, R. Rastogh and K. Shim, CURE: An efficient clustering algorithm for large databases, Proceedings of ACM SIGMOD Conference 98, pp.7384 (1998) [HaK01] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers (2001) [HaJ97] P. Hansen and B. Jaumard, Cluster analysis and mathematical programming, Math. Program, Vol. 79, pp.191215 (1997) [HaV01] M. Halkidi and M. Vazirgiannis, Clustering Validity Assessment: Finding the Optimal Partitioning of a Data Set, Proceedings of ICDM 2001, pp. 187-194 (2001) [Har75] J. Hartigan, Clustering Algorithms. New York: Wiley (1975) [HBV01] M. Halkidi, Y. Batistakis and M. Vazirgiannis M., On Clustering Validation Techniques, Journal of Intelligent Infomation Systems, Vo1.7(2-3) (2001) [HBV02] M. Halkidi, Y. Batistakis, M. Vazirgiannis: Cluster Validity Methods: Part I&II, SIGMOD Record,Vol. 31(2-3) (2002)


[HCN01] Z. Huang, D. W. Cheung, M. K. Ng, An Empirical Study on the Visual Cluster Validation Method with Fastmap, Proceedings of the 7th International Conference on Database Systems for Advanced Applications, pp. 84-91 (2001) [Hea95] M. Hearst, TileBars: Visualization of Term Distribution Information in Full Text Information Access, Proceedings of ACM Human Factors in Computing Systems Conference, (CHI'95), pp.59-66 (1995) [HGM97] P. Hoffman, G. Grinstein,. K. Marx, I. Grosse, and E. Stanley, Dna visual and analytic data mining, IEEE Visualization, pp. 437-442 (1997) [HiK98] A. Hinneburg and D. Keim, "An Efficient Approach to Clustering in Large Multimedia Databases with Noise, Proceedings of KDD-98 (1998) [HKK05] [J. Handl, J. Knowles and D. B. Kell, Computational cluster validation in postgenomic data analysis, Journal of Bioinformatics, Vol. 21(15), pp.3201-3212 (2005) [HKW99] A. Hinneburg, D. A. Keim., M, Wawryniuk, HD-Eye:Visual Mining of HighDimensional Data, IEEE Computer Graphics and Applications, Volume 19, Issue 5 (September 1999), pp.22-31 [HKW03] A. Hinneburg, D. A. Keim., M, Wawryniuk,HD-Eye-Visual Clustering of High dimensional Data, Proceedings of the 19th International Conference on Data Engineering, pp.753-755 (2003) [HoG01] Patrick E. Hoffman Georges G. Grinstein, A survey of visualizations for multidimensional data mining, Information visualization in data mining and knowledge discovery, Morgan Kaufmann Publishers Inc, pp. 47-82, 2001 [Hub85] P. J. Huber, Projection Pursuit, The Annals of Statistics, Vol. 13 (2), pp.435-474 (1985) [HuL00] Z. Huang and T. Lin, A visual method of cluster validation with Fastmap, Proceedings of PAKDD-2000, pp.153- 164 (2000) [InD90] A. Inselberg, B. Dimsdale, Parallel Coordinates: A Tool for Visualizing MultiDimensional Geometry, Proceedings of Visualization 90, San Francisco, CA, pp. 361-370 (1990) [Ins85] A. Inselberg, The Plane with Parallel Coordinates, Special Issue on Computational Geometry, The Computer, Vol. 1, pp. 69-97 (1985) [Ins97] A. Inselberg, Multidimensional Detective, Proceedings of IEEE Information Visualization '97 pp.100-107 (1997) [JaD88] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall Press (1988) [Jac08] S. Jaccard, Nouvelles recherches sur la distribution florale, Bull. Soc. Vaud. Sci. Nat., 44, pp.223-270 (1908)


[JMF99] A. Jain, M. N. Murty and P. J. Flynn, Data Clustering: A Review, ACM Computing Surveys, Vol. 31(3), pp. 264-323 (1999)

[Jol02] I. T. Jolliffe, Principal Component Analysis, Springer Press (2002)

[Kan00] E. Kandogan, Star Coordinates: A Multi-dimensional Visualization Technique with Uniform Treatment of Dimensions, IEEE Symposium on Information Visualization 2000, Salt Lake City, Utah, pp.4-8 (2000)

[KaR90] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: an Introduction to Cluster Analysis, John Wiley & Sons (1990)

[KDN+96] T. Kanungo, B. Dom, W. Niblack, and D. Steele, A fast algorithm for mdl-based multi-band image segmentation, in Image Technology, J. Sanz (ed.), Springer-Verlag (1996)

[KeC00] M. K. Kerr and G. A. Churchill, Bootstrapping Cluster Analysis: Assessing the Reliability of Conclusions from Microarray Experiments, Proceedings of the National Academy of Sciences (2000)

[Kei01] D. A. Keim, Visual exploration of large data sets, Communications of the ACM, Vol. 44(8), pp. 38-44 (2001)

[Kei02] D. A. Keim, Information Visualization and Data Mining, IEEE Transactions on Visualization and Computer Graphics, Vol. 7(1), January-March 2002, pp.100-107 (2002)

[KeK94] D. A. Keim and H.-P. Kriegel, VisDB: Database Exploration Using Multidimensional Visualization, IEEE Computer Graphics and Applications, 14(5), pp. 40-49 (1994)

[KHK99] G. Karypis, E.-H. S. Han, and V. Kumar, Chameleon: hierarchical clustering using dynamic modeling, IEEE Computer, vol. 32(8), pp.68-75 (1999)

[KKA95] D.A. Keim, H.-P. Kriegel, M. Ankerst, Recursive Pattern: A Technique for Visualizing Very Large Amounts of Data, Proceedings of Visualization 95, Atlanta, GA, pp. 279-286 (1995)

[KOC+04] S. Kabelac, S. Olbrich, K. Chmielewski, K. Meier, C. Holzknecht, 3D Visualization of Molecular Simulations in High-performance Parallel Computing Environments, Journal of Molecular Simulation, Volume 30(7), June 2004, Taylor and Francis Ltd., pp. 469-477 (2004)

[Koh97] T. Kohonen, Self-Organizing Maps, Springer, Berlin, second extended edition (1997)

[KrW78] J. B. Kruskal, M. Wish, Multidimensional Scaling, SAGE university paper series on quantitative applications in the social sciences, Sage Publications, CA, pp. 07-011 (1978)

[KSP01] S. Kaski, J. Sinkkonen and J. Peltonen, Data Visualization and Analysis with Self-Organizing Maps in Learning Metrics, DaWaK 2001, LNCS 2114, pp.162-173 (2001)


[LeD01] E. Levine and E. Domany, "Resampling Method for Unsupervised Estimation of Cluster Validity", Neural Computation. 2001. [Lev91] H. Levkowitz, Color icons: Merging color and texture perception for integrated visualization of multiple parameters, Proceedings of the 2nd conference on Visualization '91, San Diego, CA, pp. 164-170 (1991) [LKS+04] J. Lin, E. Keogh, S. Lonardi, J. Lankford and D. M. Nystrom, Visually Mining and Monitoring Massive Time Series, KDD 04, August 22-25, 2004, Seattle, Washington, U.S.A (2004) [Mac67] J. B. MacQueen, Some Methods for classification and Analysis of Multivariate Observations, Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, pp.281-297 (1967) [MaT03] K.-L. Ma, S. T. Teoh, StarClass: Interactive Visual Classification Using Star Coordinates, Proceedings of the 3rd SIAM International Conference on Data Mining, pp. 178-185 (2003) [MaW] The MathWorks, Inc. textbook online, http://www.mathworks.com/ [MiG04] James R Miller, E A Gustavo; "The Immersive Visualization Probe for Exploring nDimensional Spaces", Proceedings of IEEE Computer Graphics and Applications 2004, pp.76-85 (2004) [MiI80] G. W. Milligan, and P. D. Isaac, The validation of four ultrametric clustering algorithms, Pattern Recognition, Vol. 12, pp.41-50 (1980) [Mil81] G. W. Milligan, A Monte Carlo study of thirty internal criterion measures for cluster analysis, Psychometrika, Vol. 46 (2), pp. 187-199 (1981) [Mil96] G. W. Milligan, Clustering validation: results and implications for applied analysis. in Clustering and Classification ed. P. Arabie, L. J. Hubert and G. (1996) De Soete, World Scientific, pp.34 1-375. [MRC02] A. Morrison, G. Ross and M. Chalmers, Combining and comparing clustering and layout algorithms, University of Glasgow (2002) [MSS83] G.W. Milligan, L.M. Sokol, and S.C. Soon The effect of cluster size, dimensionality and the number of clusters on recovery of true cluster structure, IEEE Trans PAMI, Vol. 5(1), pp. 40-47 (1983) [OlL03] F. Oliveira, H. Levkowitz, From Visual Data Exploration to Visual Data Mining: A Survey, IEEE Trans.Vis.Comput. Graph, Volume 9(3), pp.378-394 (2003) [OlP07] J. V. de Oliveira and W. Pedrycz (Editor), Advances in Fuzzy Clustering and its Applications, Wiley, June (2007) [Pei] http://www.cs.sfu.ca/~jpei/


[PGW03] E. Pampalk, W. Goebl, and G. Widmer, Visualizing Changes in the Structure of Data for Exploratory Feature Selection, Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (SIGKDD 03), August 24-27, 2003, Washington, DC, USA pp.157-166 (2003) [Pic70] R. M. Pickett, Visual Analyses of Texture in the Detection and Recognition of Objects, in: Picture Processing and Psycho-Pictorics, Lipkin B. S., Rosenfeld A. (eds.), Academic Press, New York (1970) [Ran71] W. M. Rand, Objective Criteria for the Evaluation of Clustering Methods, Journal of the American Statistical Association, Vol 66, pp. 846-850 (1971) [Rei95] S. P. Reiss, An Engine for the 3D Visualization of Program Information, Journal of Visual Languages and Computing, Vol. 6, pp. 299-323 (1995) [RBL+02] V. Roth, M. L. Braun, T. Lange and J. M. Buhmann "Stability-Based Model Order Selection in Clustering with Applications to Gene Expression Data", Lecture Notes In Computer Science; Vol. 2415, Proceedings of the International Conference on Artificial Neural Networks, pp.607-612 (2002) [RKJ+99] W. Ribarsky, J. Katz, F. Jiang, A. Holland, Discovery Visualization using Fast Clustering, IEEE Computer Graphics and Applications, Vol. 19(5) 1999. [RSE99] R.M. Rohrer, J.L. Sibert, D.S. Ebert, Shape-based Visual Interface for Text Retrieval, IEEE Computer Graphics and Applications, Vol. 19(5) 1999. [SBG00] T.C. Sprenger, R. Brunella, M. H.Gross, H-BLOB: a hierarchical visual clustering method using implicit surfaces, Proceedings of Visualization 2000, pp. 61-68 (2000) [SCZ98] G. Sheikholeslami, S. Chatterjee, A. Zhang, Wavecluster: A multi-resolution clustering approach for very large spatial databases, Proceedings of Very Large Databases Conference (VLDB98), pp.428-439 (1998) [SDT+95] H. Su, H. Dawkes, L. Tweedie, R. Spence, An Interactive Visualization Tool for Tolerance Design, Technical Report, Imperial College, London, (1995) [SeS05] J. Seo and B. Shneiderman, From Integrated Publication and Information Systems to Virtual Information and Knowledge Environments, Essays Dedicated to Erich J. Neuhold on the Occasion of His 65th Birthday. Lecture Notes in Computer Science Vol.3379, Springer (2005) [SGF71] J.H. Siegel, R. M. Goldwyn and H. P. Friedman, Irregular polygon to represent multivariate data (with vertices of equal intervals, distanced from the centre proportionally to the value of the variable), USA (1971) October [Sha96] S. Sharma, Applied multivariate techniques, John Wiley & Sons, Inc. (1996) [Shn01] B. Shneiderman, Inventing Discovery Tools: Combining Information Visualization with Data Mining, Proceedings of Discovery Science 2001,Lecture Notes in Computer Science Vol 2226, pp.17-28 (2001)


[Shn02] B. Shneiderman, Inventing discovery tools: Combining information visualization with data mining, Information Visualization, Vol. 1, pp. 5-12 (2002)

[ShY06] J. Shaik and M. Yeasin, Visualization of High Dimensional Data using an Automated 3D Star Co-ordinate System, Proceedings of the International Joint Conference on Neural Networks (IJCNN '06), 16-21 July 2006, Vancouver, Canada, IEEE Press, pp. 1339-1346 (2006)

[Sim93] P. K. Simpson, Fuzzy min-max neural networks. Part II: Clustering, IEEE Trans. Fuzzy Systems, Vol. 1(1), pp. 32-45 (1993)

[Smi86] W. A. Smith, Elementary Numerical Analysis, Prentice-Hall (1986)

[Thk99] S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press (1999)

[VSA05] R. Vilalta, T. Stepinski and M. Achari, An Efficient Approach to External Cluster Assessment with an Application to Martian Topography, Technical Report No. UH-CS-0508, Department of Computer Science, University of Houston (2005)

[War95] M. Ward, High dimensional brushing for interactive exploration of multivariate data, Proceedings of Visualization '95, pp. 271-278 (1995)

[WiL93] J. J. van Wijk and R. D. van Liere, HyperSlice, Proceedings of the Visualization '93 Conference, San Jose, CA, pp. 119-125 (1993)

[WTP+95] J. A. Wise, J. J. Thomas, K. Pennock, D. Lantrip, M. Pottier, A. Schur and V. Crow, Visualizing the Non-Visual: Spatial Analysis and Interaction with Information from Text Documents, Proceedings of the Symposium on Information Visualization 1995, Atlanta, GA, pp. 51-58 (1995)

[WYM97] W. Wang, J. Yang and R. Muntz, STING: A statistical information grid approach to spatial data mining, Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB '97), pp. 186-195 (1997)

[XEK+98] X. Xu, M. Ester, H.-P. Kriegel and J. Sander, A distribution-based clustering algorithm for mining in large spatial databases, Proceedings of the IEEE International Conference on Data Engineering (ICDE '98), pp. 324-331 (1998)

[XuW05] R. Xu and D. C. Wunsch, Survey of Clustering Algorithms, IEEE Transactions on Neural Networks, Vol. 16(3), May 2005, pp. 645-678 (2005)

[Yan03] L. Yang, Visual Exploration of Large Relational Data Sets through 3D Projections and Footprint Splatting, IEEE Transactions on Knowledge and Data Engineering, Vol. 15(6), pp. 1460-1471, November/December (2003)

[ZHW+03] X. Zheng, P. He, F. Wan, Z. Wang and G. Wu, Dynamic Clustering Analysis of Documents Based on Cluster Centroids, Proceedings of the Second International Conference on Machine Learning and Cybernetics, Xi'an, 2-5 Nov. 2003, IEEE Press, Vol. 1, pp. 194-198 (2003)


[ZOZ+06] K.-B. Zhang, M. A. Orgun, K. Zhang and Y. Zhang, Hypothesis Oriented Cluster Analysis in Data Mining by Visualization, Proceedings of the Working Conference on Advanced Visual Interfaces 2006 (AVI '06), May 23-26, 2006, Venezia, Italy, ACM Press, pp. 254-257 (2006)

[ZOZ06] K.-B. Zhang, M. A. Orgun and K. Zhang, HOV3: An Approach for Visual Cluster Analysis, Proceedings of the 2nd International Conference on Advanced Data Mining and Applications (ADMA 2006), Xi'an, China, August 14-16, 2006, Lecture Notes in Computer Science, Vol. 4093, Springer Press, pp. 316-327 (2006)

[ZOZ07a] K.-B. Zhang, M. A. Orgun and K. Zhang, A Visual Approach for External Cluster Validation, Proceedings of the First IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007), Honolulu, Hawaii, USA, April 1-5, 2007, IEEE Press, pp. 576-582 (2007)

[ZOZ07b] K.-B. Zhang, M. A. Orgun and K. Zhang, Enhanced Visual Separation of Clusters by M-mapping to Facilitate Cluster Analysis, Proceedings of the 9th International Conference on Visual Information Systems (VISUAL 2007), June 28-29, 2007, Shanghai, China, Lecture Notes in Computer Science, Vol. 4781, Springer Press, pp. 288-300 (2007)

[ZOZ07c] K.-B. Zhang, M. A. Orgun and K. Zhang, A Prediction-based Visual Approach for Cluster Exploration and Cluster Validation by HOV3, Proceedings of the 18th European Conference on Machine Learning / 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2007), Warsaw, Poland, September 17-21, 2007, Lecture Notes in Computer Science, LNAI 4702, Springer Press, pp. 336-349 (2007)

[ZOZ07d] K.-B. Zhang, M. A. Orgun and K. Zhang, Predictive Hypothesis Oriented Cluster Analysis by Visualization, Journal of Data Mining and Knowledge Discovery (2007) (submitted)

[ZRL96] T. Zhang, R. Ramakrishnan and M. Livny, BIRCH: An Efficient Data Clustering Method for Very Large Databases, Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 103-114 (1996)


Hypothesis Oriented Cluster Analysis in Data Mining by Visualization


Ke-Bing Zhang, Mehmet A. Orgun, Yihao Zhang
Department of Computing, Macquarie University, Sydney, NSW 2109, Australia
{kebing, mehmet}@ics.mq.edu.au, yihao@ics.mq.edu.au

Kang Zhang
Department of Computer Science, The University of Texas at Dallas, Richardson, TX 75083-0688, USA
kzhang@utdallas.edu

. ! " # $ % & ! ! $

& ! & 2* = 4 $ ! 9 & . 2-*4 1

' ( $
'() $ &

(
% *+

(
'() & .

$ 2->4 . $ ! 2? 4 9 $

,
( ( $ + & (.

2)4 # 9

- ./
1

'+ 0
23 4 #

.'/ .'/

*
+

:@

: + # ' , A '0
$ 2->4 +

<< '
9

2-) 5 -- 6 - 7 4 8 2-)4 & # 9 2--4 8 ; + : 27 4 ! 9 9 8 & B 8 C 9 + / 25 4 '< . 2-4 & 9 0 ! : 26 4 . 9

2? 4 &

*2? 4 & & 9 9 & 9& & &

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for prot or commercial advantage and that copies bear this notice and the full citation on the rst page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specic permission and/or a fee. AVI '06, May 23-26, 2006, Venezia, Italy. Copyright 2006 ACM 1-59593-353-0/06/0005...$5.00

254

1 % <

'() & B C
*

. K FD I 8 8 - < D-E
p j ( x, y ) = (
%

@
n

FD

L M

E F
k =1

N F - -O * *O L O

bka k
A, B A B

D*E

?& P P
n n

M cos( ) = P P M
u yi ( d ji min i ))

u xi ( d ji min i ),
i =1 i =1

D-E @

P PF

a1 2 + a 2 2 + ... + an 2
" D ?

D9

E Du
xi
C
j

+ %

u yi E
%F

P P F b1 2 + b 2 2 + ... + bn 2 I N < E

" G % >H %I J

uj =

j j

man min

9 %F
+

9 G % > H % I J"
& 8 % 9 9 ! . D D E E % @ + %FD %D*E
%*

& !
8 * ( % <

9 0

1 9 + % 1 D % E" E 9+ % 1 1 FD - * L % E % D F- L E + % 1 M
n

**'
:9 D

&&

'()
E D $ E &

I % 1 NF & 8M
*

%-O

% *O

LO
M

F
k =1

m i d jk
D E

D)E

9 !

% 9 & 8 D+ % 1 EFDI+ % 1 N E F # ' '() ( ( ! '() 1 $ ! % ! @ # 9 . 9 ! . E + % 9/ 1 @ D1 @ 1 D5 E

$F9 O
: Me
ix

= cos x + i sin x

255

@ D

z 0 = e 2 i/n " $> F-E


M
n

$>- $>* $>) L L $>


9

&-

$>
&1 ! '() $

D-E

p j ( z0 ) =
k =1

k [( d jk min d jk ) /( max d kj min d jk )] z 0 k k k

D6 E 9

min d jk
k

max d kj
k

) : B < : .1 : /
(. 1 @ 9 M -> (. MKK

# .
9 2)4 # # KR -6 >>> 1 9 (. ! @ .

'()
'() *>>> < K1 8 '() 0 &@ .

!
&
n

K M

p j (zk ) =
k =1

[( d jk min d jk ) /( max d kj min d jk )] z k D=E


k k k

z k = e i k "
n

9 "
k = 2
k =1

# D6 E

(d jk min d jk ) /( max d kj min d jk )


k k k

)>

&

D= E D= E
n

$ M

d jk ,

dN jk
p j (zk ) =
k =1

dN jk z k

D3 E &

. 9 Q <% D5 E
p j (zk ) =
k =1
!

&$ % D3 E D? EM
n

<% '()
zk

(.

dN jk m k

D? E

!
2? 4 (. 2)4 9 %

*)+
.

D3 E

'()

D? E 8
! D!F- L

'()

E (. K 9 % D E 8 6

D3 E 9 9

D5 E D? E '() . 9 2> -4 2&- -4 % (.

256

6 %

! '() $ ! . 9 9 # '() # 9 0

:9 & .

5
.

'/ @ 0 + ./Q
& $

:1
&

,
'() % % '() '() &

8 #

6 <

% (. 8 = '() 9 1 9 & :9 '() $

6
2-4 ' 2*4 @ 2)4 8

: 8: : / :
! ! 1 1 , Q 1 1 -7 7 7 + ( , + *>>1 <

.Q 1 '+
+

< <

S '< .

<

:9

+ $
B

D< , + + T>-E
( ) D5 E *6 3 & &

, @ @ (. M( ( $ S . *3 > *>>5 25 4 : 1 , <

(
S B

26 4 Q 2= 4

-7 7 = , Q Q 0

, <
:M

+ 1 $
$ < M

-7 7 ? < : &

'

.Q 1 '+

8 # '() (.

1 9 8 3

'()

. 23 4 S 2? 4 , 27 4 U : 8 2->4 2--4

!
1

*>>1 / $

: (

< S + )-D)E *= 5 &)*) -7 7 7 &

->3 &--= *>>C V Q V , M 8 ; + :M 8 + + / + . < 1 .Q 1 '+ *>>5 < -)&-? S *>>5 1 < 7 *-&7 ** *>>5 ! # , $ S S 8 + $ Q .: : : ( -7 D6 E )* W )7 -7 7 7 ! Q % V # M &

<

.Q , + +

8 .

3 1 '() (.

9 '()
257

-7 7 ? 2-*4 M . . ( $ @ -3 &*? *>>- < *>>2-)4V !

<

D(@ +

+ + @

M 1 1 + ***= . M

.Q 1 '+ 7 = 1

->)&--5 -7 7 =

. <

HOV3: An Approach to Visual Cluster Analysis


Ke-Bing Zhang1, Mehmet A. Orgun1, and Kang Zhang2
1 Department of Computing, Macquarie University, Sydney, NSW 2109, Australia
{kebing, mehmet}@ics.mq.edu.au
2 Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083-0688, USA
kzhang@utdallas.edu

Abstract. Clustering is a major technique in data mining. However, the numerical feedback of clustering algorithms makes it difficult for users to obtain an intuitive overview of the dataset that they deal with. Visualization has been proven to be very helpful for high-dimensional data analysis, and it is therefore desirable to introduce visualization techniques, together with the user's domain knowledge, into the clustering process. However, most existing visualization techniques used in clustering are exploration oriented, and are thus mainly stochastic and subjective in nature. In this paper, we introduce an approach called HOV3 (Hypothesis Oriented Verification and Validation by Visualization), which projects high-dimensional data onto the 2D space and reflects data distributions based on user hypotheses. In addition, HOV3 enables the user to adjust hypotheses iteratively in order to obtain an optimized view. As a result, HOV3 provides the user with an efficient and effective visualization method to explore cluster information.

1 Introduction
Clustering is an important technique that has been successfully used in data mining. The goal of clustering is to distinguish objects into groups (clusters) based on given criteria. In data mining, the datasets used in clustering are normally huge and of high dimensionality. Nowadays, the clustering process is mainly performed by computers with automated clustering algorithms. However, those algorithms favor clustering spherical or regularly shaped datasets, and are not very effective in dealing with arbitrarily shaped clusters, because they are based on the assumption that datasets have a regular cluster distribution. Several efforts have been made to deal with datasets with arbitrarily shaped data distributions [2], [11], [9], [21], [23], [25]. However, those approaches still have some drawbacks in handling irregularly shaped clusters. For example, CURE [11], FAÇADE [21] and BIRCH [25] perform well on low-dimensional datasets, but as the number of dimensions increases, they encounter high computational complexity. Other approaches, such as the density-based clustering techniques DBSCAN [9] and OPTICS [2], and the wavelet-based clustering technique WaveCluster [23], attempt to cope with this problem, but their non-linear complexity often makes them unsuitable for the analysis of very large datasets. In high-dimensional spaces, traditional clustering algorithms tend to break down in terms of efficiency as well as accuracy because data do not cluster well

anymore. The recent clustering algorithms applied in data mining are surveyed by Jain et al. [15] and Berkhin [4]. Visual data mining is mainly a combination of information visualization and data mining. In the data mining process, visualization can provide data miners with intuitive feedback on data analysis and support decision-making activities. In addition, visual presentations can be very powerful in revealing trends, highlighting outliers, showing clusters, and exposing gaps in data [24]. Many visualization techniques have been employed to study the structure of datasets in cluster analysis applications [18]. However, in practice, those visualization techniques treat the problem of cluster visualization simply as a layout problem. Several visualization techniques have been developed for cluster discovery [2], [6], [16], but they are more exploration oriented, i.e., stochastic and subjective in the cluster discovery process. In this paper, we propose a novel approach, named HOV3, Hypothesis Oriented Verification and Validation by Visualization, which projects the data distribution based on given hypotheses by visualization in 2D space. Our approach adopts user hypotheses (quantitative domain knowledge) as measures in the cluster discovery process, to reveal the gaps between the data distribution and the measures. It is thus more object/goal oriented and measurable. The rest of this paper is organized as follows. Section 2 briefly reviews related work on cluster analysis and visualization in data mining. Section 3 provides a more detailed account of our approach HOV3 and its mathematical description. Section 4 demonstrates the application of our approach on several well-known datasets from the data mining area to show its effectiveness. Finally, Section 5 evaluates our approach and provides a succinct summary.

2 Related Work
Cluster analysis aims to find patterns (clusters) and relations among the patterns in large multi-dimensional datasets. In high-dimensional spaces, traditional clustering algorithms tend to break down in terms of efficiency as well as accuracy because data do not cluster well anymore. Thus, using visualization techniques to explore and understand high-dimensional datasets is becoming an efficient way to combine human intelligence with the immense brute-force computation power available nowadays [19]. Many studies have been performed on high-dimensional data visualization [18]. However, most of those visualization approaches have difficulty dealing with high-dimensional and very large datasets. For example, icon-based methods [7], [17], [20] can display the high-dimensional properties of data, but as the amount of data increases substantially, the user may find it hard to understand most properties of the data intuitively, since the user cannot focus on the details of each icon. Plot-based data visualization approaches such as Scatterplot-Matrices [8] and similar techniques [1], [5] visualize data in rows and columns of cells containing simple graphical depictions. This kind of technique gives bi-attribute visual information, but does not give the best overview of the whole dataset. As a result, such techniques are not able to present clusters in the dataset very well.


Parallel Coordinates [14] utilizes equidistant parallel axes to visualize each attribute of a given dataset and projects multiple dimensions onto a two-dimensional surface. Star Plots [10] arranges coordinate axes on a circle with equal angles between neighbouring axes from the centre of the circle, and links the data points on each axis by lines to form a star. In principle, those techniques can provide visual presentations of any number of attributes. However, neither parallel coordinates nor star plots are adequate to give the user a clear overall insight into the data distribution when the dataset is huge, primarily due to unavoidably high overlapping. Another drawback of these two techniques is that, while they can supply a more intuitive visual relationship between neighbouring axes, the visual presentation for non-neighbouring axes may confuse the user's perception. HD-Eye [12] is an interactive visual clustering system based on density plots of any two interesting dimensions. The 1D-visualization-based OPTICS [2] works well in finding basic arbitrarily shaped clusters, but it lacks the ability to help the user understand inter-cluster relationships. The approaches most relevant to our research are Star Coordinates [16] and its extensions, such as VISTA [6]. Star Coordinates arranges coordinate axes on a two-dimensional surface, where each axis shares the same origin point. This approach utilizes a point to represent a vector element. We give a more detailed discussion of Star Coordinates, in contrast with our model, in the next section. The recent surveys [3], [13] provide a comprehensive summary of high-dimensional visualization approaches in data mining.

3 Our Approach
Data mining approaches are roughly categorized into discovery driven and verification driven [22]. Discovery driven approaches attempt to discover information automatically by using appropriate tools or algorithms, while verification driven approaches aim at validating a hypothesis derived from user domain knowledge. The discovery driven method can be regarded as discovering information by exploration, and the verification driven approach can be thought of as discovering information by verification. Star Coordinates [16] is a good choice as an exploration discovery tool for cluster analysis in a high-dimensional setting. The Star Coordinates technique and its salient features are briefly presented below.

3.1 Star Coordinates

Star Coordinates arranges the values of the n attributes of a database on n-dimensional coordinates laid out on a 2D plane. The minimum data value on each dimension is mapped to the origin, and the maximum value is mapped to the other end of the coordinate axis. Unit vectors on each coordinate axis are then calculated accordingly to allow scaling of data values to the length of the coordinate axes. Finally, the values on the n-dimensional coordinates are mapped to the orthogonal coordinates X and Y, which share the origin point with the n-dimensional coordinates. Star Coordinates uses x-y values to represent a set of points on the two-dimensional surface, as shown in Fig. 1.


Fig. 1. Positioning a point by an 8-attribute vector in Star Coordinates [16]

Formula (1) states the mathematical description of Star Coordinates:

$$p_j(x, y) = \Big( \sum_{i=1}^{n} u_{xi}(d_{ji} - \min_i),\ \sum_{i=1}^{n} u_{yi}(d_{ji} - \min_i) \Big) \qquad (1)$$

Here $p_j(x, y)$ is the normalized location of $D_j = (d_{j1}, d_{j2}, \ldots, d_{jn})$, where $d_{ji}$ is the coordinate point of the jth record of a dataset on $C_i$, the ith coordinate in Star Coordinates space; $u_{xi}(d_{ji} - \min_i)$ and $u_{yi}(d_{ji} - \min_i)$ are the unit-vector projections of $d_{ji}$ onto the X direction and the Y direction respectively, where $u_i = C_i/(\max_i - \min_i)$, $\min_i = \min(d_{ji}, 0 \le j < m)$, $\max_i = \max(d_{ji}, 0 \le j < m)$, and m is the number of records in the dataset. By mapping high-dimensional data into two-dimensional space, Star Coordinates inevitably produces data overlapping and ambiguities in visual form. To mitigate these drawbacks, Star Coordinates established visual adjustment mechanisms, such as scaling the weight of the attributes on a particular axis, rotating the angles between axes, marking the data points in a certain area by coloring, and selecting data value ranges on one or more axes and marking the corresponding data points in the visualization [16]. However, Star Coordinates is a typical method of exploration discovery. Numerically supported (quantitative) cluster analysis is time consuming and inefficient, while visual (qualitative) clustering approaches such as Star Coordinates are subjective, stochastic, and lack precision. To address the problem of precision in visual cluster analysis, we introduce a new approach in the next section.
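For concreteness, the following MATLAB sketch implements formula (1) under the assumption of evenly spaced axes; the function name and the even axis layout are our illustrative choices, not part of the original Star Coordinates system.

% A minimal sketch of the Star Coordinates mapping in formula (1),
% assuming the n coordinate axes C_i are laid out evenly on the unit
% circle and max_i > min_i for every attribute. Names are illustrative.
function P = star_coordinates(D)
    [m, n] = size(D);                     % m records, n attributes
    theta  = 2*pi*(0:n-1)/n;              % evenly spaced axis angles
    U = [cos(theta); sin(theta)] ./ repmat(max(D,[],1) - min(D,[],1), 2, 1);
                                          % unit vectors u_i = C_i/(max_i - min_i)
    P = (D - repmat(min(D,[],1), m, 1)) * U.';   % row j is p_j(x, y)
end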
3.2 Our Approach HOV3

Having a precise overview of the data distribution in the early stages of data mining is important, because correct insights into the data help data miners make decisions on adopting appropriate algorithms for the forthcoming analysis stages.
3.2.1 Basic Idea

Exploration discovery (qualitative analysis) is regarded as a pre-processing step for verification discovery (quantitative analysis), and is mainly used for building user hypotheses based on cluster detection or other techniques. But it is not an aimless or arbitrary process. Exploration discovery is an iterative process under the guidance of the user's domain knowledge. Each iteration of exploration feeds new insights back to users and enriches their domain knowledge of the dataset that they are dealing with. However, the way in which qualitative analysis is done by visualization mostly depends on each individual user's experience. Thus subjectivity, randomness and a lack of precision may be introduced in exploration discovery. As a result, quantitative analysis based on the result of imprecise qualitative analysis may be inefficient and ineffective. To fill the gap between imprecise cluster detection by visualization and the unintuitive results of clustering algorithms, we propose a new approach, called HOV3, which is a quantified-knowledge-based analysis and provides a bridging process between qualitative analysis and quantitative analysis. HOV3 synthesizes the feedback from exploration discovery and user domain knowledge to produce quantified measures, and then projects the test dataset against the measures. Geometrically, HOV3 reveals the data distribution against the measures in visual form. We give the mathematical description of HOV3 below.
3.2.2 Mathematical Model of HOV3

To project a high-dimensional space onto a two-dimensional surface, we adopt the Polar Coordinates representation, so that any vector can easily be transformed to the orthogonal coordinates X and Y. In analytic geometry, the difference of two vectors A and B can be presented by their inner/dot product, A·B. Let $A = (a_1, a_2, \ldots, a_n)$ and $B = (b_1, b_2, \ldots, b_n)$; then their inner product can be written as:

$$\langle A, B \rangle = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n = \sum_{k=1}^{n} a_k b_k \qquad (2)$$

Fig. 2. Vector B projected against vector A in Polar Coordinates

HOV3: An Approach to Visual Cluster Analysis

321

Then we have the equation:

$$\cos(\theta) = \frac{\langle A, B \rangle}{|A|\,|B|}$$

where $\theta$ is the angle between A and B, and |A| and |B| are the lengths of A and B respectively:

$$|A| = \sqrt{a_1^2 + a_2^2 + \cdots + a_n^2}, \qquad |B| = \sqrt{b_1^2 + b_2^2 + \cdots + b_n^2}.$$

Let A be a unit vector; the geometry of $\langle A, B \rangle$ in Polar Coordinates presents the gap from point B $(d_b, \theta)$ to point A, as shown in Fig. 2, where A and B are in 8-dimensional space.
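As a small worked instance of the angle formula, with made-up two-dimensional vectors:

$$A = (1, 0),\ B = (1, 1):\quad \cos(\theta) = \frac{\langle A, B \rangle}{|A|\,|B|} = \frac{1\cdot 1 + 0\cdot 1}{1\cdot\sqrt{2}} = \frac{1}{\sqrt{2}}, \qquad \theta = 45^{\circ}.$$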
Mapping to Measures

In the same way, a matrix $D_j$, i.e., a set of vectors (a dataset), can also be mapped to a measure vector M. As a result, the mapping projects the distribution of the matrix $D_j$ based on the vector M. Let $D_j = (d_{j1}, d_{j2}, \ldots, d_{jn})$ and $M = (m_1, m_2, \ldots, m_n)$; then the inner product of each vector $d_{ji}$ $(i = 1, \ldots, n)$ of $D_j$ with M has the same form as equation (2) and is written as:

$$\langle d_{ji}, M \rangle = m_1 d_{j1} + m_2 d_{j2} + \cdots + m_n d_{jn} = \sum_{k=1}^{n} m_k d_{jk} \qquad (3)$$

So the mapping $F: R^n \rightarrow R^2$ from an n-dimensional dataset to one measure (dimension) can be defined as:

$$F(D_j, M) = \langle D_j, M \rangle = \sum_{k=1}^{n} m_k d_{jk} \qquad (4)$$

where $D_j$ is a dataset with n attributes, and M is a quantified measure.


In Complex Number System

Since our experiments are run in MATLAB (MATLAB, The MathWorks, Inc.), in order to better understand our approach we use the complex number system. Let $z = x + iy$, where i is the imaginary unit. According to the Euler formula:

$$e^{ix} = \cos x + i \sin x$$

Let $z_0 = e^{2\pi i/n}$; then $z_0^1, z_0^2, z_0^3, \ldots, z_0^{n-1}, z_0^n$ (with $z_0^n = 1$) divide the unit circle on the complex plane into n equal sectors. The mapping in Star Coordinates (1) can now be simply written as:

$$p_j(z_0) = \sum_{k=1}^{n} \big[ (d_{jk} - \min_k d_{jk}) / (\max_k d_{jk} - \min_k d_{jk}) \big]\, z_0^k \qquad (5)$$

where $\min_k d_{jk}$ and $\max_k d_{jk}$ represent the minimal and maximal values of the kth attribute/coordinate respectively.


This is the case of an equally-divided circle surface. The more general form can be defined as:

$$p_j(z_k) = \sum_{k=1}^{n} \big[ (d_{jk} - \min_k d_{jk}) / (\max_k d_{jk} - \min_k d_{jk}) \big]\, z_k \qquad (6)$$

where $z_k = e^{i\theta_k}$, $\theta_k$ is the angle between neighbouring axes, and $\sum_{k=1}^{n} \theta_k = 2\pi$.

Since the term $(d_{jk} - \min_k d_{jk}) / (\max_k d_{jk} - \min_k d_{jk})$ in (5) and (6) is the normalized value of the original $d_{jk}$, we write it as $d^{N}_{jk}$. Thus formula (6) is written as:

$$p_j(z_k) = \sum_{k=1}^{n} d^{N}_{jk}\, z_k \qquad (7)$$

In any case these can be viewed as mappings from $R^n$ to C, the complex plane, i.e., $R^n \rightarrow C \cong R^2$.

Given a non-zero measure vector m in $R^n$ and a family of vectors $P_j$, the projections of $P_j$ against m according to formulas (4) and (7) give our model HOV3 as the following equation:

$$p_j(z_k) = \sum_{k=1}^{n} \big( d^{N}_{jk}\, m_k \big)\, z_k \qquad (8)$$

where $m_k$ is the kth attribute of measure m.
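To make the projection concrete, here is a minimal MATLAB sketch of equation (8), assuming equally spaced axes (the equal-sector case of (5)); the function and variable names are ours, not the original implementation.

% A minimal sketch of the HOV3 projection in equation (8), assuming
% equally spaced axes z_k = e^(2*pi*i*k/n) and max > min per attribute.
% All names are illustrative.
function p = hov3(D, m)
    [rows, n] = size(D);
    z    = exp(2i*pi*(1:n)/n);                    % n-th roots of unity as axes
    mins = repmat(min(D, [], 1), rows, 1);
    maxs = repmat(max(D, [], 1), rows, 1);
    dN   = (D - mins) ./ (maxs - mins);           % normalized values d^N_jk
    p    = (dN .* repmat(m(:).', rows, 1)) * z.'; % p_j = sum_k d^N_jk * m_k * z_k
    % With m = ones(1, n), this reduces to the Star Coordinates mapping (7).
end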


3.2.3 Discussion

In Star Coordinates, the purpose of scaling the weight of the attributes on a particular axis (called α-mapping in VISTA) is to adjust the contribution of the attribute laid on a specific coordinate through interactive actions, so that data miners may gain interesting cluster information that automated clustering algorithms cannot easily provide [6], [16]. Comparing the model of Star Coordinates in equation (7) with our model HOV3 in equation (8), we may observe that our model covers the model of Star Coordinates, given that the angles between coordinates are the same in both models. This is because any change of weights in the Star Coordinates model can be viewed as changing one or more values of $m_k$ $(k = 1, \ldots, n)$ in the measure vector m in equation (8) or (4). As a special case, when all values in m are set to 1, HOV3 clearly reduces to the Star Coordinates model (7), i.e., the no-measure case. In addition, either moving a coordinate axis to its opposite direction or scaling up the adjustment interval of an axis, for example from [0,1] to [-1,1] in VISTA, can also be regarded as negating the original measure value. Moreover, as a bridge between qualitative analysis and quantitative analysis, HOV3 not only supports quantified domain knowledge verification and validation, but can also directly utilize rich statistical analysis tools as measures and guide data miners with additional cluster information. We demonstrate several examples running in MATLAB in comparison to the same datasets running in the VISTA system [6] in the next section.


4 Examples and Explanation


In this section, we present several examples to demonstrate the advantages of using HOV3. We have implemented our approach in MATLAB running under Windows 2000 Professional. The results of our experiments with HOV3 are compared to those of VISTA, a Star Coordinates based system [6]. At this stage, we only employed several simple statistical methods on those datasets as measures. The datasets used in the examples are well known and can be obtained from the UCI machine learning website: http://www.ics.uci.edu/~mlearn/Machine-Learning.html.
4.1 Iris

The Iris dataset is perhaps the best-known in the pattern recognition literature. Iris has 3 classes, 4 numeric attributes and 150 instances. The diagram presented in Fig. 3 (left) is the initial data distribution in Star Coordinates produced by the VISTA system. Fig. 3 (right) shows the data distribution presented by HOV3 without any adopted measures. It can be observed that the shapes of the data distributions are almost identical in the two figures. Only the orientations of the two shapes are a little different, since VISTA shifts the appearance of the data by 30 degrees in the counter-clockwise direction.

Fig. 3. The original data distribution in VISTA system (left) and its distribution by HOV3 in MATLAB (right)

Fig. 4 illustrates the results after several random weight adjustment steps. In Fig. 4, it can be observed very clearly that there are three data groups (clusters). The initial data distribution cannot provide data miners with a clear idea about the clusters; see Fig. 3 (left). Thus, in VISTA the user may verify them by further interactive actions, such as weight scaling and/or changing the angles of the axes. However, though better results may sometimes appear, as shown in Fig. 4, even then users do not know how the results came about, because this adjustment process is largely stochastic and not easily repeatable.


Fig. 4. The labeled clusters in VISTA after performing random adjustments by the system

Fig. 5. Projecting the Iris data against its mean (left), and against its standard deviation (right)

We use simple statistics, such as the mean and the standard deviation of Iris, as measures to detect cluster information. Fig. 5 gives the data projections based on these measures respectively. HOV3 also reveals three data groups and, in addition, several outliers. Moreover, the user can clearly understand how the results came about, and can iteratively perform experiments with the same measures.
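As an illustrative use of the hov3 sketch given after equation (8) (the matrix name X and the plotting choices are assumptions, not the original experiment code):

% Project Iris against its column means and standard deviations as
% measure vectors (X is an assumed 150-by-4 matrix of the attributes).
p1 = hov3(X, mean(X));                 % measure: mean
p2 = hov3(X, std(X));                  % measure: standard deviation
subplot(1, 2, 1); plot(real(p1), imag(p1), '.'); title('measure: mean');
subplot(1, 2, 2); plot(real(p2), imag(p2), '.'); title('measure: std');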
4.2 Shuttle

The Shuttle dataset is much bigger than Iris, both in size and in the number of attributes. It has 10 attributes and 15,000 instances. Fig. 6 illustrates the initial Shuttle data distribution, which is the same in both VISTA and HOV3. The clustered data are illustrated in Fig. 7 after manual weight scaling of the axes in VISTA, where clusters are marked in different colours. We used the median and the covariance matrix of Shuttle as measures to detect the gaps of the Shuttle dataset against them. The detected results are shown in Fig. 8. These distributions provide the user with cluster information different from that in VISTA. On the other hand, HOV3 can repeat the exact performance of VISTA if the user records and quantifies each weight scaling step; as noted for equation (8), the HOV3 model subsumes Star Coordinates based techniques.


Fig. 6. Left: the initial shuttle data distribution in VISTA. Right: the initial shuttle data distribution in HOV3.

Fig. 7. Post adjustment of the Shuttle data with colored labels in VISTA

Fig. 8. Mapping the Shuttle dataset against its median by HOV3 (left) and against its covariance matrix (right)


The experiments we performed on the Shuttle dataset also show that HOV3 provides users with an efficient and effective method to verify their hypotheses by visualization. As a result, HOV3 can feed back a more precise visual presentation of the data distribution to users.

5 Conclusions
In this paper we have proposed a novel approach called HOV3 to assist data miners in the cluster analysis of high-dimensional datasets by visualization. The HOV3 visualization technique employs hypothesis-oriented measures to project data, and allows users to iteratively adjust the measures to optimize the resulting view of clusters. Experiments show that the HOV3 technique can improve the effectiveness of cluster analysis by visualization and provide a better, intuitive understanding of the results. HOV3 can be seen as a bridging process between qualitative analysis and quantitative analysis. It not only supports quantified domain knowledge verification and validation, but can also directly utilize rich statistical analysis tools as measures, giving data miners efficient and effective guidance towards more precise cluster information in data mining. Iteration is a commonly used method in numerical analysis for finding an optimized solution. HOV3 supports verification by quantified measures, and thus provides an opportunity to detect clusters in data mining by combining HOV3 with iterative methods. This is the future goal of our work.

Acknowledgement
We would like to thank Kewei Zhang for his valuable support on mathematics of this work. We also would like to express our sincere appreciation to Keke Chen and Ling Liu for offering their VISTA system code, which greatly accelerated our work.

References
1. Alpern B. and Carter L.: Hyperbox. Proc. of Visualization '91, San Diego, CA (1991) 133-139
2. Ankerst M., Breunig M.M., Kriegel H.-P., Sander J.: OPTICS: Ordering points to identify the clustering structure. Proc. of ACM SIGMOD Conference (1999) 49-60
3. Ankerst M. and Keim D.: Visual Data Mining and Exploration of Large Databases. 5th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'01), Freiburg, Germany, September (2001)
4. Berkhin P.: Survey of clustering data mining techniques. Technical report, Accrue Software (2002)
5. Cook D.R., Buja A., Cabrera J., and Hurley H.: Grand tour and projection pursuit. Journal of Computational and Graphical Statistics, Volume 23 (1995) 225-250
6. Chen K. and Liu L.: VISTA: Validating and Refining Clusters via Visualization. Journal of Information Visualization, Volume 3(4) (2004) 257-270


7. Chernoff H.: The Use of Faces to Represent Points in k-Dimensional Space Graphically. Journal of the American Statistical Association, Volume 68 (1973) 361-368
8. Cleveland W.S.: Visualizing Data. AT&T Bell Laboratories, Murray Hill, NJ, Hobart Press, Summit, NJ (1993)
9. Ester M., Kriegel H.-P., Sander J., Xu X.: A density-based algorithm for discovering clusters in large spatial databases with noise. 2nd International Conference on Knowledge Discovery and Data Mining (1996)
10. Fienberg S.E.: Graphical methods in statistics. American Statisticians, Volume 33 (1979) 165-178
11. Guha S., Rastogi R., Shim K.: CURE: An efficient clustering algorithm for large databases. In Proc. of ACM SIGMOD Int'l Conf. on Management of Data, ACM Press (1998) 73-84
12. Hinneburg A., Keim D.A., Wawryniuk M.: HD-Eye: Visual Clustering of High-dimensional Data. Proc. of the 19th International Conference on Data Engineering (2003) 753-755
13. Hoffman P.E. and Grinstein G.: A survey of visualizations for high-dimensional data mining. In Fayyad U., Grinstein G.G. and Wierse A. (eds.) Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann Publishers Inc. (2002) 47-82
14. Inselberg A.: Multidimensional Detective. Proc. of IEEE Information Visualization '97 (1997) 100-107
15. Jain A., Murty M.N., and Flynn P.J.: Data Clustering: A Review. ACM Computing Surveys, Volume 31(3) (1999) 264-323
16. Kandogan E.: Visualizing multi-dimensional clusters, trends, and outliers using star coordinates. Proc. of ACM SIGKDD Conference (2001) 107-116
17. Keim D.A. and Kriegel H.-P.: VisDB: Database Exploration using Multidimensional Visualization. Computer Graphics & Applications (1994) 40-49
18. Maria Cristina Ferreira de Oliveira, Haim Levkowitz: From Visual Data Exploration to Visual Data Mining: A Survey. IEEE Transactions on Visualization and Computer Graphics, Volume 9(3) (2003) 378-394
19. Pampalk E., Goebl W., and Widmer G.: Visualizing Changes in the Structure of Data for Exploratory Feature Selection. SIGKDD '03, Washington, DC, USA (2003)
20. Pickett R.M.: Visual Analyses of Texture in the Detection and Recognition of Objects. Picture Processing and Psycho-Pictorics, Lipkin B.S., Rosenfeld A. (eds.), Academic Press, New York (1970) 289-308
21. Qian Y., Zhang G., and Zhang K.: FAÇADE: A Fast and Effective Approach to the Discovery of Dense Clusters in Noisy Spatial Data. In Proc. ACM SIGMOD 2004 Conference, ACM Press (2004) 921-922
22. Ribarsky W., Katz J., Jiang F. and Holland A.: Discovery visualization using fast clustering. IEEE Computer Graphics and Applications, Volume 19 (1999) 32-39
23. Sheikholeslami G., Chatterjee S., Zhang A.: WaveCluster: A multi-resolution clustering approach for very large spatial databases. Proc. of 24th Intl. Conf. on Very Large Data Bases (1998) 428-439
24. Shneiderman B.: Inventing Discovery Tools: Combining Information Visualization with Data Mining. Discovery Science 2001, Proceedings. Lecture Notes in Computer Science, Volume 2226 (2001) 17-28
25. Zhang T., Ramakrishnan R. and Livny M.: BIRCH: An efficient data clustering method for very large databases. In Proc. of SIGMOD '96, Montreal, Canada (1996) 103-114

Proceedings of the 2007 IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007)

A Visual Approach for External Cluster Validation


Ke-Bing Zhang, Mehmet A. Orgun, Senior Member, IEEE, and Kang Zhang, Senior Member, IEEE
Abstract. Visualization can be very powerful in revealing cluster structures. However, directly using visualization techniques to verify the validity of clustering results is still a challenge, due to the fact that visual representation lacks precision in contrasting clustering results. To remedy this problem, in this paper we propose a novel approach which employs a visualization technique called HOV3 (Hypothesis Oriented Verification and Validation by Visualization) that offers a tunable measure mechanism to project clustered subsets and non-clustered subsets from a multidimensional space onto a 2D plane. By comparing the data distributions of the subsets, users not only have an intuitive visual evaluation but also a precise evaluation of the consistency of the cluster structure, by calculating the geometrical information of their data distributions.

Manuscript received October 31, 2006. K.-B. Zhang and M. A. Orgun are with the Department of Computing, Macquarie University, Sydney, NSW 2109, Australia (e-mail: kebing@ics.mq.edu.au, mehmet@ics.mq.edu.au). K. Zhang is with the Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083-0688, USA (e-mail: kzhang@utdallas.edu).

I. INTRODUCTION

The goal of clustering is to distinguish objects into partitions/clusters based on given criteria. A large number of clustering algorithms have been developed for different application purposes [8, 14, 15]. However, due to the memory limitations of computers and the extremely large sizes of databases, in practice it is infeasible to cluster entire data sets at once. Thus, applying clustering algorithms to sampled data to extract hidden patterns is a commonly used approach in data mining [5]. As a consequence of cluster analysis on sampled data, the goal of external cluster validation is to evaluate whether a well-suited cluster scheme learnt from one subset of a database is also suitable for the other subsets of the database. In real applications, achieving this task is still a challenge. This is not only due to the high computational cost of the statistical methods for assessing the robustness of cluster structures between the subsets of a large database, but also due to the non-linear time complexity of most existing clustering algorithms.

Visualization provides users an intuitive interpretation of cluster structures, and it has been shown that visualization allows for verification of clustering results [10]. However, the direct use of visualization techniques to evaluate the quality of clustering results has not attracted enough attention in the data mining community. This might be due to the fact that visual representation lacks precision in contrasting clustering results. We have previously proposed an approach called HOV3 to detect cluster structures [28]. In this paper, we discuss its projection mechanism to support external cluster validation. Our approach is based on the assumption that, when the same measure is used to project data sets with the same cluster structure, the similarity of their data distributions should be high. By comparing the distributions produced by applying the same measures to a clustered subset and to other non-clustered subsets of a database by HOV3, users can investigate the consistency of cluster structures between them both in visual form and in numerical calculation.

The rest of this paper is organized as follows. Section 2 briefly introduces the ideas of cluster validation (with more detail on external cluster validation) and visual cluster validation. A review of related work on cluster validation by visualization, and a more detailed account of HOV3, are presented in Section 3. Section 4 describes our idea of verifying the consistency of cluster structure by a distribution-matching based method in HOV3. Section 5 demonstrates the application of our approach on several well-known data sets. Finally, Section 6 summarizes the contributions of this paper.

II. BACKGROUND

A. Cluster Validation

Cluster validation is a procedure for assessing the quality of clustering results and finding a cluster strategy fit for a specific application. It aims at finding the optimal cluster scheme and interpreting the cluster patterns. In general, cluster validation approaches are classified into the following three categories [9, 15, 27]. Internal approaches assess the clustering results by applying an algorithm with different parameters on a data set and searching for the optimal solution [1]. Relative approaches evaluate a clustering structure by comparing it to other clustering schemes [8]. External approaches are based on the idea that a priori known cluster indices produced by a clustering algorithm exist, and assess the consistency of the cluster structures generated by applying the same clustering algorithm to different data sets [12].


B. External Cluster Validation

As a necessary post-processing step, external cluster validation is a procedure of hypothesis testing: given a set of class labels produced by a cluster scheme, compare it with the clustering results obtained by applying the same cluster scheme to the other partitions of a database, as shown in Fig. 1.

Fig. 1. External criteria based validation

Statistical methods for quality assessment are employed in external cluster validation, such as the Rand statistic [24], the Jaccard coefficient [7], the Fowlkes and Mallows index [21], Hubert's Γ statistic and the normalized Γ statistic [27], and the Monte Carlo method [20], to measure the similarity between the a priori modeled partitions and the clustering results of a dataset. However, achieving these tasks is time consuming when the database is large, due to the high computational cost of statistics-based methods for assessing the consistency of cluster structure between the sampled subsets. Recent surveys on cluster validation methods can be found in the literature [10, 12, 27].

C. Visual Cluster Validation

In high-dimensional spaces, traditional clustering algorithms tend to break down in terms of efficiency as well as accuracy, because data do not cluster well anymore [3]. Thus, introducing visualization techniques to explore and understand high-dimensional datasets is becoming an efficient way to combine human intelligence with the immense brute-force computation power available nowadays [23]. Visual presentations can be very powerful in revealing trends, highlighting outliers, showing clusters, and exposing gaps in data [26]. Visual cluster validation is a combination of information visualization and cluster validation techniques. In the cluster analysis process, visualization provides analysts with intuitive feedback on data distribution and supports decision-making activities.

III. RELATED WORK

A. Previous Works

A large number of clustering algorithms have been developed, but only a small number of cluster visualization

tools are available to facilitate researchers' understanding of clustering results [25]. Several efforts have been made in cluster validation with visualization [2, 4, 11, 13, 16, 18]. These techniques tend to help users make intuitive comparisons and gain a better understanding of cluster structures, but they do not focus on assessing the quality of clusters. For example, OPTICS [2] uses a density-based technique to detect cluster structures and visualizes them as Gaussian bumps, but its non-linear time complexity makes it neither suitable for dealing with very large data sets nor suitable for providing contrasts between clustering results. Kaski et al. [18] employ the Self-Organizing Map (SOM) technique to project multidimensional data sets onto a 2D space for matching visual models [17]. However, the SOM technique is based on a single projection strategy and is not powerful enough to discover all the interesting features of the original data. Huang et al. [11, 13] proposed approaches based on FastMap [5] to assist users in identifying and verifying the validity of clusters in visual form. Their techniques are good at cluster identification, but are not able to evaluate cluster quality very well. The most prominent feature of techniques based on Star Coordinates, such as VISTA [4] and HOV3 [28], is their linear time computational complexity, which makes them suitable as visual interpretation and detection tools in cluster analysis. However, the imprecise, qualitative nature of Star Coordinates and VISTA limits their use as quantitative analysis tools. In addition, VISTA adopts landmark points as representatives of a clustered subset and re-samples them to deal with cluster validation [4]. But its experience-based landmark point selection does not always handle the scalability of data very well, since well-representative landmark points selected in one subset may fail in other subsets of a database. Visualization techniques used in data mining and cluster analysis are surveyed in the literature [22, 25].

B. Star Coordinates

The HOV3 approach employed in this research was inspired by Star Coordinates [16]. For a better understanding of our work, we briefly describe it here. Star Coordinates utilizes a point on a 2D surface to represent a set of points of n-dimensional data. The values of the n-dimensional coordinates are mapped to the orthogonal coordinates X and Y, as shown in Fig. 2.

Fig. 2. Positioning a point by an 8-attribute vector in Star Coordinates [16]


The mapping from n-dimensional Star Coordinates to 2D X-Y coordinates is calculated as in formula (1):

$$p_j(x, y) = \Big( \sum_{i=1}^{n} u_{xi}(d_{ji} - \min_i),\ \sum_{i=1}^{n} u_{yi}(d_{ji} - \min_i) \Big) \qquad (1)$$

where $p_j(x, y)$ is the normalized location of $D_j = (d_{j1}, d_{j2}, \ldots, d_{jn})$, and $d_{ji}$ is the value of the jth record of a data set on the ith coordinate $C_i$ in Star Coordinates space; $u_{xi}(d_{ji} - \min_i)$ and $u_{yi}(d_{ji} - \min_i)$ are the unit-vector projections of $d_{ji}$ onto the X and Y directions; $\min_i = \min(d_{ji}, 0 \le j < m)$ and $\max_i = \max(d_{ji}, 0 \le j < m)$ are the minimum and maximum values of the ith dimension respectively; and m is the number of records in the data set.

C. HOV3 Model

The idea of HOV3 is based on hypothesis testing by visualization. It treats hypotheses as measures to reveal the difference between the hypotheses and the real performance by projecting the test data against the measures [28]. Geometrically, the difference between a matrix $D_j$ and a vector M can be represented by their inner product, $D_j \cdot M$. Let $D_j = (d_{j1}, d_{j2}, \ldots, d_{jn})$ be a data set with n attributes, and $M = (m_1, m_2, \ldots, m_n)$. The inner product of each vector $d_{ji}$ $(i = 1, \ldots, n)$ of $D_j$ with M can be seen as a mapping from an n-dimensional data set to one measure, $F: R^n \rightarrow R^2$. It is written as:

$$\langle d_{ji}, M \rangle = m_1 d_{j1} + m_2 d_{j2} + \cdots + m_n d_{jn} = \sum_{k=1}^{n} m_k d_{jk} \qquad (2)$$

In order to enlarge the data analysis space, we introduce the complex number system into our study. Let $z = x + iy$, where i is the imaginary unit. According to the Euler formula, we have $e^{ix} = \cos x + i \sin x$. Let $z_0 = e^{2\pi i/n}$; then $z_0^1, z_0^2, z_0^3, \ldots, z_0^{n-1}, z_0^n$ (with $z_0^n = 1$) divide the unit circle on the complex plane into n equal sectors. Formula (1) can then be simply written as:

$$P_j(z_0) = \sum_{k=1}^{n} \big[ (d_{jk} - \min d_k) / (\max d_k - \min d_k) \big]\, z_0^k \qquad (3)$$

where $\min d_k$ and $\max d_k$ represent the minimal and maximal values of the kth coordinate respectively. This is the case of an equally-divided circle surface. The more general form can be defined as:

$$P_j(z_k) = \sum_{k=1}^{n} \big[ (d_{jk} - \min d_k) / (\max d_k - \min d_k) \big]\, z_k \qquad (4)$$

where $z_k = e^{i\theta_k}$, $\theta_k$ is the angle between neighbouring axes, and $\sum_{k=1}^{n} \theta_k = 2\pi$. In any case equation (4) can be viewed as a mapping from $R^n$ to $C \cong R^2$. Given a non-zero measure vector m in $R^n$ and a family of vectors $P_j$, the projections of $P_j$ against m according to formulas (2) and (4) give the HOV3 model as the following equation:

$$P_j(z_k) = \sum_{k=1}^{n} \big[ (d_{jk} - \min d_k) / (\max d_k - \min d_k) \big]\, z_k\, m_k \qquad (5)$$

where $m_k$ is the kth attribute of measure m. As shown above, a hypothesis in HOV3 is a quantified measure vector. Thus HOV3 is also able to detect the consistency of cluster structures among the subsets of a database by comparing their data distributions, because the cluster validation procedure is primarily a hypothesis testing process.

D. The Axis Tuning Feature

Overlapping and ambiguities are inevitably introduced by projecting multidimensional data into 2D space. To mitigate this problem, Star Coordinates provides several visual adjustment mechanisms, such as axis scaling, rotation of axis angles, and coloring of data points [16]. We use Iris, a well-known data set in machine learning research, as an example to demonstrate the axis scaling feature of techniques based on Star Coordinates.

Fig. 3. The initial data distribution of clusters of Iris produced by k-means in VISTA.

Iris has 4 numeric attributes and 150 instances. We first applied the K-means clustering algorithm to it and obtained 3 clusters (k=3 here), and then tuned the weight value of each axis (called α-adjustment in VISTA) of Iris in VISTA [4]. Fig. 3 shows the original data distribution of Iris, which has overlapping among the clusters. A well-separated distribution of Iris, obtained by a series of axis scaling steps, is illustrated in Fig. 4. The clusters are much easier to recognize in Fig. 4 than in the original.

Fig. 4. The tuned version of the Iris data distribution in VISTA.


This axis-tuning feature is significant for our external cluster validation method based on distribution matching by HOV3. We explain it in detail next.

IV. CLUSTER VALIDATION WITH HOV3

The feature of tunable axes provides us a mechanism to quantitatively handle external cluster validity by HOV3. Our approach is based on the assumption that, when the same measure is used to project data sets with the same cluster structure, the similarity of their data distributions should be high. Based on this idea, we have implemented an approach for external cluster validation based on distribution matching by HOV3.

A. Definitions

To explain our approach precisely, we first give a few definitions below.

Definition 1: A data projection from an n-dimensional space to the 2D plane obtained by applying HOV3 to a data set Γ, as shown in formula (5), is denoted as Dp = HOV3(Γ, M), where Γ = (p1, p2, ..., pm) is an n-dimensional data set and pk (1 ≤ k ≤ m) is an instance of Γ; M = (w1t, w2t, ..., wnt) is a non-zero measure vector, and wit (1 ≤ i ≤ n) is the weight value of the ith coordinate at moment t in the Star Coordinates plane; Dp is the geometrical distribution of Γ in 2D space, Dp = (p1(x1, y1), p2(x2, y2), ..., pm(xm, ym)), where pj(xj, yj) is the location of pj in the X-Y coordinate plane.

Definition 2: Let Γ be a database of data points. A cluster C := (D, L) is a non-empty set D with a label set L, and the ith cluster is Ci = {p ∈ D, l ∈ L | ∀Cj.p: Cj.l = i, i > 0}, where l is the cluster label of p, l ∈ {-1, 0, 1, ..., k}, and k is the number of clusters. As special cases, an outlier point is an element of Γ with cluster label -1, and a non-clustered element of Γ has cluster label 0, i.e., it has not been clustered.

Definition 3: A spy subset Γs is a clustered subset of Γ produced by a clustering algorithm, where Γs = {C1, C2, ..., Ck, CE}, Ci (1 ≤ i ≤ k) is a cluster in Γs, and CE is the outlier set of Γs. A spy subset is used as a model to verify the cluster structure in the other partitions of the database Γ.

Definition 4: A subset Γt = {p ∈ Γ, l ∈ L | ∀p: l = 0, |Γt| = |Γs|} is a target subset of Γs. A target subset Γt is a non-clustered subset of Γ and has the same size as a spy subset Γs of Γ. It is used as a target to investigate the similarity of its cluster structure with that of the spy subset Γs.

Definition 5: A non-clustered point po is called an overlapping point of a cluster Ci, denoted as Ci ∝ po, iff (∃p ∈ Ci: |po − p| ≤ ε), where ε is the threshold distance given by the user.

Definition 6: The overlapping point set of a cluster Ci composes a quasi-cluster of Ci, denoted as Cqi, i.e., {po ∈ Cqi | Ci ∝ po}. All overlapping points of Ci together compose the quasi-cluster Cqi of Ci.

Definition 7: A cluster Ci is called a visually well-separated cluster when it satisfies the condition that (∀Cj ∈ Γs, i ≠ j, ∀p ∈ Ci: ¬(Cj ∝ p)). A well-separated cluster Ci in the spy subset implies that no points of Ci are within the threshold distance of any other cluster in the spy subset.

Based on the above definitions, we present the application of our approach to external cluster validation based on distribution matching by HOV3 as follows.

B. The Stages of Our Approach

The stages of the application of our approach are summarized in the following steps (a code sketch of steps 4 and 5 is given after this list):

1. Clustering. First, the user applies a clustering algorithm to a randomly selected subset Γs of the given dataset Γ.

2. Cluster Separation. The clustering result of Γs is introduced and visualized in HOV3. Then the user manually tunes the weight value of each axis to separate overlapping clusters. If one or more clusters are separated from the others visually, the weight values of the axes are recorded as a measure vector M.

3. Data Projection by HOV3. The user samples another observation with the same number of points as Γs as a target subset Γt. The clustered subset Γs (now acting as a spy subset) and its target subset Γt are projected together by HOV3 against the vector M to detect the distribution consistency between Γs and Γt.

4. The Generation of Quasi-Clusters. The user gives a threshold ε, and then, according to Definitions 5, 6 and 7, a quasi-cluster Cqi of a separated cluster Ci is computed. Then Cqi is removed from Γt, and Ci is removed from Γs. If Γs still contains clusters, we go back to step 2; otherwise we proceed to the next step.

5. The Interpretation of the Result. The overlapping rate of each cluster-and-quasi-cluster pair is calculated as ρ(Cqi, Ci) = |Cqi| / |Ci|. If the overlapping rate approaches 1, cluster Ci and its quasi-cluster Cqi have high similarity, since the size ratio of the spy subset to the target subset is 1:1. Thus the overlapping analysis is simply transformed into a linear regression analysis, i.e., the points lie around the line C = Cq.
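The following MATLAB fragment is a minimal sketch of steps 4 and 5; the function name, variable names, and the simplification of not removing matched points between iterations are our assumptions, not the exact implementation.

% A minimal sketch of steps 4-5 (quasi-cluster generation and overlapping
% rates). Assumptions: Ps and Pt are the 2D HOV3 projections (as [x y]
% rows) of the spy and target subsets under the same measure vector,
% labels holds the spy cluster index of each row of Ps, and epsilon is
% the user-given threshold.
function rates = overlap_rates(Ps, Pt, labels, epsilon)
    k = max(labels);
    rates = zeros(1, k);
    for i = 1:k
        Ci = Ps(labels == i, :);               % points of cluster C_i
        inQuasi = false(size(Pt, 1), 1);
        for j = 1:size(Pt, 1)                  % Definition 5: p_o is within
            d = sqrt(sum((Ci - repmat(Pt(j, :), size(Ci, 1), 1)).^2, 2));
            inQuasi(j) = any(d <= epsilon);    % epsilon of some p in C_i
        end
        rates(i) = sum(inQuasi) / size(Ci, 1); % rho(Cq_i, C_i) = |Cq_i|/|C_i|
    end
end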


Corresponding to the procedure mentioned above, we give the algorithm of external cluster validation based on distribution matching by HOV3 in Fig. 5.

Fig. 5. The algorithm of external cluster validation based on distribution matching in HOV3.

In Fig. 5, the procedure clusterSeparate responds to the user's axis tuning to separate the clusters in the spy subset and to gather the weight values of the axes as a measure vector; the procedure quasiClusterGeneration produces quasi-clusters in the target subset corresponding to the clusters in the spy subset.

C. Our Model

In contrast to the statistics-based external cluster validation model illustrated in Fig. 1, we exhibit our model for external cluster validation by visualization in HOV3 in Fig. 6.

Fig. 6. External cluster validation by HOV3.

Comparing these two models, we may observe that instead of using a clustering algorithm to cluster other sampled data sets, in our model we use a clustered subset of a database as a model to verify the similarity of cluster structure between the model and the other non-clustered subsets of the database. To handle the scalability of re-sampled datasets, we choose non-clustered observations with the same size as the clustered subset, and then project them together by HOV3. As a consequence, the user can easily utilize the well-separated clusters produced by scaling axes in HOV3 as a model to pick out their corresponding quasi-clusters, where the points in a quasi-cluster overlap its corresponding cluster. Also, instead of using statistical methods to assess the similarity between the two subsets, we simply compute the overlapping rate between the clusters and their quasi-clusters to explore their consistency.

V. EXAMPLES AND EXPLANATION

In this section, we present several examples to demonstrate the advantages of external cluster validation in HOV3. We have implemented our approach in MATLAB running under Windows 2000 Professional. The datasets used in the examples are obtained from the UCI machine learning website: http://www.ics.uci.edu/~mlearn/MachineLearning.html.

Fig. 7. The original data distribution of the first 5,000 data points of Shuttle in MATLAB by HOV3 (without cluster indices).

The Shuttle data set has 9 attributes and 43,500 instances. We chose the first 5,000 instances of Shuttle as a sampling data set and applied the K-means algorithm [19] to it. Then we utilized the clustered result as a spy subset. We assumed that we had found the optimal cluster number k=5 for the sampling data. The original data distributions without and with cluster indices are illustrated in the diagrams of Fig. 7 and Fig. 8 respectively. It can be seen that there exists cluster overlapping in Fig. 8. To obtain well-separated clusters, we tuned the weight of each coordinate and obtained a satisfactory version of the data distribution, as shown in Fig. 9. The weight values of the axes were recorded as the measure vector [0.80, 0.55, 0.85, 0.0, 0.40, 0.95, 0.20, 0.05, 0.459] in this case. Then we chose the second 5,000 instances of Shuttle as a target subset and projected the target subset and the spy subset together against the measure vector by HOV3.
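As an illustrative sketch of this projection step (not the authors' code), assuming an hov3 function implementing equation (5) and matrices spy and target holding the two 5,000-instance subsets:

% Project the spy and target subsets against the recorded measure vector.
% hov3, spy and target are assumed names; hov3(D, M) returns the complex
% 2D positions of the rows of D under equation (5).
M  = [0.80 0.55 0.85 0.0 0.40 0.95 0.20 0.05 0.459];
ps = hov3(spy, M);                    % clustered (spy) subset
pt = hov3(target, M);                 % non-clustered (target) subset
plot(real(ps), imag(ps), 'b.'); hold on;
plot(real(pt), imag(pt), 'r.'); hold off;
% Closely matched point clouds suggest a consistent cluster structure.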


Fig. 8. The original data distribution of the first 5,000 data points of Shuttle in MATLAB by HOV3 (with cluster indices)

Fig. 9. A well-separated version of the spy subset distribution of Shuttle

Their distributions are presented in Fig. 10, where we may observe that the two data distributions match very well. We chose the points in the enclosed area in Fig. 10 as a cluster and then obtained a quasi-cluster in the target subset corresponding to the cluster in the enclosed area. In the same way, we can find the other quasi-clusters in the target subset.

Fig. 10. The projection of the spy subset and a target subset of Shuttle by applying a measure vector

We have done the same experiment on 4 target subsets of Shuttle. The size of each quasi-cluster and its corresponding cluster are listed in Table 1, and their curves of linear regression to the line C=Cq are illustrated in Fig. 11.

TABLE I. CLUSTERS AND THEIR CORRESPONDING QUASI-CLUSTERS*

Subset     Cq1/C1             Cq2/C2             Cq3/C3             Cq4/C4               Cq5/C5
Spy        318                773                513                2254                 1142
Target 1   278/318 = 0.8742   670/773 = 0.8668   503/513 = 0.9805   2459/2254 = 1.0909   1123/1142 = 0.9834
Target 2   279/318 = 0.8773   897/773 = 1.1604   626/513 = 1.2203   2048/2254 = 0.9086   1602/1142 = 1.4028
Target 3   280/318 = 0.8805   875/773 = 1.1320   481/513 = 0.9376   2093/2254 = 0.9286   1455/1142 = 1.2741
Target 4   261/318 = 0.8208   713/773 = 0.9224   368/513 = 0.7173   2416/2254 = 1.0719   1169/1142 = 1.0264

* At the current stage we collect the quasi-clusters manually, so the Cqi here may contain redundant or mislabelled points.

Fig. 11. The curves of linear regression to the line C=Cq

It is observed that the curves match the line C=Cq well, i.e., the overlapping rates between the clusters and their quasi-clusters are high. The standard deviation is a good way to reflect the difference between two vectors, so we calculated the standard deviation of the Cqi/Ci ratios between each target subset (k=1,...,4) and the spy subset; they are 0.0826, 0.1975, 0.1491 and 0.1304 respectively. This means that the similarity of the cluster structure in the spy and target subsets is high. In summary, the experiments show that the cluster structure found in the spy subset of Shuttle also exists in the target subsets of Shuttle.
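As a sanity check, the reported standard deviations can be re-derived directly from the ratio columns of Table 1. A small sketch (our own, using NumPy's population standard deviation; minor discrepancies stem from the rounding of the tabulated ratios):

import numpy as np

# Cq_i / C_i ratios copied from Table 1, one row per target subset
ratios = {
    "Target 1": [0.8742, 0.8668, 0.9805, 1.0909, 0.9834],
    "Target 2": [0.8773, 1.1604, 1.2203, 0.9086, 1.4028],
    "Target 3": [0.8805, 1.1320, 0.9376, 0.9286, 1.2741],
    "Target 4": [0.8208, 0.9224, 0.7173, 1.0719, 1.0264],
}
for name, r in ratios.items():
    print(name, round(float(np.std(r)), 4))
# prints values close to the reported 0.0826, 0.1975, 0.1491 and 0.1304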

In these experiments, we have also measured the timing of both clustering and projection in MATLAB. The results are listed in Table 2.
TABLE 2. TIMING OF CLUSTERING AND PROJECTING

Clustering by K-means (k=5)
Subset      Amount   Time (second)
Target 1    5,000    0.532
Target 2    5,000    0.610
Target 3    5,000    0.656
Target 4    5,000    0.453

Projecting by HOV3
Subset          Size      Time (second)
Spy+Target 1    10,000    0.110
Spy+Target 2    10,000    0.110
Spy+Target 3    10,000    0.109
Spy+Target 4    10,000    0.109

Based on these measurements, it can be observed that the projection by HOV3 is much faster than the clustering process by the K-means algorithm. It is particularly effective for verifying clustering results within extremely large databases. Although the cluster separation in our approach may take some time, once the well-separated clusters are found, using a measure vector to project a huge data set is a lot more efficient than re-applying a clustering algorithm to the data set.

VI. CONCLUDING REMARKS

In this paper we have proposed a novel visual approach to assist users in verifying the validity of a cluster scheme, i.e., an approach based on distribution matching for external cluster validation by visualization. The HOV3 visualization technique has been employed in our approach; it uses measure vectors to project a data set and allows the user to iteratively adjust the measures for optimizing the resulting clusters. By comparing the data distributions of a clustered subset and non-clustered subsets projected by HOV3 with tunable measures, users can perform an intuitive visual evaluation, and also obtain a precise evaluation of the consistency of the cluster structure by performing geometrical computation on their data distributions. By comparing our approach with existing visual methods, we have observed that our method is not only efficient in performance, but also effective in real applications.

REFERENCES
[1] A. L. Abul, R. Alhajj, F. Polat and K. Barker, "Cluster Validity Analysis Using Subsampling," in Proc. of IEEE International Conference on Systems, Man, and Cybernetics, Washington DC, Oct. 2003, Vol. 2, pp. 1435-1440.
[2] M. Ankerst, M. M. Breunig, H.-P. Kriegel and J. Sander, "OPTICS: Ordering points to identify the clustering structure," in Proc. of ACM SIGMOD Conference, 1999, pp. 49-60.
[3] C. Baumgartner, C. Plant, K. Railing, H.-P. Kriegel and P. Kroger, "Subspace Selection for Clustering High-Dimensional Data," in Proc. of the Fourth IEEE International Conference on Data Mining (ICDM'04), 2004, pp. 11-18.
[4] K. Chen and L. Liu, "VISTA: Validating and Refining Clusters via Visualization," Journal of Information Visualization, Vol. 3(4), 2004, pp. 257-270.
[5] E. Clifford, Data Analysis by Resampling: Concepts and Applications, Duxbury Press, 2000.
[6] C. Faloutsos and K. Lin, "Fastmap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia data sets," in Proc. of ACM-SIGMOD, 1995, pp. 163-174.
[7] S. Jaccard, "Nouvelles recherches sur la distribution florale," Bull. Soc. Vaud. Sci. Nat., Vol. 44, 1908, pp. 223-270.
[8] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
[9] M. Halkidi, Y. Batistakis and M. Vazirgiannis, "On Clustering Validation Techniques," Journal of Intelligent Information Systems, Vol. 17(2/3), 2001, pp. 107-145.
[10] M. Halkidi, Y. Batistakis and M. Vazirgiannis, "Cluster validity methods: Part I and II," SIGMOD Record, Vol. 31, 2002.
[11] Z. Huang, D. W. Cheung and M. K. Ng, "An Empirical Study on the Visual Cluster Validation Method with Fastmap," in Proc. of DASFAA'01, Hong Kong, April 2001, pp. 84-91.
[12] J. Handl, J. Knowles and D. B. Kell, "Computational cluster validation in post-genomic data analysis," Journal of Bioinformatics, Vol. 21(15), 2005, pp. 3201-3212.
[13] Z. Huang and T. Lin, "A visual method of cluster validation with Fastmap," in Proc. of PAKDD-2000, 2000, pp. 153-164.
[14] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
[15] A. Jain, M. N. Murty and P. J. Flynn, "Data Clustering: A Review," ACM Computing Surveys, Vol. 31(3), 1999, pp. 264-323.
[16] E. Kandogan, "Visualizing multi-dimensional clusters, trends, and outliers using star coordinates," in Proc. of ACM SIGKDD Conference, 2001, pp. 107-116.
[17] T. Kohonen, Self-Organizing Maps, Springer, Berlin, second extended edition, 1997.
[18] S. Kaski, J. Sinkkonen and J. Peltonen, "Data Visualization and Analysis with Self-Organizing Maps in Learning Metrics," DaWaK 2001, LNCS 2114, 2001, pp. 162-173.
[19] J. McQueen, "Some methods for classification and analysis of multivariate observations," in Proc. of the 5th Berkeley Symposium on Mathematics, Statistics and Probability, Vol. 1, 1967, pp. 281-298.
[20] G. W. Milligan, "A Review of Monte Carlo Tests of Cluster Analysis," Multivariate Behavioral Research, Vol. 16(3), 1981, pp. 379-407.
[21] G. W. Milligan, L. M. Sokol and S. C. Soon, "The effect of cluster size, dimensionality and the number of clusters on recovery of true cluster structure," IEEE Trans. PAMI, Vol. 5(1), 1983, pp. 40-47.
[22] F. Oliveira and H. Levkowitz, "From Visual Data Exploration to Visual Data Mining: A Survey," IEEE Trans. Vis. Comput. Graph., Vol. 9(3), 2003, pp. 378-394.
[23] E. Pampalk, W. Goebl and G. Widmer, "Visualizing Changes in the Structure of Data for Exploratory Feature Selection," SIGKDD '03, August 24-27, 2003, Washington, DC, USA.
[24] W. M. Rand, "Objective Criteria for the Evaluation of Clustering Methods," J. Am. Stat. Assoc., Vol. 66, 1971, pp. 846-850.
[25] J. Seo and B. Shneiderman, "From Integrated Publication and Information Systems to Virtual Information and Knowledge Environments," Essays Dedicated to Erich J. Neuhold on the Occasion of His 65th Birthday, LNCS Vol. 3379, Springer, 2005.
[26] B. Shneiderman, "Inventing Discovery Tools: Combining Information Visualization with Data Mining," in Proc. of Discovery Science 2001, LNCS Vol. 2226, 2001, pp. 17-28.
[27] S. Theodoridis and K. Koutroubas, Pattern Recognition, Academic Press, 1999.
[28] K-B. Zhang, M. A. Orgun and K. Zhang, "HOV3: An Approach for Cluster Analysis," in Proc. of ADMA 2006, Xi'an, China, LNCS Vol. 4093, 2006, pp. 317-328.


Enhanced Visual Separation of Clusters by M-Mapping to Facilitate Cluster Analysis*


Ke-Bing Zhang1, Mehmet A. Orgun1, and Kang Zhang2

1 Department of Computing, Macquarie University, Sydney, NSW 2109, Australia
{kebing, mehmet}@ics.mq.edu.au
2 Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083-0688, USA
kzhang@utdallas.edu

Abstract. The goal of clustering in data mining is to distinguish objects into partitions/clusters based on given criteria. Visualization methods and techniques may provide users an intuitively appealing interpretation of cluster structures. Having good visually separated groups of the studied data is beneficial for detecting cluster information as well as refining the membership formation of clusters. In this paper, we propose a novel visual approach called M-mapping, based on the projection technique of HOV3 to achieve the separation of cluster structures. With M-mapping, users can explore visual cluster clues intuitively and validate clusters effectively by matching the geometrical distributions of clustered and non-clustered subsets produced in HOV3. Keywords: Cluster Analysis, Visual Separability, Visualization.

1 Introduction
Cluster analysis is an iterative process of clustering and cluster verification by the user, facilitated by clustering algorithms, cluster validation methods, visualization and domain knowledge about the databases. Applying clustering algorithms to detect grouping information in real world applications is still a challenge, primarily due to the inefficiency of most existing clustering algorithms in coping with arbitrarily shaped distributions of data in extremely large and high-dimensional databases. Moreover, the very high computational cost of statistics-based cluster validation methods is another obstacle to effective cluster analysis. Visual presentations can be very powerful in revealing trends, highlighting outliers, showing clusters, and exposing gaps in data [18]. Nowadays, as an indispensable technique, visualization is involved in almost every step of cluster analysis. However, due to the impreciseness of visualization, it is often used as an observation and rendering tool in cluster analysis, but it has rarely been employed directly in the precise comparison of clustering results. HOV3 is a visualization technique based on hypothesis testing [20]. In HOV3, each hypothesis is quantified as a measure vector, which is used to project a data set for
* The datasets used in this paper are available from http://www.ics.uci.edu/~mlearn/Machine-Learning.html



investigating cluster distribution. The projection of HOV3 has also been proposed to deal with cluster validation [21]. In this paper, in order to gain an enhanced visual separation of groups, we develop the projection of HOV3 into a technique which we call M-mapping, i.e., projecting a data set against a series of measure vectors. We structure the rest of this paper as follows. Section 2 briefly introduces the current issues of cluster analysis; it also reviews the efforts that have been made in visual cluster analysis and discusses the projection of HOV3 as the background of this research. Section 3 discusses the M-mapping model and several of its important features. Section 4 demonstrates the effectiveness of the enhanced separation feature of M-mapping on cluster exploration and validation. Finally, Section 5 concludes the paper with a brief summary of our contributions.

2 Background
2.1 Cluster Analysis

Cluster analysis includes two processes: clustering and cluster validation. Clustering aims to distinguish objects into partitions, called clusters, by given criteria. The objects in the same cluster have a higher similarity with each other than with those in other clusters. Many clustering algorithms have been proposed for different purposes in data mining [8, 11]. Cluster validation is the procedure of assessing the quality of clustering results and finding a cluster scheme that fits a specific application at hand. Since different clustering results may be obtained by applying different clustering algorithms to the same data set, or even by applying a clustering algorithm with different parameters to the same data set, cluster validation plays a critical role in cluster analysis. However, in practice, it may not always be possible to cluster huge datasets successfully using clustering algorithms. As Abul et al. pointed out, "in high dimensional space, traditional clustering algorithms tend to break down in terms of efficiency as well as accuracy because data do not cluster well anymore" [1]. In addition, the very high computational cost of statistics-based cluster validation methods directly impacts the efficiency of cluster validation.

2.2 Visual Cluster Analysis

The user's correct estimation of the cluster number is important for choosing the parameters of clustering algorithms in the pre-processing stage of clustering, as well as for assessing the quality of clustering results in the post-processing stage. The success of these tasks relies heavily on the user's visual perception of the distribution of a given data set. It has been observed that the visualization of a data set is crucial in the verification of clustering results [6]. Visual cluster analysis enhances cluster analysis by combining it with visualization techniques. Visualization techniques are typically employed as an observational mechanism to understand the studied data. Therefore, instead of contrasting the quality of clustering results, most of the visualization techniques used in cluster analysis focus on assisting users in gaining an easy and intuitive understanding of the cluster structure in the data. Visualization has been shown to be an intuitive and effective method for the exploration and verification of cluster analysis.


Several efforts have been made in the area of cluster analysis with visualization. OPTICS [2] uses a density-based technique to detect cluster structures and visualizes clusters as Gaussian bumps, but its non-linear time complexity makes it unsuitable for dealing with very large data sets or for providing contrast between clustering results. H-BLOB [16] visualizes clusters as blobs in a 3D hierarchical structure. It is an intuitive cluster rendering technique, but its 3D, two-stage expression limits its capability for the interactive investigation of cluster structures. Kaski et al. [15] use Self-Organizing Maps (SOM) to project high-dimensional data sets onto 2D space for matching visual models [14]. However, the SOM technique is based on a single projection strategy and is not powerful enough to discover all the interesting features from the original data. Huang et al. [7, 10] proposed several approaches based on FastMap [5] to assist users in identifying and verifying the validity of clusters in visual form. Their techniques are good at cluster identification, but not able to deal with the evaluation of cluster quality very well. Moreover, the techniques discussed here are not very well suited to the interactive investigation of data distributions of high-dimensional data sets. A recent survey of visualization techniques in cluster analysis can be found in the literature [19].

Interactive visualization is useful for the user to import his/her domain knowledge into the cluster exploration stage through the observation of data distribution changes. Star Coordinates favors doing so with its interactive adjustment features [21]. The M-mapping approach discussed in this paper has been developed based on Star Coordinates and the projection of HOV3 [20]. For a better understanding of the work in this paper, we briefly describe them next.

2.3 The Star Coordinates Technique

Star Coordinates is a technique for mapping high-dimensional data onto a 2D space. It plots a 2D plane into n equal sectors with n coordinate axes, where each axis represents a dimension and all axes share the initial point at the centre of a circle on the 2D space. First, data on each dimension are normalized into the [0, 1] or [-1, 1] interval. Then the values of all axes are mapped to orthogonal X-Y coordinates which share the initial point with Star Coordinates on the 2D space. Thus, an n-dimensional data item is expressed as a point in the 2D plane. The most prominent feature of Star Coordinates and its extensions such as VISTA [4] and HOV3 [20] is that their computational complexity is only linear in time (that is, every n-dimensional data item is processed only once). Therefore they are very suitable as visual interpretation and exploration tools in cluster analysis. However, mapping high-dimensional data onto 2D space inevitably introduces overlapping and bias. To mitigate the problem, Star Coordinates based techniques provide some visual adjustment mechanisms, such as axis scaling (called α-adjustment in VISTA), footprints, rotating axis angles and coloring data points [12].

Axis Scaling

The purpose of axis scaling in Star Coordinates is to adjust the weight value of each axis dynamically and observe the changes to the data distribution under the newly weighted axes. We use Iris, a well-known data set in the machine learning area, as an example to demonstrate how axis scaling works.


Fig. 1. The initial data distribution of clusters of Iris produced by k-means in VISTA

Fig. 2. The tuned version of the Iris data distribution in VISTA

Iris has 4 numeric attributes and 150 instances. We first applied the K-means clustering algorithm to it and obtained 3 clusters (with k=3), and then tuned the weight value of each axis of Iris in VISTA [4]. The diagram in Fig. 1 shows the original data distribution of Iris, which has overlapping among the clusters. A well-separated cluster distribution of Iris, obtained by a series of axis-scaling operations, is illustrated in Fig. 2; the clusters are much easier to recognize than in the original distribution.

Footprints

To observe the effect of the changes to the data points under axis scaling, Star Coordinates provides the footprints function to reveal the trace of each point [12]. We use another data set, auto-mpg, to demonstrate this feature. The data set auto-mpg has 8 attributes and 397 items. Fig. 3 presents the footprints of axis scaling of the attributes weight and mpg, where we may find some points with longer traces and some with shorter footprints. However, the imprecise and random adjustments of Star Coordinates and VISTA limit their use as quantitative analysis tools.

2.4 The HOV3 Model

The HOV3 model improves the Star Coordinates model. Geometrically, the difference between a matrix Dj (a data set) and a vector M (a measure) can be represented by their inner product, Dj·M. Based on this idea, Zhang et al. proposed a projection technique called HOV3, which generalizes the weight values of the axes in Star Coordinates as a hypothesis (measure vector) to reveal the differences between the hypotheses and the real performance [20].

Fig. 3. Footprints of axis scaling of weight and mpg attributes in Star Coordinates [12]


The Star Coordinates model can be simply described by the Euler formula $e^{ix} = \cos x + i\sin x$, where $z = x + iy$ and $i$ is the imaginary unit. Let $z_0 = e^{2\pi i/n}$; we see that $z_0^1, z_0^2, z_0^3, \ldots, z_0^{n-1}, z_0^n$ (with $z_0^n = 1$) divide the unit circle on the complex plane into n equal sectors. Thus the Star Coordinates mapping can be simply written as:

$$P_j(z_0) = \sum_{k=1}^{n}\left[\frac{d_{jk} - \min d_k}{\max d_k - \min d_k}\cdot z_0^k\right] \qquad (1)$$

where $\min d_k$ and $\max d_k$ represent the minimal and maximal values of the kth coordinate respectively. In any case, equation (1) can be viewed as a mapping from $\mathbb{R}^n$ to $\mathbb{C}$ (the 2D complex plane). Then, given a non-zero measure vector M in $\mathbb{R}^n$ and a family of vectors $P_j$, the projection of $P_j$ against M according to formula (1), in the HOV3 model [20], is given as:

$$P_j(z_0) = \sum_{k=1}^{n}\left[\frac{d_{jk} - \min d_k}{\max d_k - \min d_k}\cdot z_0^k \cdot m_k\right] \qquad (2)$$

where $m_k$ is the kth attribute of the measure vector M.

As shown above, a hypothesis in HOV3 is a quantified measure vector. HOV3 not only inherits the axis scaling feature of Star Coordinates, but also generalizes the axis scaling as a quantified measurement. The processes of cluster detection and cluster validation can be tackled with HOV3 based on its quantified measurement feature [20, 21]. To improve the efficiency and effectiveness of HOV3, we develop the projection technique of HOV3 further with M-mapping.

3 M-Mapping
3.1 The M-Mapping Model

It is not easy to synthesize hypotheses into one vector. In practice, rather than using a single measure to implement a hypothesis test, it is more feasible to investigate the synthetic response of applying several hypotheses/predictions together to a data set. To simplify the discussion of the M-mapping model, we give a definition first.
Definition 1 (Poly-multiply vectors to a matrix). The inner product of multiplying a series of non-zero measure vectors $M_1, M_2, \ldots, M_s$ to a matrix A is denoted as $A * \prod_{i=1}^{s} M_i = A \cdot M_1 \cdot M_2 \cdots M_s$.

A simple notation of the HOV3 projection, $p = C(P, M)$, was given by Zhang et al. [20], where P is a data set and p is the data distribution of P obtained by applying a measure vector M. The projection of M-mapping is then denoted as $p = C(P, \prod_{i=1}^{s} M_i)$. Based on equation (2), M-mapping is formulated as follows:

$$P_j(z_0) = \sum_{k=1}^{n}\left[\frac{d_{jk} - \min d_k}{\max d_k - \min d_k}\cdot z_0^k \cdot \prod_{i=1}^{s} m_{ik}\right] \qquad (3)$$


where $m_{ik}$ is the kth attribute (dimension) of the ith measure vector $M_i$, and $s \geq 1$. When s = 1, equation (3) reduces to equation (2) (the HOV3 model). We may observe that the single multiplication by $m_k$ in formula (2) is replaced by the poly-multiplication $\prod_{i=1}^{s} m_{ik}$ in equation (3). Equation (3) is more general and also closer to the real procedure of cluster detection: it introduces several aspects of domain knowledge together into the process of cluster detection by HOV3. Geometrically, the data projection by M-mapping is the synthesized effect of applying each measure vector by HOV3. In addition, applying M-mapping to a dataset with the same measure vector can enhance the separation of grouped data points under certain conditions. We describe this enhanced separation feature of M-mapping below.
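Since the measure vectors enter equation (3) only through the per-axis product of their attributes, an M-mapping is computationally just an HOV3 projection against the element-wise product of the measures. A minimal sketch (our Python/NumPy rendering of equations (2) and (3); the function names hov3 and m_mapping are ours):

import numpy as np

def hov3(data, measure):
    # Equation (2): normalized columns weighted by m_k, summed along
    # the unit-circle directions z0^k.
    n = data.shape[1]
    z = np.exp(2j * np.pi * np.arange(1, n + 1) / n)
    lo, hi = data.min(axis=0), data.max(axis=0)
    x = (data - lo) / np.where(hi > lo, hi - lo, 1)
    return (x * np.asarray(measure) * z).sum(axis=1)

def m_mapping(data, *measures):
    # Equation (3): project against the element-wise product of the
    # measure vectors M1, M2, ..., Ms.
    m = np.prod(np.vstack(measures), axis=0)
    return hov3(data, m)

# e.g. m_mapping(wine, M, M) corresponds to the projection p2 = C(Wine, M*M)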
3.2 The Features of M-Mapping

For the explanation of the geometrical meaning of the M-mapping projection, we use the real number system. According to equation (2), the general form of the distance (i.e., the weighted Minkowski distance) between two points a and b in the HOV3 plane can be represented as:

$$\delta(a,b,m) = \sqrt[q]{\,\sum_{k=1}^{n} |m_k(a_k - b_k)|^{q}\,} \qquad (q > 0) \qquad (4)$$

If q = 1, δ is the Manhattan (city block) distance; if q = 2, δ is the Euclidean distance. To simplify the discussion of our idea, we adopt the Manhattan metric. Note that there exists an equivalent mapping (bijection) of distance calculation between the Manhattan and Euclidean metrics [13]. For example, if the distance δab between points a and b is longer than the distance δa′b′ between points a′ and b′ in the Manhattan metric, it is also true in the Euclidean metric, and vice versa. In Fig. 4, the orthogonal lines represent the Manhattan distances and the diagonal lines the Euclidean distances (red for ab and blue for a′b′) respectively. The Manhattan distance between points a and b is then calculated as in formula (5).

Fig. 4. The distance representation in the Manhattan and Euclidean metrics

$$\delta(a,b,m) = \sum_{k=1}^{n} |m_k(a_k - b_k)| \qquad (5)$$

According to equations (2), (3) and (5), we can present the distance of M-mapping in the Manhattan metric as follows:

$$\delta\!\left(a,b,\prod_{i=1}^{s} m_i\right) = \sum_{k=1}^{n}\left|\,\prod_{i=1}^{s} m_{ik}\,(a_k - b_k)\right| \qquad (6)$$


Definition 2 (The distance representation of M-mapping). The distance between two data points a and b projected by M-mapping is denoted as $\prod_{i=1}^{s} M_i\,\delta_{ab}$. If the measure vectors in an M-mapping are all the same, $\prod_{i=1}^{s} M_i\,\delta_{ab}$ can be simply written as $M^s\delta_{ab}$; if each attribute of M is 1 (the no-measure case), the distance between points a and b is denoted as $\delta_{ab}$. For example, the distance between two points a and b projected by M-mapping with the same two measures can be represented as $M^2\delta_{ab}$. Thus the HOV3 projection of a and b can be written as $M\delta_{ab}$.
Contracting Feature

From equations (5) and (6), we may observe that the application of M-mapping to a data set is a contracting process on the data distribution of the data set. This is because, when $|m_k| < 1$ and $\delta_{ab} \neq 0$, we have

$$\sum_{k=1}^{n} |m_k(a_k - b_k)| < \sum_{k=1}^{n} |a_k - b_k|,$$

i.e., $\delta(a,b,m) < \delta_{ab}$. In the same way, we have $\delta(a,b,m^2) < \delta(a,b,m)$ and $\delta(a,b,m^{n+1}) < \delta(a,b,m^n)$ for all $n \in \mathbb{N}$. Hinneburg et al. proved that a contracting projection of a data set strictly preserves the density of the dataset [9]. Chen and Liu also proved that in the Star Coordinates 2D space, data points that are originally close also remain relatively close in the newly produced data distribution under axis scaling [4]. Thus, the relative geometrical positions of data points within a cluster would become closer by applying M-mapping to the data set.
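A quick numeric illustration of the contracting property (our own example, not from the paper): with every |m_k| < 1, each further application of the measure strictly shrinks the Manhattan distance between two points.

import numpy as np

a = np.array([0.9, 0.2, 0.7])
b = np.array([0.1, 0.6, 0.3])
m = np.array([0.8, 0.5, 0.9])            # all |m_k| < 1

d = np.abs(a - b).sum()                  # delta_ab (no measure)
for s in (1, 2, 3):
    ds = np.abs(m**s * (a - b)).sum()    # delta(a, b, m^s)
    assert ds < d                        # contracts at every step
    d = ds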
Enhanced Separation Feature

If the measure vector is changed from M to M′ and $|M\delta_{ab} - M\delta_{ac}| < |M'\delta_{ab} - M'\delta_{ac}|$, then

$$\frac{|M'\delta_{ab} - M'\delta_{ac}| - |M'^{2}\delta_{ab} - M'^{2}\delta_{ac}|}{|M'\delta_{ab} - M'\delta_{ac}|} > \frac{|M\delta_{ab} - M\delta_{ac}| - |M'\delta_{ab} - M'\delta_{ac}|}{|M\delta_{ab} - M\delta_{ac}|}$$

Due to space limitations, the detailed proof of this property can be found in [22]. This inequality shows that if the difference between the distance δab and the distance δac is increased by scaling the axes from M to M′ (which can be observed from the footprints of points a, b and c, as shown in Fig. 3), then after applying M-mapping to a, b and c, the variation rate of the distances δab and δac is enhanced further. In other words, if it is observed that several groups of data points can be roughly separated (with ambiguous points remaining between the groups) by projecting a data set against a measure vector in HOV3, then applying M-mapping with that measure vector to the data set would lead to the groups being more contracted, i.e., there will be a good separation of the groups. These two features of M-mapping are significant for identifying the membership formation of clusters in the process of cluster exploration and cluster verification. This is because the contracting feature of M-mapping keeps the data points within a


cluster relatively closer, i.e., grouping information is preserved. On the other hand, the enhanced separation feature of M-mapping can extend the distance of far data points relatively further.
Improving Accuracy of Data Point Selection with Zooming

External cluster validation [19] refers to the comparison of previously produced cluster patterns with newly produced cluster patterns in order to evaluate the genuine cluster structure of a data set. However, due to the very high computational cost of statistical methods for assessing the consistency of cluster structures between the subsets of a large database, achieving this task is still a challenge. Let us assume that if two sampled subsets of a dataset have similar data distributions, and a measure vector is applied to both of them by HOV3, the similarity of their data distributions should still be high. Based on this assumption, Zhang et al. [21] proposed a visual external cluster validation approach with HOV3. Their approach uses a clustered subset from a database and a same-sized unclustered subset as an observation. It then applies several measure vectors that can separate the clusters in the clustered subset. Each cluster and the data points it geometrically covers (called a quasi-cluster in their approach), based on a given threshold distance, are then selected. Finally, the overlapping rate of each cluster/quasi-cluster pair is calculated; if the overlapping rate approaches 1, the two subsets have a similar cluster distribution. Compared to statistics-based external validation methods, their method is not only visually intuitive, but also more effective in real applications [21]. However, separating a cluster from many overlapping points manually is often time consuming. We claim that the enhanced separation feature of HOV3 can provide improvements not only in efficiency but also in accuracy in dealing with external cluster validation by the proposed approach [21].

As mentioned above, the application of M-mapping to a data set is a contracting process. In order to avoid the contracting effect causing pseudo data points to be selected, we introduce a zooming feature into M-mapping. According to equation (2), zooming in HOV3 can be understood as projecting a data set with a vector whose attribute values are all the same, i.e., each $m_k$ in equation (2) has the same value. We choose $\min(m_k)^{-1}$ as the zooming vector value, where $\min(m_k)$ is the non-zero minimal value of the $m_k$. Thus the scale of the patterns in HOV3 is amplified by applying the combination of M-mapping and zooming. This combination is formalized in equation (7):

$$P_j(z_0) = \sum_{k=1}^{n}\left[\frac{d_{jk} - \min d_k}{\max d_k - \min d_k}\cdot z_0^k \cdot \left(m_k^{\,s}\cdot \min(m_k)^{-s}\right)\right] \qquad (7)$$

Because $|m_k| < 1$ and $\min(m_k)$ is the non-zero minimal value of the $m_k$ in a measure vector, we have $|m_k^{\,s}\cdot\min(m_k)^{-s}| > 1$ whenever $|m_k| > |\min(m_k)|$. With this effect, M-mapping with zooming enlarges the scale of the data distributions projected by HOV3. With the same threshold distance for data selection as proposed by Zhang et al. [21], M-mapping with zooming can improve the precision of the selection of geometrically covered data points.
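The zooming vector itself is trivial to construct; a sketch under the same conventions as the code above (the helper name zoom_vector is ours): every entry is the same value min(|m_k|) raised to a negative power, so multiplying it in enlarges the whole picture uniformly without changing its shape.

import numpy as np

def zoom_vector(measure, power=1):
    # V: every entry is min(|m_k|)^(-power), min taken over non-zero m_k
    m = np.asarray(measure, dtype=float)
    mz = np.abs(m[m != 0]).min()
    return np.full(m.shape, (1.0 / mz) ** power)

M = np.array([0.5, 1, 0, 0.95, 0.3, 1, 0.5, 0.5,
              0.8, 0.75, 0.25, 0.55, 0.45, 0.75])
V = zoom_vector(M)   # all entries 4.0, since the minimal non-zero |m_k| is 0.25
# the zoomed projection of Section 4.2 is then p = C(Housing, M*M*V)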


4 Examples and Explanation


In this section we present several examples to demonstrate the efficiency and the effectiveness of M-mapping in cluster analysis.
4.1 Cluster Exploration with M-Mapping

Choosing the appropriate cluster number of an unknown data set is meaningful in the pre-clustering stage. The enhanced separation feature of M-mapping is advantageous in the identification of the cluster number in this stage. We demonstrate this advantage of M-mapping by the following examples.
Wine Data

The Wine data set (Wine for short) has 13 attributes and 178 records. The original data distribution of Wine in 2D space is shown in Fig. 5a, where no grouping information can be observed. We then tuned the axes' weight values randomly and obtained a roughly separated data distribution of Wine (which looks like two groups), as demonstrated in Fig. 5b; we recorded the axes' values of Wine as M = [-0.44458, 0.028484, -0.23029, -0.020356, -0.087636, 0.015982, 0.17392, 0.21283, 0.11461, 0.099163, -0.19181, 0.34533, 0.27328]. We then employed M2 (inner dot product) as a measure vector and applied it to Wine. The newly projected distribution p2 of Wine is presented in Fig. 5c, in which it has become much easier to identify 3 groups. Thus, we colored the Wine data with the cluster indices produced by the K-means clustering algorithm with k=3. The colored data distribution p2 of Wine is illustrated in Fig. 5d.

(a) The original data distribution of Wine (no measure case)
(b) The data distribution of Wine after tuning the axes' weight values randomly
(c) p2 = C(Wine, M*M)
(d) p2 colored by the cluster indices of K-means (k=3)

Fig. 5. Distributions of Wine data produced by HOV3 in MATLAB

To demonstrate the effectiveness of the enhanced separation feature of M-mapping, we contrast the statistics of p2 of Wine (clustered by their distribution, as shown in Fig. 5c) with the clustering result of Wine by K-means (k=3). The result is shown in Table 1, where the left side (CH) gives the statistics of the clustering result of Wine by M-mapping, and the right side (CK) gives the clustering result by K-means (k=3). By comparing the statistics of these two clustering results, we may observe that the quality of the clustering result based on the distribution of p2 of Wine is slightly better than that produced by K-means, according to the variances of the clustering results. Observing the colored data distribution in Fig. 5d carefully, we may find that there is a green point grouped into the brown group by K-means.
Table 1. The statistics of the clusters in Wine data produced by M-mapping in HOV3 and by K-means

CH  Items  %       Radius   Variance  MaxDis
1   48     26.966  102.286  0.125     102.523
2   71     39.888  97.221   0.182     97.455
3   59     33.146  108.289  0.124     108.497

CK  Items  %       Radius   Variance  MaxDis
1   48     27.528  102.008  0.126     102.242
2   71     39.326  97.344   0.184     97.579
3   59     33.146  108.289  0.124     108.497

By analyzing the data of these 3 groups, we have found that group 1 contains 48 items with Alcohol value 3; group 2 has 71 instances with Alcohol value 2; and group 3 includes 59 records with Alcohol value 1.
Boston Housing Data

The Boston Housing data set (Housing for short) has 14 attributes and 506 instances. The original data distribution of Housing is given in Fig. 6a. As in the above example, based on observation and axis scaling we obtained a roughly separated data distribution of Housing, as demonstrated in Fig. 6b; we fixed the weight values of the axes as M = [0.5, 1, 0, 0.95, 0.3, 1, 0.5, 0.5, 0.8, 0.75, 0.25, 0.55, 0.45, 0.75]. By comparing the diagrams in Fig. 6a and Fig. 6b, we can see that the data points in Fig. 6b are contracted into possibly 3 or 4 groups. Then M-mapping was applied to Housing: Fig. 6c and Fig. 6d show the results of M-mapping with M.*M and M.*M.*M respectively. It is much easier to observe the grouping structure in Fig. 6c and Fig. 6d, where we can identify the group members easily. We believe that with domain experts involved in the process, the M-mapping approach can perform even better in real world applications of cluster analysis.


(a) The original data distribution of Housing
(b) p1 = C(Housing, M)
(c) p2 = C(Housing, M*M)
(d) p3 = C(Housing, M*M*M)

Fig. 6. The enhanced separation of the data set Housing

4.2 Cluster Validation with M-Mapping

We may observe that the data distributions in Fig. 6c and Fig. 6d are more contracted than the data distributions in Fig. 6a and Fig. 6b. To ensure that this contracting

(a) p2 = C(Housing, M*M)
(b) p2 = C(Housing, M*M*V)

Fig. 7. The distributions produced by M-mapping and by M-mapping with zooming


process does not affect data selection, we introduce zooming into the M-mapping process. For example, in the last example, the non-zero minimal value of the measure vector M is 0.25. We then use V = [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4] (4 = 1/0.25) as the zooming vector. We discuss the application of M-mapping with zooming below. It can be observed that the shape of the patterns in Fig. 7a is exactly the same as that in Fig. 7b, but the scale in Fig. 7b is enlarged. Thus the effect of combining M-mapping and zooming improves the accuracy of data selection in external cluster validation by HOV3 [21].

5 Conclusions
In this paper we have proposed a visual approach, called M-mapping, to aid users in enhancing the separation and the contraction of data groups/clusters in cluster detection and cluster validation. We have also shown that, based on the observation of data footprints, users can trace grouping clues; then, by applying the M-mapping technique to the data set, they can enhance the separation and the contraction of the potential data groups, and therefore find useful grouping information effectively. With the advantage of the enhanced separation and contraction features of M-mapping, users can identify the cluster number efficiently in the pre-processing stage of clustering, and they can also verify the membership formation of data points among the clusters effectively in the post-processing stage of clustering by M-mapping with zooming.

References
1. Abul, A.L., Alhajj, R., Polat, F., Barker, K.: Cluster Validity Analysis Using Subsampling. In: Proc. of IEEE International Conference on Systems, Man, and Cybernetics, Washington DC, vol. 2, pp. 1435-1440 (October 2003)
2. Ankerst, M., Breunig, M.M., Kriegel, H.-P., Sander, J.: OPTICS: Ordering points to identify the clustering structure. In: Proc. of ACM SIGMOD Conference, pp. 49-60 (1999)
3. Baumgartner, C., Plant, C., Railing, K., Kriegel, H.-P., Kroger, P.: Subspace Selection for Clustering High-Dimensional Data. In: Perner, P. (ed.) ICDM 2004. LNCS (LNAI), vol. 3275, pp. 11-18. Springer, Heidelberg (2004)
4. Chen, K., Liu, L.: VISTA: Validating and Refining Clusters via Visualization. Journal of Information Visualization 3(4), 257-270 (2004)
5. Faloutsos, C., Lin, K.: Fastmap: a fast algorithm for indexing, data mining and visualization of traditional and multimedia data sets. In: Proc. of ACM-SIGMOD, pp. 163-174 (1995)
6. Halkidi, M., Batistakis, Y., Vazirgiannis, M.: Cluster validity methods: Part I and II. SIGMOD Record 31 (2002)
7. Huang, Z., Cheung, D.W., Ng, M.K.: An Empirical Study on the Visual Cluster Validation Method with Fastmap. In: Proc. of DASFAA'01, pp. 84-91 (2001)
8. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco (2001)
9. Hinneburg, A., Keim, D.A., Wawryniuk, M.: HD-Eye: Visual mining of high-dimensional data. IEEE Computer Graphics & Applications 19(5), 22-31 (1999)


10. Huang, Z., Lin, T.: A visual method of cluster validation with Fastmap. In: Proc. of PAKDD-2000, pp. 153-164 (2000)
11. Jain, A., Murty, M.N., Flynn, P.J.: Data Clustering: A Review. ACM Computing Surveys 31(3), 264-323 (1999)
12. Kandogan, E.: Visualizing multi-dimensional clusters, trends, and outliers using star coordinates. In: Proc. of ACM SIGKDD Conference, pp. 107-116 (2001)
13. Kominek, J., Black, A.W.: Measuring Unsupervised Acoustic Clustering through Phoneme Pair Merge-and-Split Tests. In: 9th European Conference on Speech Communication and Technology (Interspeech 2005), Lisbon, Portugal, pp. 689-692 (2005)
14. Kohonen, T.: Self-Organizing Maps, 2nd edn. Springer, Berlin (1997)
15. Kaski, S., Sinkkonen, J., Peltonen, J.: Data Visualization and Analysis with Self-Organizing Maps in Learning Metrics. In: Kambayashi, Y., Winiwarter, W., Arikawa, M. (eds.) DaWaK 2001. LNCS, vol. 2114, pp. 162-173. Springer, Heidelberg (2001)
16. Sprenger, T.C., Brunella, R., Gross, M.H.: H-BLOB: A Hierarchical Visual Clustering Method Using Implicit Surfaces. In: Proc. of the Conference on Visualization 2000, pp. 61-68. IEEE Computer Society Press, Los Alamitos (2000)
17. Shneiderman, B.: Inventing Discovery Tools: Combining Information Visualization with Data Mining. In: Jantke, K.P., Shinohara, A. (eds.) DS 2001. LNCS (LNAI), vol. 2226, pp. 17-28. Springer, Heidelberg (2001)
18. Seo, J., Shneiderman, B.: In: Hemmje, M., Niederee, C., Risse, T. (eds.) From Integrated Publication and Information Systems to Information and Knowledge Environments. LNCS, vol. 3379. Springer, Heidelberg (2005)
19. Vilalta, R., Stepinski, T., Achari, M.: An Efficient Approach to External Cluster Assessment with an Application to Martian Topography. Technical Report UH-CS-05-08, Department of Computer Science, University of Houston (2005)
20. Zhang, K-B., Orgun, M.A., Zhang, K.: HOV3: An Approach for Cluster Analysis. In: Li, X., Zaiane, O.R., Li, Z. (eds.) ADMA 2006. LNCS (LNAI), vol. 4093, pp. 317-328. Springer, Heidelberg (2006)
21. Zhang, K-B., Orgun, M.A., Zhang, K.: A Visual Approach for External Cluster Validation. In: Proc. of the First IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007), Honolulu, Hawaii, pp. 577-582. IEEE Press (2007)
22. Zhang, K-B., Orgun, M.A., Zhang, K.: A Prediction-based Visual Approach for Cluster Exploration and Cluster Validation by HOV3. In: Proc. of ECML/PKDD 2007, Warsaw, Poland, September 17-21, pp. 336-349 (2007)

A Prediction-Based Visual Approach for Cluster Exploration and Cluster Validation by HOV3*
Ke-Bing Zhang1, Mehmet A. Orgun1, and Kang Zhang2
1 Department of Computing, ICS, Macquarie University, Sydney, NSW 2109, Australia
{kebing,mehmet}@ics.mq.edu.au
2 Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083-0688, USA
kzhang@utdallas.edu

Abstract. Predictive knowledge discovery is an important knowledge acquisition method. It is also used in the clustering process of data mining. Visualization is very helpful for high dimensional data analysis, but it is not precise, which limits its usability in quantitative cluster analysis. In this paper, we adopt a visual technique called HOV3 to explore and verify clustering results with quantified measurements. With the quantified contrast between grouped data distributions produced by HOV3, users can detect clusters and verify their validity efficiently. Keywords: predictive knowledge discovery, visualization, cluster analysis.

1 Introduction
Predictive knowledge discovery utilizes existing knowledge to deduce, reason about and establish predictions, and to verify the validity of those predictions. Through the validation process, the knowledge may be revised and enriched with new knowledge [20]. The methodology of predictive knowledge discovery is also used in the clustering process [3]. Clustering is regarded as an unsupervised learning process for finding group patterns within datasets, and it is a widely applied technique in data mining. To serve different application purposes, a large number of clustering algorithms have been developed [3, 9]. However, most existing clustering algorithms cannot handle arbitrarily shaped data distributions within extremely large and high-dimensional databases very well. The very high computational cost of statistics-based cluster validation methods also prevents clustering algorithms from being used effectively in practice. Visualization is very powerful and effective in revealing trends, highlighting outliers, showing clusters, and exposing gaps in high-dimensional data analysis [19]. Many studies have been proposed to visualize the cluster structure of databases [15, 19]. However, most of them focus on information rendering, rather than on investigating how data behavior changes with variations of the parameters of the algorithms.
* The datasets used in this paper are available from http://www.ics.uci.edu/~mlearn/MachineLearning.html.



In this paper we adopt HOV3 (Hypothesis Oriented Verification and Validation by Visualization) to project high dimensional data onto a 2D complex space [22]. By applying predictive measures (quantified domain knowledge) to the studied data, users can detect grouping information precisely, and employ the clustered patterns as predictive classes to verify the consistency between the clustered subset and unclustered subsets. The rest of this paper is organized as follows. Section 2 briefly introduces the current issues of cluster analysis, and the HOV3 technique as the background of this research. Section 3 presents our prediction-based visual cluster analysis approach with examples to demonstrate its effectiveness on cluster exploration and cluster validation. A short review of the related work in visual cluster analysis is provided in Section 4. Finally, Section 5 summarizes the contributions of this paper.

2 Background
The approach reported in this paper has been developed based on the projection of HOV3 [22], which was inspired by the Star Coordinates technique. For a better understanding of our work, we briefly describe Star Coordinates and HOV3.

2.1 Visual Cluster Analysis

Cluster analysis includes two major aspects: clustering and cluster validation. Clustering aims at identifying objects into groups, named clusters, where the similarity of objects is high within clusters and low between clusters. Hundreds of clustering algorithms have been proposed [3, 9]. Since there are no general-purpose clustering algorithms that fit all kinds of applications, the evaluation of the quality of clustering results becomes the critical issue of cluster analysis, i.e., cluster validation. Cluster validation aims to assess the quality of clustering results and to find a cluster scheme that fits a given application. The user's initial estimation of the cluster number is important for choosing the parameters of clustering algorithms in the pre-processing stage of clustering. Also, the user's clear understanding of the cluster distribution is helpful for assessing the quality of clustering results in the post-processing stage. The user's visual perception of the data distribution plays a critical role in these processing stages. Using visualization techniques to explore and understand high dimensional datasets is becoming an efficient way to combine human intelligence with the immense brute-force computation power available nowadays [16]. Visual cluster analysis is a combination of visualization and cluster analysis. As an indispensable aid for human participation, visualization is involved in almost every step of cluster analysis. Many studies have been performed on high dimensional data visualization [2, 15], but most of them do not visualize clusters well for high dimensional and very large data. Section 4 discusses several studies that have focused on visual cluster analysis [1, 7, 8, 10, 13, 14, 17, 18] as the related work of this research. Star Coordinates is a good choice for visual cluster analysis with its interactive adjustment features [11].


2.2 Star Coordinates

The idea of the Star Coordinates technique is intuitive: it extends the perspective of the traditional orthogonal X-Y 2D and X-Y-Z 3D coordinate techniques to higher dimensional spaces [11]. Technically, Star Coordinates plots a 2D plane into n equal sectors with n coordinate axes, where each axis represents a dimension and all axes share the initial point at the centre of a circle on the 2D space. First, data in each dimension are normalized into the [0, 1] or [-1, 1] interval. Then the values of all axes are mapped to orthogonal X-Y coordinates which share the initial point with Star Coordinates on the 2D space. Thus, an n-dimensional data item is expressed as a point in the X-Y 2D plane. Fig. 1 illustrates the mapping from 8 Star Coordinates axes to X-Y coordinates. In practice, projecting high dimensional data onto 2D space inevitably introduces overlapping and ambiguities, even bias. To mitigate the problem, Star Coordinates and its extension iVIBRATE [4] provide several visual adjustment mechanisms, such as axis scaling, axis angle rotation and data point filtering, to change the data distribution of a dataset interactively in order to detect cluster characteristics and render clustering results effectively. Below we briefly introduce the two adjustment features relevant to this research.

Fig. 1. Positioning a point by an 8-attribute vector in Star Coordinates [11]

Axis scaling

The purpose of axis scaling in Star Coordinates (called α-adjustment in iVIBRATE) is to interactively adjust the weight value of each axis so that users can observe the data distribution changes dynamically. For example, the diagram in Fig. 2 shows the original data distribution of Iris (which has 4 numeric attributes and 150 instances) with the clustering indices produced by the K-means clustering algorithm in iVIBRATE, where the clusters overlap (here k=3). A well-separated cluster distribution of Iris, obtained by a series of random α-adjustments, is illustrated in Fig. 3; the clusters are much easier to recognize than in the original distribution of Fig. 2. For tracing how data points change over a certain period of time, Star Coordinates provides the footprint function, discussed below.

Fig. 2. The initial data distribution of clusters of Iris produced by k-means in iVIBRATE

Fig. 3. The separated version of the Iris data distribution in iVIBRATE


Footprint

We use another data set, auto-mpg, to demonstrate the footprint feature. The data set auto-mpg has 8 attributes and 398 items. Fig. 4 presents the footprints of axis tuning of the attributes weight and mpg, where we may find some points with longer traces and some with shorter footprints. The most prominent feature of Star Coordinates and its extensions such as iVIBRATE is that their computational complexity is only linear in time. This makes them very suitable to be employed as visual tools for interactive interpretation and exploration in cluster analysis.

Fig. 4. Footprints of axis scaling of weight and mpg attributes in Star Coordinates [11]

However, cluster exploration and refinement based on the user's intuition inevitably introduce randomness and subjectiveness into visual cluster analysis; as a result, the adjustments of Star Coordinates and iVIBRATE can sometimes be arbitrary and time consuming.

2.3 HOV3

In fact, the Star Coordinates model can be mathematically depicted by the Euler formula $e^{ix} = \cos x + i\sin x$, where $z = x + iy$ and $i$ is the imaginary unit. Let $z_0 = e^{2\pi i/n}$, such that $z_0^1, z_0^2, z_0^3, \ldots, z_0^{n-1}, z_0^n$ (with $z_0^n = 1$) divide the unit circle on the complex 2D plane into n equal sectors. Thus, Star Coordinates can be simply written as:

$$P_j(z_0) = \sum_{k=1}^{n}\left[\frac{d_{jk} - \min d_k}{\max d_k - \min d_k}\cdot z_0^k\right] \qquad (1)$$

where $\min d_k$ and $\max d_k$ represent the minimal and maximal values of the kth coordinate respectively. In any case, equation (1) can be viewed as a mapping from $\mathbb{R}^n$ to $\mathbb{R}^2$. To overcome the arbitrary and random adjustments of Star Coordinates and iVIBRATE, Zhang et al. proposed a hypothesis-oriented visual approach called HOV3 to detect clusters [22]. The idea of HOV3 is that, in analytical geometry, the difference of a data set (a matrix) $D_j$ and a measure vector M with the same number of variables as $D_j$ can be represented by their inner product, $D_j \cdot M$. HOV3 uses a measure vector M to represent the corresponding axes' weight values. Then, given a non-zero measure vector M in $\mathbb{R}^n$ and a family of vectors $P_j$, the projection of $P_j$ against M according to formula (1), i.e., the HOV3 model, is presented as:

$$P_j(z_0) = \sum_{k=1}^{n}\left[\frac{d_{jk} - \min d_k}{\max d_k - \min d_k}\cdot z_0^k \cdot m_k\right] \qquad (2)$$

where $m_k$ is the kth attribute of the measure vector M. The aim of the interactive adjustments of Star Coordinates and iVIBRATE is to obtain some separated groups or a fully separated clustering result by tuning the weight value of each axis, but their arbitrary and random adjustments limit their applicability. As shown in formula (2), HOV3 summarizes these adjustments as a coefficient/measure vector. Comparing formulas (1) and (2), it can be observed that


HOV3 subsumes the Star Coordinates model [22]. Thus the HOV3 model provides users with a mechanism to quantify a prediction about a data set as a measure vector of HOV3 for precisely exploring grouping information. Equation (2) is a standard form of a linear transformation of n variables, where $m_k$ is the coefficient of the kth variable of $P_j$. In principle, any measure vector, even in complex number form, can be introduced into the linear transformation of HOV3 if it can distinguish a data set into groups or produce well-separated clusters visually. Thus the rich statistical methods reflecting the characteristics of a data set can also be introduced as predictions in the HOV3 projection, so that users may discover more clustering patterns. The detailed explanation of this approach is presented next.
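To make the remark about statistical predictions concrete: any per-dimension statistic of the studied data yields an n-vector that is a legal measure vector for equation (2). A small illustration (our own, with a random stand-in dataset):

import numpy as np

X = np.random.rand(150, 4)          # stand-in for a studied dataset

# Candidate predictions: per-dimension statistics as measure vectors;
# each vector can be fed to the HOV3 projection of equation (2).
predictions = {
    "std":    X.std(axis=0),
    "mean":   X.mean(axis=0),
    "median": np.median(X, axis=0),
}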

3 Predictive Visual Cluster Analysis by HOV3


Predictive exploration is a mathematical description of future behavior based on the historical exploration of patterns. The goal of predictive visual exploration by HOV3 is that, by applying a prediction (measure vector) to a dataset, the user may identify the groups from the result of the visualization. Thus the key issue in applying HOV3 to detect grouping information is how to quantify historical patterns (or the user's domain knowledge) as a measure vector to achieve this goal.

3.1 Multiple HOV3 Projection (M-HOV3)

In practice, it is not easy to synthesize historical knowledge about a data set into one vector; rather than using a single measure to implement a prediction test, it is more suitable to apply several predictions (measure vectors) together to the data set. We call this process multiple HOV3 projection, or M-HOV3 for short. We now provide a detailed description of M-HOV3 and its feature of enhanced group separation. To simplify the discussion of the M-HOV3 model, we give a definition first.

Definition 1 (Poly-multiply vectors to a matrix). The inner product of multiplying a series of non-zero measure vectors $M_1, M_2, \ldots, M_s$ to a matrix A is denoted as $A * \prod_{i=1}^{s} M_i = A \cdot M_1 \cdot M_2 \cdots M_s$.

Zhang et al. [23] gave a simple notation of the HOV3 projection as $D_p = HC(P, M)$, where P is a data set and $D_p$ is the data distribution of P obtained by applying a measure vector M. The projection of M-HOV3 is then denoted as $D_p = HC(P, \prod_{i=1}^{s} M_i)$. Based on equation (2), we formulate M-HOV3 as:

$$P_j(z_0) = \sum_{k=1}^{n}\left[\frac{d_{jk} - \min d_k}{\max d_k - \min d_k}\cdot z_0^k \cdot \prod_{i=1}^{s} m_{ik}\right] \qquad (3)$$

where $m_{ik}$ is the kth attribute (dimension) of the ith measure vector $M_i$, and $s \geq 1$. When s = 1, formula (3) reduces to formula (2). We may observe that the single multiplication by $m_k$ in formula (2) is replaced by the poly-multiplication $\prod_{i=1}^{s} m_{ik}$ in formula (3). Formula (3) is more


general and also closer to the real procedure of cluster detection, because it introduces several aspects of domain knowledge together into the cluster detection. In addition, the effect of applying M-HOV3 to datasets with the same measure vector can enhance the separation of grouped data points under certain conditions.
3.2 The Enhanced Separation Feature of M-HOV3

To explain the geometrical meaning of the M-HOV3 projection, we use the real number system. According to equation (2), the general form of the distance σ (i.e., the weighted Minkowski distance) between two points a and b in the HOV3 plane can be represented as:

$$\sigma(a,b,m) = \sqrt[q]{\,\sum_{k=1}^{n} |m_k(a_k - b_k)|^{q}\,} \qquad (q > 0) \qquad (4)$$

If q = 1, σ is the Manhattan (city block) distance; if q = 2, σ is the Euclidean distance. To simplify the discussion of our idea, we adopt the Manhattan metric for the explanation. Note that there exists an equivalent mapping (bijection) of distance calculation between the Manhattan and Euclidean metrics [6]. For example, if the distance between points a and b is longer than the distance between points a′ and b′ in the Manhattan metric, it is also true in the Euclidean metric, and vice versa. The Manhattan distance between points a and b is then calculated as in formula (5):

$$\sigma(a,b,m) = \sum_{k=1}^{n} |m_k(a_k - b_k)| \qquad (5)$$

According to formulas (2), (3) and (5), we can present the distance of M-HOV3 in the Manhattan metric as follows:

$$\sigma\!\left(a,b,\prod_{i=1}^{s} m_i\right) = \sum_{k=1}^{n}\left|\,\prod_{i=1}^{s} m_{ik}\,(a_k - b_k)\right| \qquad (6)$$

Definition 2 (The distance representation of M-HOV3). The distance between two data points a and b projected by M-HOV3 is denoted as $\prod_{i=1}^{s} M_i\,\sigma_{ab}$. In particular, if the measure vectors in an M-HOV3 are all the same, $\prod_{i=1}^{s} M_i\,\sigma_{ab}$ can be simply written as $M^s\sigma_{ab}$; if each attribute of M is 1 (the no-measure case), the distance between points a and b is denoted as $\sigma_{ab}$.

Thus, we have $\prod_{i=1}^{s} M_i\,\sigma_{ab} = HC((a,b), \prod_{i=1}^{s} M_i)$. For example, the distance between two points a and b projected by M-HOV3 with the same two measures can be represented as $M^2\sigma_{ab}$, and the HOV3 projection of a and b can be written as $M\sigma_{ab}$. We now give several important properties of M-HOV3.

Lemma 1. In Star Coordinates space, if $\sigma_{ab} \neq 0$ and $M \neq 0$ ($\exists m_k \in M$, $0 < |m_k| < 1$), then $\sigma_{ab} > M\sigma_{ab}$.

Proof. $\sigma_{ab} = \sum_{k=1}^{n}|a_k - b_k|$ and $M\sigma_{ab} = \sum_{k=1}^{n}|m_k(a_k - b_k)|$, so

$$\sigma_{ab} - M\sigma_{ab} = \sum_{k=1}^{n}|a_k - b_k| - \sum_{k=1}^{n}|m_k(a_k - b_k)| = \sum_{k=1}^{n}|a_k - b_k|\,(1 - |m_k|).$$

342

K.-B. Zhang, M.A. Orgun, and K. Zhang

M0 {$mk0 mkM | 0<|mk|<1, k=1n} (1 | m k |) >0


sab0 sab >(Msab)

This result shows that the distance Msab between points a and b projected by HOV3 with a non-zero M is less than the original distance sab between a and b.
Lemma 2. In Star Coordinates space, if $\delta_{ab} \neq 0$ and $M \neq 0$ ($\forall m_k \in M: 0 < |m_k| < 1$), then $M^n\delta_{ab} > M^{n+1}\delta_{ab}$, $n \in \mathbb{N}$.

Proof
Let $\delta'_{ab} = M^n\delta_{ab}$. By Definition 1, $M^{n+1}\delta_{ab} = M\delta'_{ab}$, and by Lemma 1, $\delta'_{ab} > M\delta'_{ab}$. Hence $M^n\delta_{ab} > M^{n+1}\delta_{ab}$. □

In general, it can be proved by the transitivity of inequality that, in Star Coordinates space, if $\delta_{ab} \neq 0$ and $M \neq 0$ ($\forall m_k \in M: 0 < |m_k| < 1$), then $M^m\delta_{ab} > M^n\delta_{ab}$ for $m, n \in \mathbb{N}$ with $m < n$.
Theorem 1. If the measure vector is changed from $M$ to $M'$ ($\forall m_k \in M: |m_k| \leq 1$; $\forall m'_k \in M': 0 < |m'_k| < 1$) and $|M\delta_{ab} - M\delta_{ac}| < |M'\delta_{ab} - M'\delta_{ac}|$, then

$$\frac{|M'\delta_{ab} - M'\delta_{ac}| - |M'^2\delta_{ab} - M'^2\delta_{ac}|}{|M'\delta_{ab} - M'\delta_{ac}|} > \frac{|M\delta_{ab} - M\delta_{ac}| - |M'\delta_{ab} - M'\delta_{ac}|}{|M\delta_{ab} - M\delta_{ac}|}$$

Proof

$$M'\delta_{ab} = \sum_{k=1}^{n} |m'_k(a_k - b_k)| \quad \text{and} \quad M'\delta_{ac} = \sum_{k=1}^{n} |m'_k(a_k - c_k)|$$

so that

$$M'\delta_{ab} - M'\delta_{ac} = \sum_{k=1}^{n} |m'_k|\big[|a_k - b_k| - |a_k - c_k|\big], \qquad M'^2\delta_{ab} - M'^2\delta_{ac} = \sum_{k=1}^{n} |m'_k|^2\big[|a_k - b_k| - |a_k - c_k|\big]$$

Let $|a_k - b_k| = x_k$ and $|a_k - c_k| = y_k$; the two differences become $\sum_{k=1}^{n} |m'_k|(x_k - y_k)$ and $\sum_{k=1}^{n} |m'_k|^2(x_k - y_k)$. Writing $|M'\delta_{ab} - M'\delta_{ac}| = M'\delta_{xy}$ and $|M'^2\delta_{ab} - M'^2\delta_{ac}| = M'^2\delta_{xy}$, Lemma 2 gives $M'^2\delta_{xy} < M'\delta_{xy}$ (the assumption guarantees $M'\delta_{xy} \neq 0$), i.e.,

$$\frac{|M'^2\delta_{ab} - M'^2\delta_{ac}|}{|M'\delta_{ab} - M'\delta_{ac}|} < 1$$

Hence $|M'^2\delta_{ab} - M'^2\delta_{ac}| < |M'\delta_{ab} - M'\delta_{ac}|$, and combining this with the assumption $|M\delta_{ab} - M\delta_{ac}| < |M'\delta_{ab} - M'\delta_{ac}|$ yields

$$|M'^2\delta_{ab} - M'^2\delta_{ac}| \cdot |M\delta_{ab} - M\delta_{ac}| < |M'\delta_{ab} - M'\delta_{ac}|^2$$

that is,

$$\frac{|M'^2\delta_{ab} - M'^2\delta_{ac}|}{|M'\delta_{ab} - M'\delta_{ac}|} < \frac{|M'\delta_{ab} - M'\delta_{ac}|}{|M\delta_{ab} - M\delta_{ac}|}$$

Subtracting each side from 1 reverses the inequality:

$$1 - \frac{|M'^2\delta_{ab} - M'^2\delta_{ac}|}{|M'\delta_{ab} - M'\delta_{ac}|} > 1 - \frac{|M'\delta_{ab} - M'\delta_{ac}|}{|M\delta_{ab} - M\delta_{ac}|}$$

which is the claimed inequality. □

Theorem 1 shows that if the user observes that the difference between the distance of a and b and the distance of a and c is relatively increased by tuning the weight values of the axes from M to M′ (this can be observed by the footprints of points a, b and c, as shown in Fig. 4), then after applying M-HOV3 to a, b and c, the distance variation rate between the pairs of points (a, b) and (a, c) is enhanced, as presented in Fig. 5.

In other words, if it is observed that several data point groups can be roughly separated visually (there may exist ambiguous points between groups) by projecting a measure vector in HOV3 onto a data set, then applying M-HOV3 with the same measure vector to the data set would lead to the groups being more compact, i.e., to a good separation of the groups.
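A small numeric check may make the effect tangible. The points and measure below are toy values of our own choosing, not data from the paper; the script applies the same measure repeatedly and prints Manhattan distances, whose relative gap grows as Theorem 1 predicts.

```python
import numpy as np

# Toy check of the separation effect (illustrative values, not from the paper):
# three 2-D points and a measure vector with all |m_k| < 1.
a, b, c = np.array([0.9, 0.1]), np.array([0.8, 0.2]), np.array([0.2, 0.9])
M = np.array([0.5, 0.9])

manhattan = lambda u, v, w: np.sum(np.abs(w * (u - v)))

for s in range(4):                      # apply the same measure s times
    w = M ** s
    d_ab, d_ac = manhattan(a, b, w), manhattan(a, c, w)
    # all distances shrink with each application (Lemma 2), while the
    # relative gap |d_ac - d_ab| / d_ac grows: the enhanced separation
    print(f"s={s}: d_ab={d_ab:.4f}  d_ac={d_ac:.4f}  "
          f"relative gap={(d_ac - d_ab) / d_ac:.4f}")
```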
3.3 Predictive Cluster Exploration by M-HOV3

According to the notation of the HOV3 projection of a dataset P as Dp = HC(P, M), the M-HOV3 projection with the same measure applied n times is denoted as Dp = HC(P, Mⁿ), where n ∈ ℕ. We use the auto-mpg dataset again as an example to demonstrate predictive cluster exploration by M-HOV3. Fig. 6a illustrates the original data distribution of auto-mpg produced by HOV3 in MATLAB, where it is not possible to recognize any group information. We then tuned each axis manually and roughly distinguished three groups, as shown in Fig. 6b. The weight values of the axes were recorded as a vector M = [0.10, 0, 0.25, 0.2, 0.8, 0.85, 0.1, 0.95]. Fig. 6b shows that there exist several ambiguous data points between the groups. We then employed M² (the poly-multiplied measure) as a predictive measure vector and applied it to the data set auto-mpg. The projected distribution Dp2 of auto-mpg is presented in Fig. 6c. It is much easier to identify the 3 groups of auto-mpg in Fig. 6c than in Fig. 6b. To show the contrast between these two diagrams Dp1 and Dp2, we overlap them in Fig. 6d. By analyzing the data of these 3 groups, we have found that group 1 contains 70 items with origin value 2 (Europe); group 2 has 79 instances with origin value 3 (Japan); and group 3 includes 249 records with origin value 1 (USA). This natural grouping based on the user's intuition serendipitously clustered the data set according to the origin attribute of auto-mpg. In the same way, the user may find more grouping information through interactive cluster exploration by applying predictive measurements.

Fig. 5. The contraction and separation effect of M-HOV3


Fig. 6a. The original data distribution of auto-mpg

Fig. 6b. Dp1=HC (auto-mpg, M)

Fig. 6c. Dp2=HC (auto-mpg, M2)

Fig. 6d. The overlapping diagram of Dp1 and Dp2

Fig. 6. Diagrams of data set auto-mpg projected by HOV3 in MATLAB

3.4 Predictive Cluster Exploration by HOV3 with Statistical Measurements

Many statistical measurements, such as mean, median, standard deviation and so on, can be directly introduced into HOV3 as predictions to explore data distributions. In fact, prediction based on statistical measurements makes cluster exploration more purposeful, and gives an easier geometrical interpretation of the data distribution. We use the Iris dataset as an example. As shown in Fig. 3, by random axis scaling the user can divide the Iris data into 3 groups. This example exhibits that cluster exploration based on random adjustment may expose data grouping information, but sometimes it is hard to interpret such groupings. We employ the standard deviation of Iris, M = [0.2302, 0.1806, 0.2982, 0.3172, 0.4089], as a prediction to project Iris by HOV3 in iVIBRATE. The result is shown in Fig. 7, where 3 groups clearly exist. It can be observed in Fig. 7 that there is a blue point in the pink-colored cluster and a pink point in the green-colored cluster, resulting from the K-means clustering algorithm with k=3. Intuitively, they have been wrongly clustered. We re-clustered them by their distributions, as shown in Fig. 8. The contrast between the clusters (Ck) produced by the K-means clustering algorithm and the new clustering result (CH) projected by HOV3 is summarized in Table 1.


Fig. 7. Data distribution of Iris projected by HOV3 in iVIBRATE with cluster indices marked by K-means

Fig. 8. Data distribution of Iris projected by HOV3 in iVIBRATE with the new clustering indices by the user's intuition

We can see that the quality of the new clustering result of Iris is better than that obtained by K-means according to their Variance comparison. Each cluster projected by HOV3 has a higher similarity than that produced by K-means. By analyzing the newly grouped data points of Iris, we have found that they are distinguished by the class attribute of Iris, i.e., Iris-setosa, Iris-versicolor and Iris-virginica. Cluster 1 generated by K-means is an outlier.
Table 1. The statistics of the clusters in Iris produced by K-means (Ck, k=3) and by HOV3 with a predictive measure (CH)

Ck   %        Radius   Variance   MaxDis        CH   %        Radius   Variance   MaxDis
1    1.333    1.653    2.338      3.306         1    33.333   5.753    0.152      6.113
2    32.667   5.754    0.153      6.115         2    33.333   8.210    0.207      8.736
3    33.333   8.196    0.215      8.717         3    33.333   7.112    0.180      7.517
4    33.333   7.092    0.198      7.582
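As a sketch of how such statistical measures plug into the projection (reusing the hypothetical m_hov3 helper from Section 3.1; the array names iris and auto_mpg are assumed to hold the respective datasets):

```python
import numpy as np

# Statistical measures as predictions, reusing m_hov3 from Section 3.1.
# `iris` is assumed to be a NumPy array of the (normalized) Iris attributes.
M_std = iris.std(axis=0)                    # standard deviation per column
Dp_std = m_hov3(iris, [M_std])

# A covariance-matrix row works the same way; e.g., for auto-mpg,
# the row associated with the `origin` attribute (the 8th row, as in the text):
M_cov = np.cov(auto_mpg, rowvar=False)[7]
Dp_cov = m_hov3(auto_mpg, [M_cov])
```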

With statistical predictions in HOV3, the user may even expose cluster clues that are not easily found by random adjustments. For example, we adopted the 8th row of auto-mpg's covariance matrix as a predictive measure, (0.04698, -0.07657, -0.06580, 0.00187, -0.05598, 0.01343, 0.02202, 0.16102), to project auto-mpg by HOV3 in MATLAB. The result is shown in Fig. 9. We grouped the points by their distribution, as in Fig. 10. Table 2 (right part) reports the statistics of the clusters generated by the projection of HOV3, and reveals that the points in each cluster have very high similarity. As we chose the 8th row of auto-mpg's covariance matrix as the prediction, the result mainly depends on the 8th column of the auto-mpg data, i.e., origin (country). Fig. 10 shows that C1, C2 and C3 are closer because they have the same origin value 1. The more detailed formation of the clusters is given in the right part of Table 2. We believe that a domain expert could give a better and more intuitive explanation of this clustering. We then chose cluster number 5 to cluster auto-mpg by K-means. Its clustering result is presented in the left part of Table 2. Comparing their corresponding statistics, we can see that, according to the Variance of the clusters, the quality of the clustering result by


Fig. 9. Data distribution of auto-mpg projected by HOV3 in MATLAB with the 8th row of auto-mpg's covariance matrix as the prediction

Fig. 10. Clustered distribution of the data in Fig. 9 by the user's intuition

Table 2. The statistical contrast of clusters in auto-mpg produced by K-means and HOV3

Clusters produced by K-means (k=5):

C    %        Radius     Variance   MaxDis
1    0.503    681.231    963.406    1362.462
2    18.090   2649.108   0.206      2649.414
3    16.080   2492.388   0.139      2492.595
4    21.608   3048.532   0.207      3048.897
5    25.377   3873.052   0.220      3873.670
6    18.593   2417.804   0.148      2417.990

Clusters generated by the user's intuition on the data distribution:

C    Origin   Cylinders   %        Radius     Variance   MaxDis
1    1        8           25.879   4129.492   0.130      4129.768
2    1        6           18.583   3222.493   0.098      3222.720
3    1        4           18.090   2441.881   0.090      2442.061
4    2        4           17.588   2427.449   0.142      2427.632
5    3        3           19.849   2225.465   0.093      2225.658

HOV3 with the covariance prediction of auto-mpg is better than the one produced by K-means (k=5; cluster 1 produced by K-means is an outlier).
3.5 Predictive Cluster Validation by HOV3

In practice, with extremely large datasets, it is infeasible to cluster an entire data set within an acceptable time scale. A common solution used in data mining is that clustering algorithms are first applied to a training (sampled) subset of data from a database to extract cluster patterns, and then the cluster scheme is assessed to see whether it is suitable for the other subsets in the database. This procedure is regarded as external cluster validation [21]. Due to the high computational cost of statistical methods for assessing the consistency of cluster structures between large subsets, achieving this goal by statistical methods is still a challenge in data mining. Based on the assumption that if two same-sized data sets have a similar cluster structure, then after applying the same linear transformation to both, the similarity of the newly produced distributions of the two sets will still be high, Zhang et al. proposed a visual external validation approach by HOV3 [23]. Technically, their approach takes a clustered subset and a same-sized unclustered subset from a database as the observation, applying the measure vectors that can separate clusters in the clustered subset by HOV3. Thus each cluster and the data points it geometrically covers (called a quasi-cluster in their approach) are selected. Finally, the overlapping rate of each cluster-quasi-cluster pair is calculated; if the overlapping rate approaches 1, this means that the two subsets have a similar cluster distribution. Compared with the statistics-based validation methods, their method is not only visually intuitive, but also more effective in real applications [23].
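The sketch below illustrates one plausible reading of that overlap computation; it is our own illustration rather than the code of [23], and both the eps-based coverage test and the size-ratio formula are assumptions.

```python
import numpy as np

def overlap_rate(proj_cluster, proj_unclustered, eps=0.05):
    """Overlap rate of a cluster and its quasi-cluster (a sketch).

    proj_cluster / proj_unclustered: complex 1-D arrays of HOV3-projected
    points (one cluster of the clustered subset, and the whole unclustered
    subset). A point counts as geometrically covered here if it lies within
    eps of some cluster point -- a simple stand-in for quasi-cluster selection.
    """
    d = np.abs(proj_unclustered[:, None] - proj_cluster[None, :])
    quasi_size = np.count_nonzero(d.min(axis=1) <= eps)
    # a rate close to 1 suggests the two subsets share this cluster's structure
    return min(quasi_size, len(proj_cluster)) / max(quasi_size, len(proj_cluster))
```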

As mentioned above, it is sometimes time consuming to separate clusters manually in Star Coordinates or iVIBRATE. Thus, the separation of clusters from many overlapping points is an aim of this research. As we described above, approaches such as M-HOV3 and HOV3 with statistical measurements can be introduced into external cluster validation by HOV3. In principle, any linear transformation can be employed in HOV3 if it can separate clusters well. We therefore introduce a complex linear transformation into this process. We again use the auto-mpg data set as an example. As shown in Fig. 6b, three roughly separated clusters appear there, where the vector M = [0.10, 0, 0.25, 0.2, 0.8, 0.85, 0.1, 0.95] was obtained from the axes values. We then adopt cos(M·10i) as a prediction, where i is the imaginary unit. The projection of HOV3 with cos(M·10i) is illustrated in Fig. 11, where the three clusters are separated very well. In the same way, many other linear transformations can be applied to different datasets to obtain well-separated clusters. With fully separated clusters, there will be a marked improvement in the efficiency of visual cluster validation.

Fig. 11. The data distribution of auto-mpg projected by HOV3 with cos(M·10i) as the prediction
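One way to read this prediction: for a real weight m, cos(10·m·i) = cosh(10·m), which is real-valued and grows rapidly with m, so small differences between axis weights become large differences between measure components. A minimal sketch, again reusing the hypothetical m_hov3 helper from Section 3.1:

```python
import numpy as np

# A complex linear-transformation prediction, reusing m_hov3 from Section 3.1.
M = np.array([0.10, 0, 0.25, 0.2, 0.8, 0.85, 0.1, 0.95])

# cos(M * 10i) = cosh(10 * M) is real-valued; it stretches large weights
# far more than small ones, which can pull roughly separated groups apart.
M_complex = np.cos(M * 10j).real            # equivalently: np.cosh(10 * M)
Dp = m_hov3(auto_mpg, [M_complex])          # auto_mpg: assumed data matrix
```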

4 Related Work
Visualization is typically employed as an observational mechanism to assist users with intuitive comparisons and a better understanding of the studied data. Instead of quantitatively contrasting clustering results, most of the visualization techniques employed in cluster analysis focus on providing users with an easy and intuitive understanding of the cluster structure, or explore clusters randomly. For instance, Multidimensional Scaling (MDS) [14] and Principal Component Analysis (PCA) [10] are two commonly used multivariate analysis techniques. However, the relatively high computational cost of MDS (polynomial time O(N²)) limits its usability for very large datasets, and PCA first has to find the correlated variables for reducing the dimensionality, which makes it unsuitable for the exploration of unknown data. OPTICS [1] uses a density-based technique to detect cluster structure and visualizes clusters as Gaussian bumps, but its non-linear time complexity makes it neither suitable for dealing with very large data sets, nor for providing the contrast between clustering results. H-BLOB visualizes clusters as blobs in a 3D hierarchical structure [17]. It is an intuitive cluster rendering technique, but its 3D, two-stage expression restricts it from interactively investigating cluster structures beyond the existing clusters. Kaski et al. [13] use self-organizing maps (SOM) to project high-dimensional data sets to 2D space for matching visual models [12]. However, the SOM technique is based on a single projection strategy and is not powerful enough to discover all the interesting features from the original data set. Huang et al. [7, 8] proposed approaches based on FastMap [5] to assist users in identifying and verifying the validity of clusters in visual form. Their techniques work well for cluster identification, but are unable to evaluate cluster quality very well. Moreover, these techniques are not well suited to the interactive investigation of the data distributions of high-dimensional data sets. A recent survey of visualization techniques in cluster analysis can be found in the literature [18].

5 Conclusions
In this paper, we have proposed a prediction-based visual approach to explore and verify clusters. This approach uses the HOV3 projection technique and quantifies previously obtained knowledge and statistical measurements about a high dimensional data set as predictions, so that users can utilize the predictions to project the data onto a 2D plane in order to investigate grouping clues or verify the validity of clusters based on the distribution of the data. This approach not only inherits the intuitive and easily understood features of visualization, but also avoids the weaknesses of randomness and arbitrary exploration of the existing visual methods employed in data mining. As a consequence, with the advantage of the quantified predictive measurement of this approach, users can identify the cluster number efficiently in the pre-processing stage of clustering, and can also intuitively verify the validity of clusters in the post-processing stage of clustering.

References
1. Ankerst, M., Breunig, M.M., Kriegel, H.-P., Sander, J.: OPTICS: Ordering points to identify the clustering structure. In: Proc. of ACM SIGMOD Conference, pp. 49-60. ACM Press, New York (1999)
2. Ankerst, M., Keim, D.: Visual Data Mining and Exploration of Large Databases. In: 5th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'01), Freiburg, Germany (September 2001)
3. Berkhin, P.: A Survey of Clustering Data Mining Techniques. In: Jacob, K., Charles, N., Marc, T. (eds.) Grouping Multidimensional Data, pp. 25-72. Springer, Heidelberg (2006)
4. Chen, K., Liu, L.: iVIBRATE: Interactive visualization-based framework for clustering large datasets. ACM Transactions on Information Systems (TOIS) 24(2), 245-294 (2006)
5. Faloutsos, C., Lin, K.: FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia data sets. In: Proc. of ACM SIGMOD, pp. 163-174 (1995)
6. Fleming, W.: Functions of Several Variables. In: Gehring, F.W., Halmos, P.R. (eds.), 2nd edn. Springer, Heidelberg (1977)
7. Huang, Z., Cheung, D.W., Ng, M.K.: An Empirical Study on the Visual Cluster Validation Method with FastMap. In: Proc. of DASFAA'01, pp. 84-91 (2001)
8. Huang, Z., Lin, T.: A visual method of cluster validation with FastMap. In: Terano, T., Chen, A.L.P. (eds.) PAKDD 2000. LNCS, vol. 1805, pp. 153-164. Springer, Heidelberg (2000)


9. Jain, A., Murty, M.N., Flynn, P.J.: Data Clustering: A Review. ACM Computing Surveys 31(3), 264-323 (1999)
10. Jolliffe, I.T.: Principal Component Analysis. Springer Press, Heidelberg (2002)
11. Kandogan, E.: Visualizing multi-dimensional clusters, trends, and outliers using star coordinates. In: Proc. of ACM SIGKDD Conference, pp. 107-116. ACM Press, New York (2001)
12. Kohonen, T.: Self-Organizing Maps, 2nd extended edn. Springer, Berlin (1997)
13. Kaski, S., Sinkkonen, J., Peltonen, J.: Data Visualization and Analysis with Self-Organizing Maps in Learning Metrics. In: Kambayashi, Y., Winiwarter, W., Arikawa, M. (eds.) DaWaK 2001. LNCS, vol. 2114, pp. 162-173. Springer, Heidelberg (2001)
14. Kruskal, J.B., Wish, M.: Multidimensional Scaling. SAGE University Paper Series on Quantitative Applications in the Social Sciences, pp. 7-11. Sage Publications, CA (1978)
15. Oliveira, M.C., Levkowitz, H.: From Visual Data Exploration to Visual Data Mining: A Survey. IEEE Transactions on Visualization and Computer Graphics 9(3), 378-394 (2003)
16. Pampalk, E., Goebl, W., Widmer, G.: Visualizing Changes in the Structure of Data for Exploratory Feature Selection. In: SIGKDD'03, Washington, DC, USA (2003)
17. Sprenger, T.C., Brunella, R., Gross, M.H.: H-BLOB: A Hierarchical Visual Clustering Method Using Implicit Surfaces. In: Proc. of the Conference on Visualization '00, pp. 61-68. IEEE Computer Society Press, Los Alamitos (2000)
18. Seo, J., Shneiderman, B.: From Integrated Publication and Information Systems to Virtual Information and Knowledge Environments. In: Hemmje, M., Niedere, C., Risse, T. (eds.) From Integrated Publication and Information Systems to Information and Knowledge Environments. LNCS, vol. 3379. Springer, Heidelberg (2005)
19. Shneiderman, B.: Inventing Discovery Tools: Combining Information Visualization with Data Mining. In: Jantke, K.P., Shinohara, A. (eds.) DS 2001. LNCS (LNAI), vol. 2226, pp. 17-28. Springer, Heidelberg (2001)
20. Weiss, S.M., Indurkhya, N.: Predictive Data Mining: A Practical Guide. Morgan Kaufmann Publishers, San Francisco (1998)
21. Vilalta, R., Stepinski, T., Achari, M.: An Efficient Approach to External Cluster Assessment with an Application to Martian Topography. Technical Report UH-CS-05-08, Department of Computer Science, University of Houston (2005)
22. Zhang, K.-B., Orgun, M.A., Zhang, K.: HOV3: An Approach for Cluster Analysis. In: Li, X., Zaïane, O.R., Li, Z. (eds.) ADMA 2006. LNCS (LNAI), vol. 4093, pp. 317-328. Springer, Heidelberg (2006)
23. Zhang, K.-B., Orgun, M.A., Zhang, K.: A Visual Approach for External Cluster Validation. In: Proc. of IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007), Honolulu, Hawaii, USA, April 1-5, 2007, pp. 576-582. IEEE Press, Los Alamitos (2007)

Predictive Hypothesis Oriented Cluster Analysis by Visualization

Ke-Bing Zhang¹, Mehmet A. Orgun¹, Kang Zhang²

¹ Department of Computing, ICS, Macquarie University, NSW 2109, Australia
  {kebing, mehmet}@ics.mq.edu.au
² Department of Computer Science, University of Texas at Dallas, TX 75083-0688, USA
  kzhang@utdallas.edu

Abstract
Clustering is a widely applied technique in data mining, and many clustering algorithms have been developed for real-world applications. However, when dealing with arbitrarily shaped cluster distributions, most existing automated clustering algorithms suffer in terms of efficiency, and they are sometimes not suitable for clustering extremely large and high dimensional datasets. On the other hand, the high computational cost of statistics-based cluster validation methods is another obstacle to the application of cluster analysis in data mining. As a remedy, visualization techniques have been introduced into cluster analysis, and they have been very helpful for the analysis of high-dimensional data. However, most visualization techniques employed in cluster analysis are mainly used as tools for information rendering, rather than for investigating how data behavior changes with variations of the parameters of the algorithms. In addition, the impreciseness of visualization limits its usability in contrasting the grouping information of data precisely. This paper presents a visual technique called HOV3 (Hypothesis Oriented Verification and Validation by Visualization) to map high dimensional data onto 2D space with quantified measurements. Users can therefore quantify domain knowledge or historical patterns about datasets as predictions to detect clusters and verify clustering results effectively by HOV3.

1 Introduction

Predictive knowledge discovery is regarded as the procedure of using existing knowledge to deduce, reason about and establish predictions, and to verify the validity of the predictions. Through the validation process, the knowledge may be revised and enriched with new knowledge [Wei98]. The methodology of predictive knowledge discovery is also used in the clustering process [Ber06]. Cluster analysis is a very important knowledge mining method for large-scale data. It is widely applied in data mining, in areas ranging from image processing, marketing, customer behavior analysis, business trend prediction and bioinformatics to geological science, and so on. Cluster analysis includes two major aspects: clustering and cluster validation. Clustering aims at identifying groups of objects according to given criteria. Each group of objects is called a cluster, where the similarity of objects is high within clusters and low between clusters. To achieve different application purposes, a large number of clustering algorithms have been developed [JMF99, Ber06]. However, there are no general-purpose clustering algorithms that fit all kinds of applications; thus, the evaluation of the quality of clustering results plays the critical role in cluster analysis, i.e., cluster validation, which aims to assess the quality of clustering results and to find a cluster scheme that fits a specific application.

In practice, cluster analysis is not always successfully applied to databases in data mining. This is because most of the existing automated clustering algorithms do not deal very well with arbitrarily shaped data distributions, and statistics-based cluster validation methods incur a very high computational cost in cluster analysis. The user's initial estimation of the cluster number is very important for choosing the parameters of clustering algorithms in the pre-processing stage of clustering. Also, the user's clear understanding of the cluster distribution is helpful for assessing the quality of clustering results in the post-processing stage of clustering. All these issues rely heavily on the user's visual perception of the data distribution. Clearly, visualization is a crucial aspect of cluster exploration and verification in cluster analysis. Visual presentations can be very powerful in revealing trends, highlighting outliers, showing clusters, and exposing gaps in data [Shn01]. Therefore, introducing visualization techniques to explore and understand high dimensional datasets is becoming an efficient way to combine human intelligence with the immense brute-force computation power available nowadays [PGW03]. The visualization methods utilized in cluster analysis map high-dimensional data to 2D or 3D space and provide users with intuitive and easily understood graphs and/or images that reveal the grouping relationships among the data.

Visual cluster analysis is a combination of visualization and cluster analysis. As an indispensable exploration technique, visualization is involved in almost every step of cluster analysis. Clustering algorithms normally deal with data sets of high dimensionality (>3D). Thus, the choice of a technique fit for visualizing clusters of high dimensional data is the first task of visual cluster analysis. Many research efforts have been made on multidimensional data visualization [WoB94], but those earlier techniques are not suitable for visualizing the cluster structure of very high dimensional and very large datasets. With the increasing application of clustering in data mining over the last decade, more and more visualization techniques have been developed to study the structure of datasets in cluster analysis applications [OlL03, Shn05]. However, in practice, those visualization techniques tend to treat cluster visualization simply as a layout problem. They mainly focus on rendering the cluster structure, rather than on investigating how data behavior changes with variations of the parameters of the algorithms used. There have been many research efforts on visual cluster analysis [ABK+99, ChL04, HCN01, HuL00, Kan01, KSP01, HKW99, SBG00]. They normally facilitate an arbitrary exploration of grouping information, and this causes them to be inefficient and time consuming in the cluster exploration stage. On the other hand, the impreciseness of visualization limits its usability in the quantitative verification and validation of clustering results. Thus the motivation of our work is to develop a visualization technique that supports more purposeful cluster detection and more precise contrast of data distributions to assist researchers in cluster analysis.

As a solution to the above problems, in this paper we propose a novel visual projection technique, Hypothesis Oriented Verification and Validation by Visualization (HOV3), which projects high-dimensional datasets onto 2D space using the user's quantified measurements [ZOZ06]. Based on the quantified measurement feature of HOV3, we also present a distribution-matching based visual external cluster validation model to verify the consistency of cluster structures between clustered and non-clustered subsets [ZOZ07a]. To separate overlapping clusters, we introduce a visual approach called M-HOV3 that enhances the visual separation of clusters [ZOZ07b]. With the enhanced separation feature of M-HOV3, the user not only can separate overlapping clusters efficiently in the post-processing stage of clustering, but also can obtain more cluster clues effectively in the pre-processing stage of clustering.

The rest of this paper is organized as follows. Section 2 briefly introduces the current issues of cluster analysis and the visual techniques that have been employed in cluster analysis. A review of related work on cluster analysis by visualization is presented in Section 3. Section 4 describes our Hypothesis Oriented Verification and Validation by Visualization (HOV3) model. Section 5 demonstrates the use of HOV3 to achieve purposeful cluster exploration with given quantified measurements as predictions, and to verify the consistency of the cluster structure by distribution matching based on the quantified measurements of HOV3. Section 6 focuses on external cluster validation by HOV3 on several well-known data sets. Finally, Section 7 summarizes the contributions of this paper.

2 Background

Cluster analysis is an iterative process of clustering and cluster verification by the user, facilitated by clustering algorithms, cluster validation methods, visualization, and domain knowledge about the databases.

2.1 The Issue of Clustering

Clustering is responsible for assigning objects of the studied data into groups based on a given distinguishing strategy. Hundreds of clustering algorithms have been developed to deal with data sets in different real-world applications [JMF99, Ber06]. The existing clustering algorithms are not always successfully applied to very large databases, because they perform well when clustering spherical or regularly shaped datasets, but are not very effective in dealing with arbitrarily shaped clusters. Several research efforts have been made to deal with datasets with arbitrarily shaped cluster distributions [ZRL96, EKS+96, SCZ98, GRS98, ABK+99]. However, those approaches still have some drawbacks in handling irregularly shaped clusters. For example, CURE [GRS98] and BIRCH [ZRL96] perform well on low dimensional datasets; however, as the dimensionality of the data increases, they encounter high computational complexity. Other approaches, such as the density-based clustering techniques DBSCAN [EKS+96] and OPTICS [ABK+99] and wavelet-based clustering techniques such as WaveCluster [SCZ98], attempt to cope with this problem, but their non-linear complexity often makes them unsuitable for the analysis of very large datasets. As Abul et al. pointed out, "in high dimensional spaces, traditional clustering algorithms tend to break down in terms of efficiency as well as accuracy because data do not cluster well anymore" [AAP+03]. A recent survey of clustering algorithms can be found in the literature [JMF99, Ber06].

2.2 The Issue of Cluster Validation

Selecting, from hundreds of clustering algorithms with variable parameters, a cluster scheme that fits a specific application is hard. Different clustering results may be obtained by applying different clustering algorithms to the same data set, or even by applying the same clustering algorithm with different parameters to the same data set. Moreover, the very high computational cost of statistics-based cluster validation methods directly impacts the efficiency of cluster validation, since cluster validation is the procedure of comparing previously produced cluster patterns with newly produced cluster patterns to evaluate the genuine cluster structure of a data set. In general, cluster validation methods are classified into the following three categories [HBV01, JMF99, ThK99]: (1) Internal approaches: they assess the clustering results by applying an algorithm with different parameters on a data set and finding the optimal solution [AAP+03]; (2) Relative approaches: the idea of relative assessment is based on the evaluation of a clustering structure by comparing it to other clustering schemes [HaK01]; and (3) External approaches: the external assessment of clustering is based on the idea that there exist known a priori cluster indices produced by a clustering algorithm, and then assessing the consistency of the clustering structures generated by applying the clustering algorithm to different data sets [HKK05].

2.3 External Cluster Validation

External cluster validation is a procedure of hypothesis testing: given a set of class labels produced by a cluster scheme, it is compared with the clustering results obtained by applying the same cluster scheme to the other partitions of a database, as shown in Figure 1.

Figure 1. External cluster validation by statistics-based methods

Statistical methods for quality assessment, such as the Rand statistic [Ran71], the Jaccard coefficient [Jac08], the Fowlkes and Mallows index [MSS83], Hubert's Γ statistic and the normalized Γ statistic [ThK99], and the Monte Carlo method [Mil81], are employed in external cluster validation to measure the similarity between the a priori modeled partitions and the clustering results of a dataset. Recent surveys on cluster validation methods can be found in the literature [HBV02, HKK05, ThK99].
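As a concrete instance of one of these indices, the sketch below computes the Rand statistic [Ran71] over all point pairs; the function name and the pure-Python pairwise loop are our own illustrative choices.

```python
from itertools import combinations

# Sketch: the Rand statistic [Ran71] -- the fraction of point pairs on
# which two labelings agree (grouped together in both, or apart in both).
def rand_index(labels_a, labels_b):
    agree, total = 0, 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += (same_a == same_b)   # True counts as 1
        total += 1
    return agree / total              # 1.0 means identical partitions
```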

3 Related Work
This section discusses related work on visualization techniques and tools that have been proposed for cluster representation and analysis. Of particular interest is the Star Coordinates technique proposed by Kandogan [Kan01], which inspired HOV3.

3.1 Visual Cluster Representation

There have been many studies on multidimensional data visualization. However, most of the proposed techniques do not visualize the cluster structure very well for high dimensional or very large databases [OlL03]. For example, icon-based methods [Pic70, Che73, KeK94] can display high-dimensional properties of data. However, as the amount of data increases substantially, the user may find it hard to understand most properties of the data intuitively, since the user cannot focus on the details of each icon. Plot-based data visualization approaches such as Scatterplot-Matrices [Cle93] and similar techniques [AlC91, CBC+5] visualize data in rows and columns of cells containing simple graphical depictions. These techniques give visual information about pairs of attributes, but do not give the best overview of the whole dataset, and they are simply not able to present clusters in the dataset very well. Parallel Coordinates [Ins97] utilizes equidistant parallel axes to visualize each attribute of a given dataset and projects multiple dimensions onto a two-dimensional surface. Star Plots [Fie79] arranges coordinate axes on a circular space with equal angles between neighbouring axes from the centre of a circle and links the data points on each axis by lines to form a star. In principle, these techniques can provide visual presentations of any number of attributes. However, neither Parallel Coordinates nor Star Plots is adequate to give the user a clear overall insight into the data distribution when the dataset is huge, primarily due to the unavoidably high overlap among points. Another drawback of these two techniques is that while they can supply a more intuitive visual relationship between neighbouring axes, the visual presentation for non-neighbouring axes may confuse the user's perception.

3.2 Visual Cluster Analysis

A large number of clustering algorithms have been developed, but only a small number of cluster visualization tools are available to facilitate researchers' understanding of clustering results [SeS05]. Several research efforts have been made in the area of visual cluster analysis [ABK+99, ChL04, HCN01, HuL00, Kan01, KSP01, HKW99, SBG00]. While these techniques help users make intuitive comparisons and understand cluster structures better, they do not focus on the assessment of the quality of clusters. For example, OPTICS [ABK+99] uses a density-based technique to detect cluster structures and visualizes them as Gaussian bumps, but its non-linear time complexity makes it neither suitable for dealing with very large data sets, nor for providing the contrast between clustering results. Kaski et al. [KSP01] employ the self-organizing map (SOM) technique to project multidimensional data sets to 2D space for matching visual models [Koh97]. However, the SOM technique is based on a single projection strategy and is not powerful enough to discover all the interesting features from the original data. H-BLOB visualizes clusters as blobs in a 3D hierarchical structure [SBG00]. It is an intuitive cluster rendering technique, but its 3D, two-stage expression limits it in the interactive investigation of cluster structures. Huang et al. [HCN01, HuL00] proposed approaches based on FastMap [FaL95] to assist users in identifying and verifying the validity of clusters in visual form. Their techniques are good at cluster identification, but are not able to deal with the evaluation of cluster quality very well. HD-Eye [HKW99] is an interactive visual clustering system based on density plots of any two interesting dimensions, but it lacks the ability to help the user understand inter-cluster relationships. To verify the validity of clustering results by visualization, VISTA adopts landmark points as representatives of a clustered subset and re-samples them to deal with cluster validation [ChL04]. However, its experience-based landmark point selection does not always handle the scalability of data very well, because the representative landmark points selected in one subset may fail in other subsets of a database. Star Coordinates, with its interactive adjustment features, is very suitable for visual cluster analysis [Kan01]. Since the starting point of the approach reported in this paper is the Star Coordinates technique, we describe it in more detail next. A survey of other works on visual cluster analysis can be found in the literature [SeS05].

3.3 Star Coordinates

The idea of the Star Coordinates technique is intuitive: it extends the perspective of the traditional orthogonal 2D X-Y and 3D X-Y-Z coordinate techniques to higher dimensional spaces [Kan01]. Star Coordinates divides a 2D plane into n equal sectors with n coordinate axes, where each axis represents a dimension and all axes share their initial points at the centre of a circle on the 2D space. First, the data in each dimension are normalized into the [0, 1] or [-1, 1] interval. Then the values on all axes are mapped to orthogonal X-Y coordinates which share the initial point with Star Coordinates on the 2D space. Thus, an n-dimensional data item is expressed as a point in the X-Y 2D plane. Figure 2 illustrates the mapping from 8 Star Coordinates axes to X-Y coordinates. Formula (1) states the mathematical description of Star Coordinates.

Figure 2. Positioning a point by an 8-attribute vector in Star Coordinates [Kan01]
$$p_j(x, y) = \Big( \sum_{i=1}^{n} u_{xi}\,(d_{ji} - \min_i),\ \sum_{i=1}^{n} u_{yi}\,(d_{ji} - \min_i) \Big) \qquad (1)$$
where $p_j(x, y)$ is the normalized location of $D_j = (d_{j1}, d_{j2}, \ldots, d_{jn})$, and $d_{ji}$ is the value of the $j$th record of a data set on the $i$th coordinate $C_i$ in Star Coordinates space; $u_{xi}(d_{ji} - \min_i)$ and $u_{yi}(d_{ji} - \min_i)$ are the mappings of $d_{ji}$ onto the X and Y directions by the unit-vector components of the $i$th axis; $\min_i = \min(d_{ji}, 0 \leq j < m)$ and $\max_i = \max(d_{ji}, 0 \leq j < m)$ are the minimum and maximum values of the $i$th dimension respectively; and $m$ is the number of records in the data set.

In practice, mapping high dimensional data to 2D space inevitably introduces overlapping, ambiguities, and even bias. To mitigate the problem, Star Coordinates and its extension VISTA [ChL04] provide several visual adjustment mechanisms, such as axis scaling, rotation of axis angles, and filtering of data points, to vary the data distribution of a dataset in order to detect cluster characteristics and render clustering results effectively. We briefly introduce the two adjustment features relevant to this research below.

Axis scaling. The purpose of axis scaling in Star Coordinates (called α-adjustment in VISTA) is to interactively adjust the weight value of each axis so that users can observe the change in the data distribution dynamically. For example, the diagram in Figure 3 shows the original data distribution of Iris (Iris has 4 numeric attributes and 150 instances) with the cluster indices obtained by applying the K-means (here k=3) clustering algorithm in VISTA, where clusters overlap. A well-separated cluster distribution of Iris is illustrated in Figure 4 after a series of random α-adjustments, where clusters are much easier to recognize than in the original distribution of Figure 3.
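For reference, a minimal Python sketch of formula (1) follows. It assumes n evenly spaced axes starting at angle 0 and min-shifted (but not range-scaled) attribute values, exactly as the formula states; the function name star_coordinates and the NumPy implementation are our own illustrative choices.

```python
import numpy as np

# Minimal sketch of formula (1): axes at equal angles, each record mapped
# to the sum of its min-shifted attribute values along the axis directions.
def star_coordinates(data):
    data = np.asarray(data, dtype=float)
    m, n = data.shape
    angles = 2 * np.pi * np.arange(n) / n   # n equal sectors
    u_x, u_y = np.cos(angles), np.sin(angles)
    shifted = data - data.min(axis=0)       # (d_ji - min_i)
    return shifted @ u_x, shifted @ u_y     # x- and y-coordinates per record
```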

Figure 3. The initial data distribution of clusters of Iris produced by k-means in VISTA

Figure 4. The separated version of the Iris data distribution in VISTA

Footprint. To trace how data points change over a certain period of time, Star Coordinates provides the footprint function. We use another data set, auto-mpg, to demonstrate the footprint feature. The data set auto-mpg has 8 attributes and 398 items. Figure 5 shows the footprints of axis tuning of the attributes weight and mpg, where we may find some points with longer traces and some with shorter ones.

Figure 5. Footprints of axis scaling in Star Coordinates [Kan01]

The most prominent feature of Star Coordinates and its extensions, such as VISTA [ChL04] and HOV3 [ZOZ06], is that their computational complexity is only linear in time. This makes them very suitable to be employed as visual tools for interactive interpretation and exploration in cluster analysis. However, the exploration and refinement of clusters based on the user's intuition may be random and subjective in visual cluster analysis; as a result, the adjustments of Star Coordinates and VISTA can sometimes be arbitrary and time consuming. To overcome the arbitrary and random adjustments of Star Coordinates and VISTA, we have proposed a hypothesis-oriented visual approach called HOV3 to detect clusters [ZOZ06]. We present the detailed description of the HOV3 model in the next section.

4 HOV3 Model
Cluster exploration (qualitative analysis) is regarded as the pre-processing of cluster validation (quantitative analysis), and is mainly used for building user hypotheses/predictions based on the cluster exploration. This is not an aimless and/or arbitrary process. Having a precise overview of the data distribution in the early stages of data mining is important because, with correct insights into the data, data miners can make more informed decisions on adopting appropriate algorithms for the forthcoming analysis stages. To fill the gap between imprecise visual cluster analysis and unintuitive numerical cluster analysis, we have proposed a new approach, called HOV3, Hypothesis Oriented Verification and Validation by Visualization [ZOZ06].

4.1 The Basic Idea of HOV3

When we discuss the measurement of an object, we must first provide a coordinate system for the discussion. For example, without another contrasting object, the user cannot form any idea of how big the object is. Based on the same principle, the idea of HOV3 is concerned more with how to obtain cluster clues by contrasting a data set against quantified measurements than with the random adjustments of Star Coordinates and VISTA.

In analytic geometry, the relation between two vectors $A = (a_1, a_2, \ldots, a_n)$ and $B = (b_1, b_2, \ldots, b_n)$ can be represented by their inner (dot) product. We use the notation $\langle A, B \rangle$ for their inner product, given as:

$$\langle A, B \rangle = b_1 a_1 + b_2 a_2 + \cdots + b_n a_n = \sum_{k=1}^{n} b_k a_k \qquad (2)$$

Then we have the equation $\cos(\theta) = \frac{\langle A, B \rangle}{|A||B|}$, where $\theta$ is the angle between $A$ and $B$, and $|A| = \sqrt{a_1^2 + a_2^2 + \ldots + a_n^2}$ and $|B| = \sqrt{b_1^2 + b_2^2 + \ldots + b_n^2}$ are the lengths of $A$ and $B$ respectively.

Let $A$ be a unit vector; the geometry of $\langle A, B \rangle$ in Polar Coordinates presents the gap from point $B$ $(d_b, \theta)$ to point $A$, as shown in Figure 6, where $A$ and $B$ are in 8-dimensional space. In the same way, a matrix, i.e., a set of vectors (a dataset), can also be mapped against a measure vector $M$; as a result, the distribution of the matrix is projected based on the vector $M$. Let $D_j = (d_{j1}, d_{j2}, \ldots, d_{jn})$ and $M = (m_1, m_2, \ldots, m_n)$; then the inner product of each vector $D_j$ with $M$ has the same form as equation (2) and is written as:

$$\langle D_j, M \rangle = m_1 d_{j1} + m_2 d_{j2} + \cdots + m_n d_{jn} = \sum_{k=1}^{n} m_k d_{jk} \qquad (3)$$

Figure 6. Vector B projected against vector A in Polar Coordinates
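A tiny numeric sketch of equations (2) and (3) may help; the vectors below are arbitrary illustrative values, not data from the paper.

```python
import numpy as np

# Sketch of the idea behind equations (2) and (3): projecting a vector B
# (or every row of a data matrix) against a unit measure vector A.
A = np.array([1.0, 0, 0, 0, 0, 0, 0, 0])      # a unit vector (illustrative)
B = np.random.rand(8)                          # an 8-dimensional data point

dot = A @ B                                    # <A, B>, equation (2)
cos_theta = dot / (np.linalg.norm(A) * np.linalg.norm(B))
theta = np.arccos(cos_theta)                   # angle between A and B
# (dot, theta) give the polar-coordinate "gap" of B relative to A;
# applying the same product row-wise to a matrix yields equation (3).
```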
4.2 The Mathematical Description of HOV3

The Star Coordinates model can in fact be mathematically depicted by the Euler formula. According to the Euler formula, $e^{ix} = \cos x + i \sin x$, where $z = x + i y$ and $i$ is the imaginary unit. Let $z_0 = e^{2\pi i/n}$; then $z_0^1, z_0^2, z_0^3, \ldots, z_0^{n-1}, z_0^n$ (with $z_0^n = 1$) divide the unit circle on the complex 2D plane into $n$ equal sectors. Thus, the Star Coordinates model can be simply written as:

$$P_j(z_0) = \sum_{k=1}^{n} \Big[ \frac{d_{jk} - \min d_k}{\max d_k - \min d_k} \cdot z_0^k \Big] \qquad (4)$$

where $\min d_k$ and $\max d_k$ represent the minimal and maximal values of the $k$th coordinate respectively. Then, in any case, equation (4) can be viewed as a mapping from n-dimensional real space onto the 2D complex plane.

HOV3 uses a measure vector $M$ to represent the corresponding axes' weight values. Given a non-zero measure vector $M$ in $\mathbb{R}^n$ and a family of vectors $P_j$, the HOV3 model is presented as:

$$P_j(z_0) = \sum_{k=1}^{n} \Big[ \frac{d_{jk} - \min d_k}{\max d_k - \min d_k} \cdot z_0^k \cdot m_k \Big] \qquad (5)$$

where $m_k$ is the $k$th attribute of the measure $M$. Comparing the Star Coordinates model in equation (4) with the HOV3 model in equation (5), we may observe that the HOV3 model subsumes the Star Coordinates model. This is because any axis scaling or axis angle rotation in the Star Coordinates model or in VISTA can be viewed as changing one or more coefficient values $m_k$ ($k = 1, \ldots, n$) in equation (5). For example, either moving a coordinate axis to its opposite direction, or scaling the adjustment interval of an axis from [0, 1] up to [-1, 1] in VISTA, can be regarded as setting the measure value to the negative of its original value. As a special case, when all $m_k$ ($k = 1, \ldots, n$) in $M$ are set to 1, HOV3 is transformed into the Star Coordinates model (4), i.e., the no-measure case. Thus the HOV3 model provides users with a mechanism to quantify domain knowledge about a data set as a measure vector (prediction) for precisely investigating cluster clues.

Note that equation (5) is a standard form of a linear transformation of $n$ variables, where $m_k$ is the coefficient of the $k$th variable of $P_j$. In principle, any measure vector, even in complex number form, can be introduced into the linear transformation of HOV3 if it can distinguish groups in a data set or produce well-separated clusters visually. Thus the rich statistical methods reflecting the characteristics of a data set can also be introduced as predictions in the HOV3 projection, so that users may discover more clustering patterns. The detailed explanation of this approach is presented in the next section.
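The model translates almost directly into code. Below is a minimal sketch of equation (5) in its complex (Euler-formula) form; the function name hov3 and the NumPy implementation are our own. Note that allowing a complex-valued m_k rotates the kth axis as well as scaling it, which is one way to see how the measure vector subsumes axis-angle rotation.

```python
import numpy as np

# Sketch of equation (5) in its complex (Euler-formula) form. With all
# measures equal to 1 this is exactly the Star Coordinates model (4).
def hov3(data, M):
    data = np.asarray(data, dtype=float)
    m, n = data.shape
    z0 = np.exp(2j * np.pi / n)                 # z0 = e^(2*pi*i/n)
    zk = z0 ** np.arange(1, n + 1)              # z0^k for k = 1..n
    span = data.max(axis=0) - data.min(axis=0)
    span[span == 0] = 1.0                       # guard constant columns
    D = (data - data.min(axis=0)) / span        # min-max normalization
    return (D * np.asarray(M)) @ zk             # one complex point per row
```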

5 Predictive Visual Cluster Exploration by HOV3

Predictive exploration is a mathematical description of future behavior based on the historical exploration of patterns. The goal of predictive visual exploration by HOV3 is that, by applying a prediction (measure vector) to a dataset, the user may identify groups from the resulting visualization. Thus the key issue in applying HOV3 to detect grouping information is how to quantify historical patterns (or the user's domain knowledge) as a measure vector that achieves this goal.

5.1 Multiple HOV3 Projection (M-HOV3)

In practice, it is not easy to synthesize historical knowledge about a data set into one vector. So, rather than using a single measure to implement a prediction test, it is more suitable to apply several predictions (measure vectors) together to the data set. We call this process multiple HOV3 projection, M-HOV3 in short [ZOZ07b]. We now provide a detailed description of M-HOV3 and of its enhanced group separation feature. To simplify the discussion of the M-HOV3 model, we give two definitions first.

Definition 1 (HOV3 projection) A data projection from n-dimensional space to the 2D plane obtained by applying HOV3 to a data set $P$, as shown in equation (5), is denoted as $D_p = HC(P, M)$, where $P$ is an n-dimensional data set, $P = (p_1, p_2, \ldots, p_m)$, in which $p_j$ ($1 \leq j \leq m$) is an instance of $P$ and $m$ is the size of the data set; $M = (m_{1t}, m_{2t}, \ldots, m_{nt})$ is a non-zero measure vector, where $m_{kt}$ ($1 \leq k \leq n$) is the weight value of the $k$th coordinate at moment $t$ in the Star Coordinates plane; and $D_p$ is the geometrical distribution of $P$ in 2D space, $D_p = (p_1(x_1, y_1), p_2(x_2, y_2), \ldots, p_m(x_m, y_m))$, in which $p_j(x_j, y_j)$ is the location of $p_j$ in the X-Y coordinate plane.

Definition 2 (poly-multiply vectors to a matrix) The inner product of multiplying a series of non-zero measure vectors $M_1, M_2, \ldots, M_s$ to a matrix $A$ is denoted as $A * \prod_{i=1}^{s} M_i = A * M_1 * M_2 * \cdots * M_s$.

The projection of M-HOV3 is then denoted as $D_p = HC(P, \prod_{i=1}^{s} M_i)$. Based on equation (5), we formulate M-HOV3 as:

$$P_j(z_0) = \sum_{k=1}^{n} \Big[ \frac{d_{jk} - \min d_k}{\max d_k - \min d_k} \cdot z_0^k \cdot \prod_{i=1}^{s} m_{ik} \Big] \qquad (6)$$

where $m_{ik}$ is the $k$th attribute (dimension) of the $i$th measure vector $M_i$, and $s \geq 1$. When $s = 1$, formula (6) is transformed into equation (5). We may observe that the single multiplier $m_k$ of equation (5) is replaced by the poly-multiplication $\prod_{i=1}^{s} m_{ik}$ in formula (6). Formula (6) is more general and also closer to the real procedure of cluster detection, as it introduces several aspects of domain knowledge together into the cluster detection process. In addition, the effect of applying M-HOV3 to a dataset with the same measure vector can enhance the separation of grouped data points under certain conditions. We give the mathematical proof below.

5.2 The Enhanced Separation Feature of M-HOV3

To explain the geometrical meaning of the M-HOV3 projection, we use the real number system. According to equation (5), the general form of the distance $\delta$ (i.e., the weighted Minkowski distance) between two points $a$ and $b$ in the HOV3 plane can be represented as:

$$\delta(a, b, m) = \sqrt[q]{\sum_{k=1}^{n} |m_k(a_k - b_k)|^q} \qquad (q > 0) \qquad (7)$$

If $q = 1$, $\delta$ is the Manhattan (city block) distance; and if $q = 2$, $\delta$ is the Euclidean distance. To simplify the discussion of our idea, we adopt the Manhattan metric for the explanation. Note that there exists an equivalent mapping (bijection) of distance comparison between the Manhattan and Euclidean metrics [Fle77]. For example, if the distance between points $a$ and $b$ is longer than the distance between points $a'$ and $b'$ in the Manhattan metric, it is also true in the Euclidean metric, and vice versa. As shown in Figure 7, the orthogonal lines represent the Manhattan distance and the diagonal lines the Euclidean distance (red for $ab$ and blue for $a'b'$) respectively. The Manhattan distance between points $a$ and $b$ is then calculated as in formula (8):

$$\delta(a, b, m) = \sum_{k=1}^{n} |m_k(a_k - b_k)| \qquad (8)$$

According to formulas (6), (7) and (8), we can present the distance of M-HOV3 in the Manhattan metric as follows:

$$\delta\Big(a, b, \prod_{i=1}^{s} m_i\Big) = \sum_{k=1}^{n} \Big| \prod_{i=1}^{s} m_{ik}(a_k - b_k) \Big| \qquad (9)$$

Figure 7. The distance representation in Manhattan and Euclidean metrics

Definition 3 (the distance representation of M-HOV3) The distance between two data points $a$ and $b$ projected by M-HOV3 is denoted as $\prod_{i=1}^{s} M_i \delta_{ab}$. In particular, if the measure vectors in an M-HOV3 are the same, $\prod_{i=1}^{s} M_i \delta_{ab}$ can be simply written as $M^s\delta_{ab}$; if each attribute of $M$ is 1 (the no-measure case), the distance between points $a$ and $b$ is denoted as $\delta_{ab}$.

Thus, we have $\prod_{i=1}^{s} M_i \delta_{ab} = HC((a,b), \prod_{i=1}^{s} M_i)$. For example, the distance between two points $a$ and $b$ projected by M-HOV3 with the same two measures can be represented as $M^2\delta_{ab}$, and the projection of HOV3 of $a$ and $b$ can be written as $M\delta_{ab}$. We now give several important properties of M-HOV3.

Lemma 1 In the Star Coordinates space, if $\delta_{ab} \neq 0$ and $M \neq 0$ ($\forall m_k \in M: 0 < |m_k| < 1$), then $\delta_{ab} > M\delta_{ab}$.

Proof:

$$\delta_{ab} = \sum_{k=1}^{n} |a_k - b_k| \quad \text{and} \quad M\delta_{ab} = \sum_{k=1}^{n} |m_k(a_k - b_k)|$$

$$\delta_{ab} - M\delta_{ab} = \sum_{k=1}^{n} |a_k - b_k| - \sum_{k=1}^{n} |m_k||a_k - b_k| = \sum_{k=1}^{n} |a_k - b_k|(1 - |m_k|)$$

Since $0 < |m_k| < 1$ for $k = 1, \ldots, n$, each factor $(1 - |m_k|) > 0$; and since $\delta_{ab} \neq 0$, at least one $|a_k - b_k| > 0$. Hence $\delta_{ab} > M\delta_{ab}$. □

This result shows that the distance $M\delta_{ab}$ between points $a$ and $b$ projected by HOV3 with such a non-zero $M$ is less than the original distance $\delta_{ab}$ between $a$ and $b$.

Lemma 2 In the Star Coordinates space, if $\delta_{ab} \neq 0$ and $M \neq 0$ ($\forall m_k \in M: 0 < |m_k| < 1$), then $M^n\delta_{ab} > M^{n+1}\delta_{ab}$, $n \in \mathbb{N}$.

Proof: Let $\delta'_{ab} = M^n\delta_{ab}$. By Definition 2, $M^{n+1}\delta_{ab} = M\delta'_{ab}$, and by Lemma 1, $\delta'_{ab} > M\delta'_{ab}$. Hence $M^n\delta_{ab} > M^{n+1}\delta_{ab}$. □

Lemma 3 In the Star Coordinates space, if $\delta_{ab} \neq 0$ and $M \neq 0$ ($\forall m_k \in M: 0 < |m_k| < 1$), then $M^m\delta_{ab} > M^n\delta_{ab}$ for $m, n \in \mathbb{N}$ with $m < n$.

Proof: By Lemma 2 and the transitivity of inequality. □

Theorem 1 If the measure vector is changed from $M$ to $M'$ ($\forall m_k \in M: |m_k| \leq 1$; $\forall m'_k \in M': 0 < |m'_k| < 1$) and $|M\delta_{ab} - M\delta_{ac}| < |M'\delta_{ab} - M'\delta_{ac}|$, then

$$\frac{|M'\delta_{ab} - M'\delta_{ac}| - |M'^2\delta_{ab} - M'^2\delta_{ac}|}{|M'\delta_{ab} - M'\delta_{ac}|} > \frac{|M\delta_{ab} - M\delta_{ac}| - |M'\delta_{ab} - M'\delta_{ac}|}{|M\delta_{ab} - M\delta_{ac}|}$$

Proof:

$$M'\delta_{ab} = \sum_{k=1}^{n} |m'_k(a_k - b_k)| \quad \text{and} \quad M'\delta_{ac} = \sum_{k=1}^{n} |m'_k(a_k - c_k)|$$

so that

$$M'\delta_{ab} - M'\delta_{ac} = \sum_{k=1}^{n} |m'_k|\big[|a_k - b_k| - |a_k - c_k|\big], \qquad M'^2\delta_{ab} - M'^2\delta_{ac} = \sum_{k=1}^{n} |m'_k|^2\big[|a_k - b_k| - |a_k - c_k|\big]$$

Let $|a_k - b_k| = x_k$ and $|a_k - c_k| = y_k$; the two differences become $\sum_{k=1}^{n} |m'_k|(x_k - y_k)$ and $\sum_{k=1}^{n} |m'_k|^2(x_k - y_k)$. Writing $|M'\delta_{ab} - M'\delta_{ac}| = M'\delta_{xy}$ and $|M'^2\delta_{ab} - M'^2\delta_{ac}| = M'^2\delta_{xy}$, Lemma 2 gives $M'^2\delta_{xy} < M'\delta_{xy}$ (the assumption guarantees $M'\delta_{xy} \neq 0$), i.e.,

$$\frac{|M'^2\delta_{ab} - M'^2\delta_{ac}|}{|M'\delta_{ab} - M'\delta_{ac}|} < 1$$

Hence $|M'^2\delta_{ab} - M'^2\delta_{ac}| < |M'\delta_{ab} - M'\delta_{ac}|$, and combining this with the assumption $|M\delta_{ab} - M\delta_{ac}| < |M'\delta_{ab} - M'\delta_{ac}|$ yields

$$|M'^2\delta_{ab} - M'^2\delta_{ac}| \cdot |M\delta_{ab} - M\delta_{ac}| < |M'\delta_{ab} - M'\delta_{ac}|^2$$

that is,

$$\frac{|M'^2\delta_{ab} - M'^2\delta_{ac}|}{|M'\delta_{ab} - M'\delta_{ac}|} < \frac{|M'\delta_{ab} - M'\delta_{ac}|}{|M\delta_{ab} - M\delta_{ac}|}$$

Subtracting each side from 1 reverses the inequality:

$$1 - \frac{|M'^2\delta_{ab} - M'^2\delta_{ac}|}{|M'\delta_{ab} - M'\delta_{ac}|} > 1 - \frac{|M'\delta_{ab} - M'\delta_{ac}|}{|M\delta_{ab} - M\delta_{ac}|}$$

which is the claimed inequality. □
Theorem 1 shows that if the user observes that the difference between the distance of a and b and the distance of a and c increases by tuning the weight values of the axes from M to M′ (which can be observed by the footprints of points a, b and c, as shown in Figure 5), then after applying M-HOV3 to a, b and c, the distance variation rate between the pairs of points (a, b) and (a, c) is enhanced. In other words, if it is observed that several data point groups can be roughly separated visually by projecting a measure vector in HOV3 onto a data set (there may exist ambiguous points between groups), then applying M-HOV3 with the same measure vector to the data set would lead to the groups being more compact, i.e., to a good separation of the groups. This enhanced separation feature of M-HOV3 is significant for identifying the membership formation of clusters during the exploration of clusters and for verifying the validity of the clustering structure in unclustered subsets [ZOZ07b]. We present several examples below to demonstrate the efficiency and effectiveness of M-HOV3 in cluster analysis.

5.3 Predictive Cluster Exploration by M-HOV3

According to the notation of the HOV3 projection of a dataset P as Dp = HC(P, M), the M-HOV3 projection with the same measure applied n times is denoted as Dp = HC(P, Mⁿ), where n ∈ ℕ.

We use the auto-mpg dataset again as an example to demonstrate predictive cluster exploration by M-HOV3. Figure 8a illustrates the original data distribution of auto-mpg produced by HOV3 in MATLAB, where it is not possible to recognize any group information. We then tuned each axis manually and roughly distinguished three groups, as shown in Figure 8b. The weight values of the axes were recorded as a vector M = [0.10, 0, 0.25, 0.2, 0.8, 0.85, 0.1, 0.95].

Figure 8a. auto-mpg's original data distribution

Figure 8b. Dp1 = HC(auto-mpg, M)

Figure 8c. Dp2 = HC(auto-mpg, M²)

Figure 8d. The overlapping diagram of Dp1 and Dp2

Figure 8. Diagrams of data set auto-mpg projected by HOV3 in MATLAB

Figure 8b shows that there exist several ambiguous data points between the groups. We then employed M² (the poly-multiplied measure) as a predictive measure vector and applied it to the data set auto-mpg. The projected distribution Dp2 of auto-mpg is presented in Figure 8c. It is much easier to identify the 3 groups of auto-mpg in Figure 8c than in Figure 8b. To show the contrast between these two diagrams Dp1 and Dp2, we overlap them in Figure 8d. By analyzing the data of these 3 groups, we have found that group 1 contains 70 items with origin value 2 (Europe); group 2 has 79 instances with origin value 3 (Japan); and group 3 includes 249 records with origin value 1 (USA). This natural grouping based on the user's intuition serendipitously clustered the data set according to the origin attribute of auto-mpg. In the same way, the user may find more grouping information through the interactive cluster exploration process by applying predictive measurements.

5.4 Cluster Exploration by HOV3 with Statistical Measurements

Many statistical measurements, such as mean, median, standard deviation and others, can be directly introduced into HOV3 as predictions to explore data distributions. In fact, prediction based on statistical measurements makes cluster exploration more purposeful, and makes it easier to give a geometrical interpretation of the data distribution. We use the Iris dataset as an example to demonstrate cluster exploration with statistical measurements. As shown above in Figure 4, by random axis scaling the user can divide the Iris data into 3 groups. This example shows that cluster exploration based on random adjustments may expose data grouping information, but sometimes it is hard to find or interpret such groupings.

Let us now cluster Iris by HOV3 with a statistical measurement. First, we applied the K-means clustering algorithm with k=3 (three clusters) to Iris, and displayed the clustered Iris data in VISTA. Its original distribution is shown in Figure 3, where overlapping points can be observed. We then employed the standard deviation of Iris, M = [0.2302, 0.1806, 0.2982, 0.3172, 0.4089], as a prediction to project the clustered Iris data by HOV3 in VISTA. The result is shown in Figure 9, where 3 groups clearly exist. It can also be observed in Figure 9 that there is a blue point in the pink-colored cluster and a pink point in the green-colored cluster, resulting from the K-means clustering algorithm with k=3. Intuitively, they have been wrongly clustered. We re-clustered them by their distributions, as shown in Figure 10.

Figure 9 data distribution of clustered projected by HOV3 in VISTA marked by K-means

Figure 10. The data distribution of Iris projected by HOV3 in VISTA with the new cluster indices given by the user's intuition
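As a hypothetical illustration of the projection step above, using scikit-learn's copy of Iris (which carries only the four numeric attributes, so the measure vector has four entries rather than the five quoted above) and the hov3_project sketch from Section 5.3:

    from sklearn.datasets import load_iris
    import numpy as np

    iris = load_iris().data            # 150 x 4 numeric attribute matrix
    M = iris.std(axis=0)               # per-attribute standard deviation as the prediction
    p = hov3_project(iris, M)          # sketch from Section 5.3
    # A scatter plot of (p.real, p.imag) should then expose the three groups.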

The contrast between the clusters (Ck) produced by the K-means clustering algorithm and the new clustering result (CH) projected by HOV3 is summarized in Table 1. According to the variance comparison, the quality of the new clustering result on Iris is better than that obtained by K-means: each cluster projected by HOV3 has a higher internal similarity than its K-means counterpart. By analyzing the newly grouped data points of Iris, we found that they are distinguished by the class attribute of Iris, i.e., Iris-setosa, Iris-versicolor and Iris-virginica. Cluster 1 generated by K-means is an outlier group.
Table 1. The statistics of the clusters in Iris produced by K-means (k=3) and by HOV3 with predictive measures

  Ck  %       Radius  Variance  MaxDis
  1   1.333   1.653   2.338     3.306
  2   32.667  5.754   0.153     6.115
  3   33.333  8.196   0.215     8.717
  4   33.333  7.092   0.198     7.582

  CH  %       Radius  Variance  MaxDis
  1   33.333  5.753   0.152     6.113
  2   33.333  8.210   0.207     8.736
  3   33.333  7.112   0.180     7.517

With statistical predictions in HOV3, the user may even expose cluster clues that are not easily found by random adjustments. For example, we adopted the 8th row of auto-mpg's covariance matrix, [0.04698, -0.07657, -0.06580, 0.00187, -0.05598, 0.01343, 0.02202, 0.16102], as a predictive measure to project auto-mpg by HOV3 in MATLAB. The result is shown in Figure 11. We grouped the points by their distribution, as shown in Figure 12. Table 2 reports the statistics of the clusters (in the left part of the table, i.e., CH), and reveals that the points in each cluster have very high similarity.
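A hypothetical sketch of this covariance-based prediction, assuming the auto-mpg attributes are loaded as a NumPy matrix named mpg and reusing the hov3_project sketch from Section 5.3:

    import numpy as np

    M_cov = np.cov(mpg, rowvar=False)[7]   # 8th row of the covariance matrix ('origin')
    p = hov3_project(mpg, M_cov)           # project auto-mpg against the covariance row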

Figure 11. The data distribution of auto-mpg projected by HOV3 in MATLAB with the 8th row of auto-mpg's covariance matrix as the prediction

Figure 12. The clustered distribution of the data in Figure 11, grouped by the user's intuition

As we chose the 8th row of auto-mpg's covariance matrix as the prediction, the result mainly depends on the 8th attribute of auto-mpg, i.e., origin (country of origin). Figure 12 shows that C1, C2 and C3 are close to each other because they share the same origin value 1. The more detailed formation of the clusters is given in Table 2. We believe that a domain expert could give a better and more intuitive explanation of this clustering.
Table 2. The statistics of the clusters in auto-mpg produced by HOV3 with the covariance prediction of auto-mpg (CH) and by K-means with k=5 (CK)

  CH  origin  cylinders  %       Radius    Variance  MaxDis
  1   1       8          25.879  4129.492  0.130     4129.768
  2   1       6          18.583  3222.493  0.098     3222.720
  3   1       4          18.090  2441.881  0.090     2442.061
  4   2       4          17.588  2427.449  0.142     2427.632
  5   3       3          19.849  2225.465  0.093     2225.658

  CK  %       Radius    Variance  MaxDis
  1   0.503   681.231   963.406   1362.462
  2   18.090  2649.108  0.206     2649.414
  3   16.080  2492.388  0.139     2492.595
  4   21.608  3048.532  0.207     3048.897
  5   25.377  3873.052  0.220     3873.670
  6   18.593  2417.804  0.148     2417.990

We then chose the cluster number k=5 to cluster auto-mpg by K-means. The clustering result is presented in the right part of Table 2 (CK). Comparing the two clustering results by the variance of their clusters, we can see that the quality of the clustering result by HOV3 with the covariance prediction of auto-mpg is better than that produced by K-means (with k=5; cluster 1 in CK is an outlier group).

5.5 Cluster Exploration by HOV3 with Complex Linear Transformation


In principle, any linear transformation can be employed in HOV3 if it separates clusters well. We therefore introduce complex linear transformations into this process, again using the auto-mpg data set as an example. As shown in Figure 8b, three roughly separated clusters appear there, where the vector M = [0.10, 0, 0.25, 0.2, 0.8, 0.85, 0.1, 0.95] was obtained from the axis values. We then adopt cos(M*10i) as a prediction, where i is the imaginary unit. The projection of HOV3 with cos(M*10i) is illustrated in Figure 13, where the three clusters are separated very well. In the same way, many other transformations can be applied to different datasets to obtain well-separated clusters. Clearly grouped objects, or fully separated clusters, markedly improve the efficiency of identifying cluster formation in visual cluster analysis.
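This prediction can be formed directly in NumPy, whose cosine accepts complex arguments; since cos(ix) = cosh(x), the resulting weights are real-valued and grow quickly with M, which is what stretches the groups apart. The sketch below assumes the auto-mpg matrix mpg and the hov3_project sketch from Section 5.3:

    import numpy as np

    M = np.array([0.10, 0, 0.25, 0.2, 0.8, 0.85, 0.1, 0.95])
    M_c = np.cos(M * 10j)           # cos(M*10i) == cosh(10*M), real-valued weights
    p = hov3_project(mpg, M_c)      # well-separated clusters as in Figure 13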

Figure 13. The data distribution of auto-mpg projected by HOV3 with cos(M*10i ) as the prediction

6 External Cluster Validation by HOV3


In practice, with extremely large datasets it is infeasible to cluster an entire data set within an acceptable time scale. A common solution in data mining is to first apply a clustering algorithm to a training (sampling) subset of the data to extract cluster patterns, and then assess whether the resulting cluster scheme is suitable for the other subsets in the database. This procedure is regarded as external cluster validation [VSA05]. Due to the high computational cost of statistical methods for assessing the consistency of cluster structures between large subsets, achieving this goal by statistical methods is still a challenge in data mining. Based on the assumption that if two same-sized data sets have a similar cluster structure, then applying the same linear transformation to both sets yields distributions whose similarity remains high, we proposed a distribution-matching based external cluster validation by HOV3 [ZOZ07a]. A detailed explanation of this approach follows.

6.1 Definitions
For a precise explanation of our approach, we first give some formal definitions.

Definition 4 (cluster) Let P be a database of data points. A cluster C := (D, L) is a non-empty point set D with a label set L, and the i-th cluster is Ci = {p in D, l in L | for all Ci.p: Ci.l = i and i > 0}, where l is the cluster label of p, l in {-1, 0, 1, ..., k}, and k is the number of clusters. As special cases, an outlier point is an element of P with cluster label -1, and a non-clustered element of P has cluster label 0, i.e., it has not been clustered.

Definition 5 (spy subset) A spy subset Ps is a clustered subset of P produced by a clustering algorithm, Ps = {C1, C2, ..., Ck, CE}, where Ci (1 <= i <= k) is a cluster in Ps and CE is the outlier set of Ps. A spy subset is used as a visual model to verify the cluster structure of the other partitions of the database P.

Definition 6 (target subset) A subset Pt is a target subset of P with respect to a spy subset Ps iff Pt = {Pt.p in P, Pt.l in L | for all Pt.p: Pt.l = 0 and |Ps| = |Pt|}. A target subset Pt is thus a non-clustered subset of P with the same size as the spy subset Ps. It is used as a target for investigating the similarity of its cluster structure to that of the spy subset Ps.

Definition 7 (overlapping point) A non-clustered point po is called an overlapping point of a cluster Ci, denoted as Ci ~ po, iff there exists p in Ci, po not in Ci, such that |po - p| <= e, where e is a threshold distance given by the user.

Definition 8 (quasi-cluster) The set of all overlapping points of a cluster Ci composes the quasi-cluster of Ci, denoted as Cqi, i.e., Cqi = {po | Ci ~ po}.

Definition 9 (well-separated cluster) A cluster Ci is called a visually well-separated cluster when, for all clusters Ci, Cj in Ps with i != j, there is no point p in Ci such that Cj ~ p. A well-separated cluster Ci in the spy subset implies that no points of Ci are within the threshold distance e of any other cluster in the spy subset.

Based on the above definitions, we present the application of our approach to external cluster validation based on distribution matching by HOV3 as follows.

6.2 The Processing of the Approach
The stages in the application of our approach are summarized in the following 5 steps:

1. Clustering: First, the user applies a clustering algorithm to a randomly selected subset Ps of the given dataset P.

2. Cluster Separation: The clustering result of Ps is introduced and visualized in the HOV3 system. The user then manually tunes the weight value of each axis, or applies the other cluster separation methods of HOV3 described above, such as M-HOV3, statistical measurements or complex linear transformations, to separate overlapping clusters. If one or more clusters are visually separated from the others, the weight values of the axes are recorded as a measure vector M.

3. Data Projection by HOV3: The user samples another subset from P with the same number of points as Ps, as a target subset Pt. The clustered subset Ps (now acting as a spy subset) and its target Pt are projected together by HOV3 with the vector M to detect the distribution consistency between Ps and Pt.

4. The Generation of Quasi-Clusters: The user gives a threshold distance e, and then, according to Definitions 7 and 8, the quasi-cluster Cqi of each separated cluster Ci is computed. Cqi is then removed from Pt and Ci is removed from Ps. If Ps still contains clusters, we go back to step 2; otherwise we proceed to the next step.

5. The Interpretation of Results: The overlapping rate of each cluster-and-quasi-cluster pair is calculated as r(Cqi, Ci) = |Cqi| / |Ci|. If the overlapping rate approaches 1, cluster Ci and its quasi-cluster Cqi have high similarity, since the size ratio of the spy subset to the target subset is 1:1. The overlapping analysis is thus simply transformed into a linear regression analysis, i.e., of the points around the line C = Cq.

Corresponding to the procedure described above, we give the algorithm of external cluster validation based on distribution matching by HOV3 in Figure 14.

Figure 14. The algorithm of external cluster validation based on distribution matching by HOV3

In Figure 14, the procedure clusterSeparate responds to the axis tuning by the user to separate the clusters in the spy subset and to gather the weight values of the axes as a measure vector; the procedure quasiClusterGeneration produces the quasi-clusters in the target subset corresponding to the clusters in the spy subset.
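As an indication of what quasiClusterGeneration does, here is a minimal sketch under our own naming, following Definitions 7 and 8: a target point joins the quasi-cluster of Ci if its projected position lies within the threshold distance of at least one projected point of Ci.

    import numpy as np

    def quasi_cluster(cluster_pts, target_pts, eps):
        """Indices of target points that are overlapping points of the
        cluster (Definition 7): within distance eps of at least one cluster
        point. Both inputs are 1-D complex arrays of HOV3 positions."""
        dist = np.abs(target_pts[:, None] - cluster_pts[None, :])
        return np.flatnonzero((dist <= eps).any(axis=1))

    def overlap_rate(quasi_idx, cluster_pts):
        """Step 5: the overlapping rate |Cqi| / |Ci|; values near 1 indicate
        that the spy and target subsets share the cluster structure."""
        return len(quasi_idx) / len(cluster_pts)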

6.3 The Model of Distribution-Matching Based External Cluster Validation by HOV3


In contrast to the statistics-based external cluster validation model illustrated in Figure 1, we show our model of external cluster validation by visualization in Figure 15.

Figure 15. External cluster validation by HOV3

Comparing these two models, we may observe that instead of using a clustering algorithm to cluster another sampled data set, our model uses a clustered subset of a database as a visual model to verify the similarity of cluster structures between that model and the other, non-clustered subsets of the database. To handle scalability in resampling, we choose non-clustered observations of the same size as the clustered subset and project the two together by HOV3. As a consequence, the user can utilize the well-separated clusters produced by scaling axes in HOV3 as a model to pick out their corresponding quasi-clusters, i.e., the points that overlap the clusters. Also, instead of using statistical methods to assess the similarity between the two subsets, we simply compute the overlapping rate between the clusters and their quasi-clusters to show their consistency. Compared with statistics-based validation methods, our method is not only visually intuitive but also more effective in real applications [ZOZ07a]. Obviously, obtaining well-separated clusters plays a very important role in the procedure of external cluster validation by HOV3, and separating clusters from a mass of overlapping points is also an aim of this research. Thus the approaches mentioned above, such as M-HOV3 [ZOZ07b], HOV3 with statistical measurements, and complex linear transformations, can be introduced into this process.

6.4 External Cluster Validation with M-HOV3


Manually separating clusters from a mass of overlapping points is often time consuming. We claim that the enhanced separation feature of M-HOV3 provides improvements not only in efficiency but also in accuracy when dealing with external cluster validation [ZOZ07b], because the combination of zooming and M-HOV3 with the same threshold distance improves the precision of quasi-cluster data point selection. According to formula (5), zooming in HOV3 can be understood as projecting a data set with a vector whose attribute values are all the same, i.e., each mk > 1 in equation (5) has the same value. Note that applying M-HOV3 would normally shrink the size of the patterns in HOV3. Technically, we therefore choose min(mk)^-1 as the zooming value, where min(mk) is the minimal non-zero value among the mk. Thus, for a fixed distance between the closest data points, the scale of the patterns in HOV3 is amplified by applying the combination of M-HOV3 and zooming. This combination is formalized in equation (10).

$$P_j(z_0) \;=\; \sum_{k=1}^{n}\left[\frac{d_{jk}-\min d_k}{\max d_k-\min d_k}\right]\cdot z_0^{\,k}\cdot \prod_{i=1}^{s}\big(m_{ik}\cdot \min(m_k)^{-1}\big) \qquad (10)$$

We have presented examples of how to gain cluster clues by applying the HOV3 projection to databases in the previous sections. In the next section, we demonstrate the effectiveness of external cluster validation by HOV3 through several examples.

7 Examples and Explanation


In this section, we present several examples to demonstrate the advantages of cluster exploration and external cluster validation by HOV3. We have implemented our approach in MATLAB running under Windows 2000 Professional. The datasets used in the examples were obtained from the UCI machine learning repository: http://www.ics.uci.edu/~mlearn/Machine-Learning.html.

7.1 M-HOV3
Choosing the appropriate cluster number for an unknown data set is meaningful in the pre-clustering stage, and the enhanced separation feature of M-HOV3 is advantageous for identifying the cluster number in this stage. We demonstrate this advantage with the following example. We use the Boston Housing data set (simply written as Housing) as an example. The Housing set has 14 attributes and 506 instances. The original data distribution of Housing is given in Figure 16a. As in the previous example, based on observation and axis scaling we obtained a roughly separated data distribution of Housing, as demonstrated in Figure 16b; we fixed the weight values of the axes as M = [0.5, 1, 0, 0.95, 0.3, 1, 0.5, 0.5, 0.8, 0.75, 0.25, 0.55, 0.45, 0.75]. Comparing Figure 16a and Figure 16b, we can see that the data points in Figure 16b are constricted into 3 (or possibly 4) groups. M-HOV3 was then applied to the Housing data set: Figure 16c and Figure 16d show the results of M-HOV3 with M.*M and M.*M.*M respectively. It is much easier to gain grouping insight from Figure 16c and Figure 16d, where the group members can be identified conveniently.

Figure 16a. The original data distribution of Housing

Figure 16b. p1 = HOV3(Housing, M)

Figure 16c. p2 = HOV3(Housing, M.*M)

Figure 16d. p3 = HOV3(Housing, M.*M.*M)

We believe that with domain experts involved in the process, the M-HOV3 approach can perform even better in real world applications. Now we demonstrate the improvement in precision obtained by applying M-HOV3 with zooming to gather cluster members. We still use the above example: the minimal non-zero value of the measure vector M is 0.25, so we use V = [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4] (4 = 1/0.25) as the zooming vector. We contrast the application of M-HOV3 with and without zooming below.
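In terms of the hov3_project sketch from Section 5.3, the combination simply multiplies the powered measure vector by the scalar zoom factor, which changes the scale of the pattern but not its shape; the Housing matrix is assumed loaded as housing:

    import numpy as np

    M = np.array([0.5, 1, 0, 0.95, 0.3, 1, 0.5, 0.5,
                  0.8, 0.75, 0.25, 0.55, 0.45, 0.75])
    zoom = 1 / M[M > 0].min()                     # non-zero minimum is 0.25, so zoom = 4
    p_small = hov3_project(housing, M**2)         # M-HOV3 alone (Figure 17)
    p_big = hov3_project(housing, M**2 * zoom)    # M-HOV3 plus zooming (Figure 18)
    # p_big == zoom * p_small up to floating point: the same shape, enlarged scale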

Figure 17. p2 = HOV3(Housing, M.*M)

Figure 18. p2 = HOV3(Housing, M.*M.*V)

It can be observed that the shape of the patterns in Figure 17 is exactly the same as that in Figure 18, while the scale in Figure 18 is enlarged. Thus the combination of M-HOV3 and zooming improves the accuracy of data selection in external cluster validation by HOV3.

7.2 External Cluster Validation by HOV3

The Shuttle data set has 9 attributes and 43,500 instances. We chose the first 5,000 instances of Shuttle as a sampling subset and applied the K-means algorithm [McQ67] to it. We then utilized the clustered result as a spy subset, assuming that we had found the optimal cluster number k=5 for the sampling data. The original data distributions without and with cluster indices are illustrated in the diagrams of Figure 19 and Figure 20 respectively. It can be seen that clusters overlap in Figure 20.
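A hypothetical version of this sampling-and-clustering step with scikit-learn, assuming the 43,500 x 9 Shuttle attribute matrix is loaded as shuttle:

    from sklearn.cluster import KMeans

    spy = shuttle[:5000]                            # first 5,000 instances as the sample
    labels = KMeans(n_clusters=5, n_init=10).fit_predict(spy)
    # spy together with its labels now serves as the clustered spy subset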

Figure 19. The original data distribution of the first 5,000 data points of Shuttle in MATLAB by HOV3 (without cluster indices).

Figure 20. The original data distribution of the first 5,000 data points of Shuttle in MATLAB by HOV3 (with cluster indices)

To obtain well-separated clusters, we tuned the weight of each coordinate and arrived at a satisfactory version of the data distribution, shown in Figure 21. The weight values of the axes were recorded as the measure vector [0.80, 0.55, 0.85, 0.0, 0.40, 0.95, 0.20, 0.05, 0.459] in this case. We then chose the second 5,000 instances of Shuttle as a target subset and projected the target subset and the spy subset together against the measure vector by HOV3. Their distributions are presented in Figure 22, where we may observe that the two data distributions match very well. We chose the points in the enclosed area in Figure 22 as a cluster and then obtained the quasi-cluster in the target subset corresponding to the cluster in the enclosed area. In the same way, we can find the other quasi-clusters in the target subset.

Figure 21. A well-separated version of the spy subset distribution of Shuttle

Figure 22. The projection of the spy subset and a target subset of Shuttle by applying a measure vector.

We performed the same experiment on 4 target subsets of Shuttle. The size of each quasi-cluster and of its corresponding cluster is listed in Table 3, and the curves of their linear regression against the line C = Cq are illustrated in Figure 23.
Table 3. Cluster/quasi-cluster pairs and their overlapping rates

  Subset   Cq1/C1           Cq2/C2           Cq3/C3           Cq4/C4             Cq5/C5
  Spy      318              773              513              2254               1142
  Target1  278/318=0.8742   670/773=0.8668   503/513=0.9805   2459/2254=1.0909   1123/1142=0.9834
  Target2  279/318=0.8773   897/773=1.1604   626/513=1.2203   2048/2254=0.9086   1602/1142=1.4028
  Target3  280/318=0.8805   875/773=1.1320   481/513=0.9376   2093/2254=0.9286   1455/1142=1.2741
  Target4  261/318=0.8208   713/773=0.9224   368/513=0.7173   2416/2254=1.0719   1169/1142=1.0264

(*At the current stage we collect the quasi-clusters manually, so the Cqi may contain redundant or misassigned points.)

It can be observed that the curves match the line C = Cq well, i.e., the overlapping rates between the clusters and their quasi-clusters are high. The standard deviation is a good way to reflect the difference between two vectors, so we calculated the standard deviation of the Cqi/Ci overlapping rates of each targetk (k = 1, ..., 4) against the spy subset. They are 0.0826, 0.1975, 0.1491 and 0.1304 respectively. This means that the similarity of the cluster structures in the spy and target subsets is high. In summary, the experiments show that the cluster structure found in the spy subset of Shuttle also exists in the target subsets of Shuttle.
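As a quick check, these deviations can be reproduced directly from the ratios in Table 3; for Target1, for example:

    import numpy as np

    # Overlap rates |Cqi| / |Ci| for Target1, taken from Table 3
    rates = np.array([278/318, 670/773, 503/513, 2459/2254, 1123/1142])
    print(round(np.std(rates), 4))   # 0.0826, matching the value reported above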

Figure 23. The curves of linear regression to the line C=Cq.

In these experiments, we also measured the timing of both clustering and projection in MATLAB. The results are listed in Table 4.
Table 4. Timing of clustering and projecting

  Clustering by K-means (k=5)            Projecting by HOV3
  Subset    Amount  Time (s)             Subset       Size    Time (s)
  Target 1  5,000   0.532                Spy+Target1  10,000  0.110
  Target 2  5,000   0.610                Spy+Target2  10,000  0.109
  Target 3  5,000   0.656                Spy+Target3  10,000  0.110
  Target 4  5,000   0.453                Spy+Target4  10,000  0.109

Based on these measurements, it can be observed that projection by HOV3 is much faster than the clustering process of the K-means algorithm. This is particularly effective for verifying clustering results within extremely large databases. Although the cluster separation in our approach may take some time, once well-separated clusters are found, using a measure vector to project a huge data set is far more efficient than re-applying a clustering algorithm to the data set.

8 Concluding Remarks
In this paper we have proposed a novel approach called HOV3, Hypothesis Oriented Verification and Validation by Visualization, to assist data miners in the cluster analysis of high-dimensional datasets by visualization. The HOV3 visualization technique employs hypothesis-oriented measures to project data, and allows users to iteratively adjust the measures to optimize the resulting clusters. This approach gives data miners the opportunity to introduce quantified domain knowledge as predictions into the cluster discovery process for revealing the gaps between the data distribution and the predictions. HOV3 is thus a more purposeful visual method for investigating clusters in high-dimensional databases.

Based on the projection technique of HOV3, we have also introduced a visual approach called M-HOV3 to enhance the visual separation of clusters. The visual separability of clusters is significant for cluster analysis: a good visual separation of clusters is beneficial both for revealing the membership formation of clusters and for verifying the validity of clustering results. With M-HOV3, users can explore cluster distributions intuitively and deal with cluster validation effectively by matching the geometrical distributions of clustered and non-clustered subsets produced by M-HOV3.

Based on the capability of HOV3 to use quantified domain knowledge about datasets as predictions/measurements, we have also addressed visual external cluster validation supported by the projection mechanism of HOV3. This approach rests on the assumption that if data sets share the same cluster structure, then projecting them with the same measure should yield highly similar data distributions. By comparing the data distributions of a clustered subset and non-clustered subsets projected by HOV3 with tunable measures, users can perform an intuitive visual evaluation, and can also obtain a precise evaluation of the consistency of the cluster structure by performing geometrical computations on the distributions. Comparing our approach with existing visual methods, we have observed that it is not only efficient in performance but also effective in real applications.

Experiments show that the HOV3 technique can improve the effectiveness of cluster analysis by visualization and provide a better, intuitive understanding of the results. HOV3 can be seen as a bridge between qualitative and quantitative analysis: it supports the verification and validation of quantified domain knowledge, and it can directly utilize rich statistical analysis tools as measures, giving data miners efficient and effective guidance towards more precise cluster information in data mining. As a result, with the advantage of the quantified measurement feature of HOV3, data miners can identify the cluster number efficiently in the pre-processing stage of clustering, and verify the membership of data points among the clusters effectively in the post-processing stage of clustering in data mining.

References
[AAP+03] A. L. Abul, R. Alhajj, F. Polat and K. Barker, "Cluster Validity Analysis Using Subsampling," Proc. of the IEEE International Conference on Systems, Man, and Cybernetics, Washington DC, Oct. 2003, Volume 2, pp. 1435-1440.
[ABK+99] M. Ankerst, M. M. Breunig, H.-P. Kriegel and J. Sander, "OPTICS: Ordering Points to Identify the Clustering Structure," Proc. of the ACM SIGMOD Conference, 1999, pp. 49-60.
[AlC91] B. Alpern and L. Carter, "Hyperbox," Proc. of Visualization '91, San Diego, CA, 1991, pp. 133-139.
[Ber06] P. Berkhin, "A Survey of Clustering Data Mining Techniques," in J. Kogan, C. Nicholas and M. Teboulle (eds.), Grouping Multidimensional Data, Springer, 2006, pp. 25-72.
[BPR+04] C. Baumgartner, C. Plant, K. Railing, H.-P. Kriegel and P. Kroger, "Subspace Selection for Clustering High-Dimensional Data," Proc. of the Fourth IEEE International Conference on Data Mining (ICDM'04), 2004, pp. 11-18.
[CBC+95] D. R. Cook, A. Buja, J. Cabrea and H. Hurley, "Grand Tour and Projection Pursuit," Journal of Computational and Graphical Statistics, Volume 23, 1995.
[Che73] H. Chernoff, "The Use of Faces to Represent Points in k-Dimensional Space Graphically," Journal of the American Statistical Association, Volume 68, 1973, pp. 361-368.
[ChL04] K. Chen and L. Liu, "VISTA: Validating and Refining Clusters via Visualization," Information Visualization, Volume 3(4), 2004, pp. 257-270.
[Cle93] W. S. Cleveland, Visualizing Data, AT&T Bell Laboratories, Murray Hill, NJ, Hobart Press, Summit, NJ, 1993.
[Cli00] E. Clifford, Data Analysis by Resampling: Concepts and Applications, Duxbury Press, 2000.
[EKS+96] M. Ester, H.-P. Kriegel, J. Sander and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," Proc. of the Second International Conference on Knowledge Discovery and Data Mining, 1996.
[FaL95] C. Faloutsos and K. Lin, "FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets," Proc. of ACM SIGMOD, 1995, pp. 163-174.
[Fie79] S. E. Fienberg, "Graphical Methods in Statistics," The American Statistician, Volume 33, 1979, pp. 165-178.
[GRS98] S. Guha, R. Rastogi and K. Shim, "CURE: An Efficient Clustering Algorithm for Large Databases," Proc. of the ACM SIGMOD Conference, 1998.
[HaK01] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
[HBV01] M. Halkidi, Y. Batistakis and M. Vazirgiannis, "On Clustering Validation Techniques," Journal of Intelligent Information Systems, Volume 17(2/3), 2001, pp. 107-145.
[HBV02] M. Halkidi, Y. Batistakis and M. Vazirgiannis, "Cluster Validity Methods: Part I and II," SIGMOD Record, Volume 31, 2002.
[HKW99] A. Hinneburg, D. A. Keim and M. Wawryniuk, "HD-Eye: Visual Mining of High-Dimensional Data," IEEE Computer Graphics & Applications, 19(5), 1999, pp. 22-31.
[HCN01] Z. Huang, D. W. Cheung and M. K. Ng, "An Empirical Study on the Visual Cluster Validation Method with FastMap," Proc. of DASFAA'01, Hong Kong, April 2001, pp. 84-91.
[HKK05] J. Handl, J. Knowles and D. B. Kell, "Computational Cluster Validation in Post-Genomic Data Analysis," Bioinformatics, Volume 21(15), 2005, pp. 3201-3212.
[HuL00] Z. Huang and T. Lin, "A Visual Method of Cluster Validation with FastMap," Proc. of PAKDD-2000, 2000, pp. 153-164.
[Ins97] A. Inselberg, "Multidimensional Detective," Proc. of IEEE Information Visualization '97, 1997, pp. 100-107.
[Jac08] S. Jaccard, "Nouvelles recherches sur la distribution florale," Bulletin de la Societe Vaudoise des Sciences Naturelles, 44, 1908, pp. 223-270.
[JaD88] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
[JMF99] A. Jain, M. N. Murty and P. J. Flynn, "Data Clustering: A Review," ACM Computing Surveys, Volume 31(3), 1999, pp. 264-323.
[Kan01] E. Kandogan, "Visualizing Multi-Dimensional Clusters, Trends, and Outliers Using Star Coordinates," Proc. of the ACM SIGKDD Conference, 2001, pp. 107-116.
[KeK94] D. A. Keim and H.-P. Kriegel, "VisDB: Database Exploration Using Multidimensional Visualization," IEEE Computer Graphics & Applications, 1994, pp. 40-49.
[Koh97] T. Kohonen, Self-Organizing Maps, Springer, Berlin, second extended edition, 1997.
[KSP73] S. Kaski, J. Sinkkonen and J. Peltonen, "Data Visualization and Analysis with Self-Organizing Maps in Learning Metrics," DaWaK 2001, LNCS Volume 2114, 2001, pp. 162-173.
[McQ67] J. McQueen, "Some Methods for Classification and Analysis of Multivariate Observations," Proc. of the 5th Berkeley Symposium on Mathematics, Statistics and Probability, Volume 1, 1967, pp. 281-298.
[Mil81] G. W. Milligan, "A Review of Monte Carlo Tests of Cluster Analysis," Multivariate Behavioral Research, Volume 16(3), 1981, pp. 379-407.
[MSS83] G. W. Milligan, L. M. Sokol and S. C. Soon, "The Effect of Cluster Size, Dimensionality and the Number of Clusters on Recovery of True Cluster Structure," IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(1), 1983, pp. 40-47.
[OlL03] F. Oliveira and H. Levkowitz, "From Visual Data Exploration to Visual Data Mining: A Survey," IEEE Transactions on Visualization and Computer Graphics, Volume 9(3), 2003, pp. 378-394.
[PGW03] E. Pampalk, W. Goebl and G. Widmer, "Visualizing Changes in the Structure of Data for Exploratory Feature Selection," SIGKDD '03, August 24-27, 2003, Washington, DC, USA.
[PiC70] R. M. Pickett, "Visual Analyses of Texture in the Detection and Recognition of Objects," in B. S. Lipkin and A. Rosenfeld (eds.), Picture Processing and Psycho-Pictorics, Academic Press, New York, 1970, pp. 289-308.
[Ran71] W. M. Rand, "Objective Criteria for the Evaluation of Clustering Methods," Journal of the American Statistical Association, 66, 1971, pp. 846-850.
[SeS05] J. Seo and B. Shneiderman, in From Integrated Publication and Information Systems to Virtual Information and Knowledge Environments: Essays Dedicated to Erich J. Neuhold on the Occasion of His 65th Birthday, LNCS Volume 3379, Springer, 2005.
[Shn01] B. Shneiderman, "Inventing Discovery Tools: Combining Information Visualization with Data Mining," Proc. of Discovery Science 2001, LNCS Volume 2226, 2001, pp. 17-28.
[SCZ98] G. Sheikholeslami, S. Chatterjee and A. Zhang, "WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases," Proc. of the Very Large Databases Conference (VLDB), 1998.
[ThK99] S. Theodoridis and K. Koutroubas, Pattern Recognition, Academic Press, 1999.
[VSA05] R. Vilalta, T. Stepinski and M. Achari, "An Efficient Approach to External Cluster Assessment with an Application to Martian Topography," Technical Report No. UH-CS-05-08, Department of Computer Science, University of Houston, 2005.
[Wei98] S. M. Weiss and N. Indurkhya, Predictive Data Mining: A Practical Guide, Morgan Kaufmann Publishers, 1998.
[WoB94] P. C. Wong and R. D. Bergeron, "30 Years of Multidimensional Multivariate Visualization," in Scientific Visualization: Overviews, Methodologies, and Techniques, IEEE Computer Society, 1994, pp. 3-33.
[ZOZ06] K-B. Zhang, M. A. Orgun and K. Zhang, "HOV3: An Approach for Cluster Analysis," Proc. of ADMA 2006, Xi'an, China, LNCS Volume 4093, 2006, pp. 317-328.
[ZOZ07a] K-B. Zhang, M. A. Orgun and K. Zhang, "A Visual Approach for External Cluster Validation," Proc. of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007), Honolulu, Hawaii, USA, April 1-5, 2007, IEEE Press, pp. 576-582.
[ZOZ07b] K-B. Zhang, M. A. Orgun and K. Zhang, "Enhanced Visual Separation of Clusters by M-mapping to Facilitate Cluster Analysis," Proc. of the 9th International Conference on Visual Information Systems (VISUAL 2007), June 28-29, 2007, Shanghai, China (to appear).
[ZRL96] T. Zhang, R. Ramakrishnan and M. Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases," Proc. of SIGMOD '96, Montreal, Canada, 1996, pp. 103-114.
