7.3. SCALABLE DATA CLUSTERING
|
213
|
of k-medoids methods. The BIRCH algorithm is much faster because it is based on the k-means methodology, and its hierarchical clustering structure can be tightly controlled because of its top-down partitioning approach. This can be useful for indexing applications. On the other hand, BIRCH is not designed for arbitrary data types or clusters of arbi-trary shape. The CURE algorithm can determine clusters of arbitrary shape because of its bottom-up hierarchical approach. The choice of the most suitable algorithm depends on the application at hand. This section will provide an overview of these different methods.
7.3.1 CLARANS
The CLARA and CLARANS methods are two generalizations of the k-medoids approach to clustering. Readers are referred to Sect. 6.3.4 of the previous chapter for a description of the generic k-medoids approach. Recall that the k-medoids approach works with a set of representatives, and iteratively exchanges one of the medoids with a non-medoid in each iteration to improve the clustering quality. The generic k -medoids algorithm allows consid-erable flexibility in deciding how this exchange might be executed.
The C lustering LARge Applications (CLARA) method is based on a particular instan-tiation of the k-medoids method known as Partitioning Around Medoids (PAM). In this method, to exchange a medoid with another non-medoid representative, all possible k·(n−k) pairs are tried for a possible exchange to improve the clustering objective function. The best improvement of these pairs is selected for an exchange. This exchange is performed until the algorithm converges to a locally optimal value. The exchange process requires O(k · n2) dis-tance computations. Therefore, each iteration requires O(k · n2 · d) time for a d-dimensional data set, which can be rather expensive. Because the complexity is largely dependent on the number of data points, it can be reduced by applying the algorithm to a smaller sample. Therefore, the CLARA approach applies PAM to a smaller sampled data set of size f · n to discover the medoids. The value of f is a sampling fraction, which is much smaller than 1. The remaining nonsampled data points are assigned to the optimal medoids discovered by applying PAM to the smaller sample. This overall approach is applied repeatedly over inde-pendently chosen samples of data points of the same size f ·n. The best clustering over these independently chosen samples is selected as the optimal solution. Because the complexity of each iteration is O(k · f 2 · n2 · d + k · (n − k)), the approach may be orders of magnitude faster for small values of the sampling fraction f . The main problem with CLARA occurs when each of the preselected samples does not include good choices of medoids.
The Clustering L arge Applications based on RANdomized Search (CLARANS) approach works with the full data set for the clustering in order to avoid the problem with prese-lected samples. The approach iteratively attempts exchanges between random medoids with random non-medoids. After a randomly chosen non-medoid is tried for an exchange with a randomly chosen medoid, it is checked if the quality improves. If the quality does improve, then this exchange is made final. Otherwise, the number of unsuccessful exchange attempts is counted. A local optimal solution is said to have been found when a user-specified number of unsuccessful attempts MaxAttempt have been reached. This entire process of finding the local optimum is repeated for a user-specified number of iterations, denoted by MaxLocal. The clustering objective function of each of these MaxLocal locally optimal solutions is eval-uated. The best among these local optima is selected as the optimal solution. The advantage of CLARANS over CLARA is that a greater diversity of the search space is explored.
214 CHAPTER 7. CLUSTER ANALYSIS: ADVANCED CONCEPTS
Figure 7.1: The CF-Tree
7.3.2 BIRCH
The Balanced I terative Reducing and Clustering using Hierarchies (BIRCH) approach can be viewed as a combination of top-down hierarchical and k-means clustering. To achieve this goal, the approach introduces a hierarchical data structure, known as the CF-Tree. This is a height-balanced data structure organizing the clusters hierarchically. Each node has a branching factor of at most B, which corresponds to its (at most) B children subclusters. This structure shares a resemblance to the B-Tree data structure commonly used in database indexing. This is by design because the CF-Tree is inherently designed to support dynamic insertions into the hierarchical clustering structure. An example of the CF-Tree is illustrated in Fig. 7.1.
Each node contains a concise summary of each of the at most B subclusters that it points to. This concise summary of a cluster is referred to as its cluster feature (CF), or cluster feature vector. The summary contains the triple (SS, LS, m), where SS is a vector1 containing the sum of the square of the points in the cluster (second-order moment), LS is a vector containing the linear sum of the points in the cluster (first-order moment), and m is the number of points in the cluster (zeroth-order moment). Thus, the size of the summary is (2 · d + 1) for a d-dimensional data set and is also referred to as a CF-vector. The cluster feature vector thus contains all moments of order at most 2. This summary has two very important properties:
Each cluster feature can be represented as a linear sum of the cluster features of the individual data points. Furthermore, the cluster feature of a parent node in the CF-Tree is the sum of the cluster features of its children. The cluster feature of a merged cluster can also be computed as the sum of the cluster features of the constituent clusters. Therefore, incremental updates of the cluster feature vector can be efficiently achieved by adding the cluster feature vector of a data point to that of the cluster.
The cluster features can be used to compute useful properties of a cluster, such as its radius and centroid. Note that these are the only two computations required by a centroid-based algorithm, such as k-means or BIRCH. These computations are dis-cussed below.
To understand how the cluster feature can be used to measure the radius of a cluster, consider a set of data points denoted by X1 . . . Xm, where Xi = (x1i . . . xdi). The mean and
It is possible to store the sum of the values in SS across the d dimensions in lieu of SS, without affecting the usability of the cluster feature. This would result in a cluster feature of size (d + 2) instead of (2 · d + 1).
Dostları ilə paylaş: |