[Figure: two panels. (a) Good case with no noise — successive single-linkage merges will discover the correct clusters A and B of arbitrary shape. (b) Bad case with noise — successive single-linkage merges along a noisy bridge might prematurely chain the distinct clusters A and B.]
Figure 6.9: Good and bad cases for single-linkage clustering
be merged. As before, a matrix M of pairwise values of ΔSEi∪j is maintained along with the moment statistics. After each merge of the ith and jth clusters, the ith and jth rows and columns of M are deleted, and a new row and column for the merged cluster are added. The kth entry (k ≠ i, j) of this new row (column) is equal to SEi∪j∪k − SEi∪j − SEk. These values are computed using the cluster moment statistics. After computing the new row and column, the indices of the matrix M are updated to account for its reduction in size.
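As a minimal sketch (illustrative names only), the following shows how such merging values can be derived from additively maintained moment statistics, using the unscaled sum of squared error for concreteness (the variance criterion simply rescales it by the cluster size). The new entries of M after a merge are obtained by applying the same function to the merged cluster and each surviving cluster.

```python
import numpy as np

# Moment statistics of a cluster: zeroth (count m), first (per-dimension sums F),
# and second (per-dimension sums of squares S) order moments. Moments of a union
# of clusters are obtained by simple addition, so SE values of merged clusters can
# be computed without revisiting the individual points.

def merge_moments(m1, F1, S1, m2, F2, S2):
    """Moment statistics of the union of two clusters."""
    return m1 + m2, F1 + F2, S1 + S2

def sse(m, F, S):
    """(Unscaled) sum of squared error of a cluster: sum_r (S_r - F_r^2 / m)."""
    return float(np.sum(S - F**2 / m))

def delta_se(stats_i, stats_j):
    """Increase in SSE caused by merging two clusters: SE_{i u j} - SE_i - SE_j."""
    mi, Fi, Si = stats_i
    mj, Fj, Sj = stats_j
    mu, Fu, Su = merge_moments(mi, Fi, Si, mj, Fj, Sj)
    return sse(mu, Fu, Su) - sse(mi, Fi, Si) - sse(mj, Fj, Sj)

# Example: two small clusters in 2 dimensions.
Xi = np.array([[0.0, 0.0], [1.0, 0.0]])
Xj = np.array([[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
stats_i = (len(Xi), Xi.sum(axis=0), (Xi**2).sum(axis=0))
stats_j = (len(Xj), Xj.sum(axis=0), (Xj**2).sum(axis=0))
print(delta_se(stats_i, stats_j))   # merging cost used to fill an entry of M
```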
Ward’s method: Instead of using the change in variance, one might also use the (unscaled) sum of squared error as the merging criterion. This is equivalent to setting the RHS of Eq. 6.8 to $\sum_{r=1}^{d} (m_i S_{ir} - F_{ir}^2)$. Surprisingly, this approach is a variant of the centroid method. The objective function for merging is obtained by multiplying the (squared) Euclidean distance between centroids with the harmonic mean of the number of points in each of the pair. Because larger clusters are penalized by this additional factor, the approach performs more effectively than the centroid method.
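To make the connection to the centroid method explicit, the increase in the (unscaled) sum of squared error caused by merging clusters i and j can be written in closed form; this is a standard identity for Ward's criterion that is straightforward to verify from the moment statistics, though it is not derived here:
$$\mathrm{SE}_{i\cup j} - \mathrm{SE}_i - \mathrm{SE}_j \;=\; \frac{m_i\, m_j}{m_i + m_j}\,\bigl\lVert \bar{X}_i - \bar{X}_j \bigr\rVert^2,$$
where $m_i, m_j$ are the cluster sizes and $\bar{X}_i, \bar{X}_j$ are their centroids. Up to a constant factor of $1/2$, which does not affect the merge ordering, the multiplier $m_i m_j/(m_i + m_j)$ is the harmonic mean of the two cluster sizes, so merges involving larger clusters are penalized relative to the plain centroid criterion.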
The various criteria have different advantages and disadvantages. For example, the single linkage method is able to successively merge chains of closely related points to discover clusters of arbitrary shape. However, this property can also (inappropriately) merge two unrelated clusters, when the chaining is caused by noisy points between two clusters. Examples of good and bad cases for single-linkage clustering are illustrated in Figs. 6.9a and b, respectively. Therefore, the behavior of single-linkage methods depends on the impact and relative presence of noisy data points. Interestingly, the well-known DBSCAN algorithm (cf. Sect. 6.6.2) can be viewed as a robust variant of single-linkage methods, and it can therefore find arbitrarily shaped clusters. The DBSCAN algorithm excludes the noisy points between clusters from the merging process to avoid undesirable chaining effects.
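The chaining effect is easy to reproduce. The sketch below (dataset, names, and parameter values are all illustrative, and it relies on SciPy's standard single-linkage implementation) clusters two well-separated blobs with and without a thin bridge of noise points between them:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2))
blob_b = rng.normal(loc=[10.0, 0.0], scale=0.3, size=(100, 2))
bridge = np.column_stack([np.linspace(0.5, 9.5, 30), rng.normal(0, 0.05, 30)])

def single_linkage_labels(X, k=2):
    # Cut the single-linkage dendrogram so that k clusters remain.
    Z = linkage(X, method="single")
    return fcluster(Z, t=k, criterion="maxclust")

clean = np.vstack([blob_a, blob_b])
noisy = np.vstack([blob_a, blob_b, bridge])

labels_clean = single_linkage_labels(clean)
labels_noisy = single_linkage_labels(noisy)

# Without the bridge, the two blobs receive different labels.
print(set(labels_clean[:100]), set(labels_clean[100:]))
# With the bridge, chaining typically places both blobs (and the bridge) in the
# same cluster, with some stray point left over as the second cluster.
print(set(labels_noisy[:100]), set(labels_noisy[100:200]))
```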
The complete (worst-case) linkage method attempts to minimize the maximum distance between any pair of points in a cluster. This quantification can implicitly be viewed as an approximation of the diameter of a cluster. Because of its focus on minimizing the diameter, it will try to create clusters so that all of them have a similar diameter. However, if some of the natural clusters in the data are larger than others, then the approach will break up the larger clusters. It will also be biased toward creating clusters of spherical shape irrespective of the underlying data distribution. Another problem with the complete linkage method is that it gives too much importance to data points at the noisy fringes of a cluster because of its focus on the maximum distance between any pair of points in the cluster. The group-average, variance, and Ward’s methods are more robust to noise due to the use of multiple linkages in the distance computation.
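For a concrete comparison, the following sketch (names are illustrative) evaluates the single, complete, and group-average criteria on the same pair of clusters; because the complete-linkage value is governed entirely by the farthest pair of points, a single fringe point can dominate it:

```python
import numpy as np
from scipy.spatial.distance import cdist

def linkage_distances(A, B):
    """Inter-cluster distance between point sets A and B under three criteria."""
    D = cdist(A, B)                     # all cross-cluster pairwise distances
    return {
        "single": D.min(),              # closest pair of points
        "complete": D.max(),            # farthest pair (diameter-like)
        "group_average": D.mean(),      # average over all cross pairs
    }

A = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, -0.1]])
B = np.array([[1.0, 0.0], [1.1, 0.2], [4.0, 0.0]])   # last point is a noisy fringe
print(linkage_distances(A, B))
# The fringe point at (4, 0) inflates the complete-linkage value dramatically,
# shifts the group-average value moderately, and leaves single linkage untouched.
```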
The agglomerative method requires the maintenance of a heap of sorted distances to efficiently determine the minimum distance value in the matrix. The initial distance matrix computation requires O(n2 · d) time, and the maintenance of a sorted heap data structure requires O(n2 · log(n)) time over the course of the algorithm because there will be a total of O(n2) additions to and deletions from the heap. Therefore, the overall running time is O(n2 · d + n2 · log(n)). The required space for the distance matrix is O(n2). The space requirement is particularly problematic for large data sets. In such cases, a similarity matrix M cannot be incrementally maintained, and the time complexity of many hierarchical methods will increase dramatically to O(n3 · d). This increase occurs because the similarity computations between clusters need to be performed explicitly at the time of the merging. Nevertheless, it is possible to speed up the algorithm in such cases by approximating the merging criterion. The CURE method, discussed in Sect. 7.3.3 of Chap. 7, provides a scalable single-linkage implementation of hierarchical methods and can discover clusters of arbitrary shape. This improvement is achieved by using carefully chosen representative points from clusters to approximately compute the single-linkage criterion.
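A minimal sketch of this bookkeeping for the single-linkage criterion is shown below (all names are illustrative; stale heap entries are discarded lazily rather than deleted, which is a common implementation shortcut rather than the exact procedure described above). Each merge pops the smallest valid pairwise distance, combines the two clusters, and pushes distances from the new cluster to all surviving clusters:

```python
import heapq
import numpy as np
from scipy.spatial.distance import cdist

def agglomerative_single_linkage(X, num_clusters):
    n = len(X)
    D = cdist(X, X)                          # O(n^2 * d) initial distance matrix
    active = {i: [i] for i in range(n)}      # cluster id -> member point indices
    heap = [(D[i, j], i, j) for i in range(n) for j in range(i + 1, n)]
    heapq.heapify(heap)
    next_id = n
    while len(active) > num_clusters:
        d, i, j = heapq.heappop(heap)
        if i not in active or j not in active:
            continue                         # stale entry for an already-merged cluster
        members = active.pop(i) + active.pop(j)
        # Single-linkage distance from the new cluster to every surviving cluster.
        for k, other in active.items():
            dist = D[np.ix_(members, other)].min()
            heapq.heappush(heap, (dist, next_id, k))
        active[next_id] = members
        next_id += 1
    return list(active.values())

X = np.random.default_rng(1).normal(size=(50, 2))
print([len(c) for c in agglomerative_single_linkage(X, 3)])
```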
Practical Considerations
Agglomerative hierarchical methods naturally lead to a binary tree of clusters. It is generally difficult to control the structure of the hierarchical tree with bottom-up methods as compared to top-down methods. Therefore, in cases where a taxonomy of a specific structure is desired, bottom-up methods are less desirable.
A problem with hierarchical methods is that they are sensitive to a small number of mistakes made during the merging process. For example, if an incorrect merging decision is made at some stage because of the presence of noise in the data set, then there is no way to undo it, and the mistake may further propagate in successive merges. In fact, some variants of hierarchical clustering, such as single-linkage methods, are notorious for successively merging neighboring clusters because of the presence of a small number of noisy points. Nevertheless, there are numerous ways to reduce these effects by treating noisy data points specially.
Agglomerative methods can become impractical from a space- and time-efficiency per-spective for larger data sets. Therefore, these methods are often combined with sampling and other partitioning methods to efficiently provide solutions of high quality.
6.4.2 Top-Down Divisive Methods
Although bottom-up agglomerative methods are typically distance-based methods, top-down hierarchical methods can be viewed as general-purpose meta-algorithms that can use almost any clustering algorithm as a subroutine. Because of the top-down approach, greater control is achieved over the global structure of the tree in terms of its degree and the balance between different branches.
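As a concrete instance of such a meta-algorithm, the following sketch uses k-means (from scikit-learn) as the subroutine A in a bisecting, largest-node-first strategy with multiple randomized trials per split, anticipating the generic procedure described in the next paragraph; all names and parameter choices are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, num_leaves, branch_factor=2, trials=5):
    """Top-down divisive clustering: repeatedly split the largest leaf with k-means."""
    leaves = [np.arange(len(X))]             # root node contains all points
    while len(leaves) < num_leaves:
        idx = max(range(len(leaves)), key=lambda i: len(leaves[i]))
        node = leaves.pop(idx)               # node-selection criterion: largest leaf
        best_labels, best_inertia = None, np.inf
        for seed in range(trials):           # multiple randomized trials; keep the best
            km = KMeans(n_clusters=branch_factor, n_init=1, random_state=seed).fit(X[node])
            if km.inertia_ < best_inertia:
                best_labels, best_inertia = km.labels_, km.inertia_
        for c in range(branch_factor):
            leaves.append(node[best_labels == c])
    return leaves

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(60, 2))
               for c in ([0, 0], [5, 0], [0, 5], [5, 5])])
print([len(leaf) for leaf in divisive_clustering(X, num_leaves=4)])
```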
The overall approach for top-down clustering uses a general-purpose flat-clustering algorithm A as a subroutine. The algorithm initializes the tree at the root node containing all the data points. In each iteration, the data set at a particular node of the current tree is split into multiple nodes (clusters). By changing the criterion for node selection, one can create trees balanced by height or trees balanced by the number of clusters. If the algorithm A is randomized, such as the k-means algorithm (with random seeds), it is possible to use multiple trials of the same algorithm at a particular node and select the best one. The generic pseudocode for a top-down divisive strategy is illustrated in Fig. 6.10. The algo-