Figure 6.8: Illustration of hierarchical clustering steps

The generic agglomerative procedure with an unspecified merging criterion is illustrated in Fig. 6.7. The distances are encoded in the n_t × n_t distance matrix M. This matrix provides the pairwise cluster distances computed with the use of the merging criterion. The different choices for the merging criterion will be described later. The merging of two clusters corresponding to rows (columns) i and j in the matrix M requires the computation of some measure of distance between their constituent objects. For two clusters containing m_i and m_j objects, respectively, there are m_i · m_j pairs of distances between constituent objects. For example, in Fig. 6.8b, there are 2 × 4 = 8 pairs of distances between the constituent objects, which are illustrated by the corresponding edges. The overall distance between the two clusters needs to be computed as a function of these m_i · m_j pairs. In the following, different ways of computing the distances will be discussed.

6.4.1.1 Group-Based Statistics

The following discussion assumes that the indices of the two clusters to be merged are denoted by i and j, respectively. In group-based criteria, the distance between two groups of objects is computed as a function of the m_i · m_j pairs of distances among the constituent objects. The different ways of computing distances between two groups of objects are as follows:

Best (single) linkage: In this case, the distance is equal to the minimum distance over all m_i · m_j pairs of objects. This corresponds to the closest pair of objects between the two groups. After performing the merge, the matrix M of pairwise distances needs to be updated. The ith and jth rows and columns are deleted and replaced with a single row and column representing the merged cluster. The new row (column) can be computed using the minimum of the values in the previously deleted pair of rows (columns) in M. This is because the distance of any other cluster to the merged cluster is the minimum of its distances to the two individual clusters in the best-linkage scenario. For any other cluster k ≠ i, j, this is equal to min{M_ik, M_jk} (for rows) and min{M_ki, M_kj} (for columns); see the sketch after this list. The indices of the rows and columns are then updated to account for the deletion of the two clusters and their replacement with a new one. The best-linkage approach is one of the instantiations of agglomerative methods that is very good at discovering clusters of arbitrary shape. This is because the data points in clusters of arbitrary shape can be successively merged along chains of data point pairs at small pairwise distances from each other. On the other hand, such chaining may also inappropriately merge distinct clusters when it results from noisy points.

Worst (complete) linkage: In this case, the distance between two groups of objects is equal to the maximum distance over all m_i · m_j pairs of objects in the two groups. This corresponds to the farthest pair of objects in the two groups. Correspondingly, the matrix M is updated using the maximum values of the rows (columns) in this case. For any other cluster k ≠ i, j, this is equal to max{M_ik, M_jk} (for rows) and max{M_ki, M_kj} (for columns). The worst-linkage criterion implicitly attempts to minimize the maximum diameter of a cluster, as defined by the largest distance between any pair of points in the cluster. This method is also referred to as the complete linkage method.
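To make the matrix-update rules concrete, the following Python sketch maintains the cluster-distance matrix M and, after each merge of clusters i and j, replaces their rows and columns with a single row/column containing min{M_ik, M_jk} (single linkage) or max{M_ik, M_jk} (complete linkage) for every other cluster k. This is only an illustrative sketch, not the pseudocode of Fig. 6.7; the function name agglomerative_cluster, the linkage parameter, and the use of Euclidean distances with NumPy are assumptions made here for the example.

```python
import numpy as np

def agglomerative_cluster(points, num_clusters, linkage="single"):
    """Agglomerative clustering with single or complete linkage.

    After each merge, the cluster-distance matrix M is updated by taking
    the elementwise minimum (single linkage) or maximum (complete linkage)
    of the rows/columns of the two merged clusters, i.e. min{M_ik, M_jk}
    or max{M_ik, M_jk} for every other cluster k.
    """
    combine = np.minimum if linkage == "single" else np.maximum

    # Initially, every point is its own cluster, and M holds the
    # pairwise Euclidean distances between the points.
    clusters = [[idx] for idx in range(len(points))]
    M = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(M, np.inf)   # ignore self-distances in the argmin below

    while len(clusters) > num_clusters:
        # Pick the closest pair of clusters (i, j), with i < j.
        i, j = np.unravel_index(np.argmin(M), M.shape)
        i, j = (i, j) if i < j else (j, i)

        # Distance of every other cluster to the newly merged cluster.
        merged = combine(M[i], M[j])

        # Reuse row/column i for the merged cluster and delete row/column j.
        M[i, :] = merged
        M[:, i] = merged
        M[i, i] = np.inf
        M = np.delete(np.delete(M, j, axis=0), j, axis=1)

        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    return clusters

# Example: two tight groups and one outlier, merged down to two clusters.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]])
print(agglomerative_cluster(X, num_clusters=2, linkage="single"))
print(agglomerative_cluster(X, num_clusters=2, linkage="complete"))
```

The sketch stops when a target number of clusters remains; this is just one possible termination criterion, and other choices (for example, stopping once the smallest inter-cluster distance exceeds a threshold) could be substituted without changing the update rules.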