Data Mining: The Textbook




6.4. HIERARCHICAL CLUSTERING ALGORITHMS


Figure 6.8: Illustration of hierarchical clustering steps


The generic agglomerative procedure with an unspecified merging criterion is illustrated in Fig. 6.7. The distances are encoded in the nt × nt distance matrix M, which provides the pairwise cluster distances computed with the use of the merging criterion. The different choices for the merging criterion will be described later. Merging the two clusters corresponding to rows (columns) i and j in the matrix M requires the computation of some measure of distance between their constituent objects. For two clusters containing mi and mj objects, respectively, there are mi · mj pairs of distances between constituent objects. For example, in Fig. 6.8b, there are 2 × 4 = 8 pairs of distances between the constituent objects, illustrated by the corresponding edges. The overall distance between the two clusters needs to be computed as a function of these mi · mj pairs. In the following, different ways of computing the distances will be discussed.
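The generic procedure of Fig. 6.7 can be sketched as follows. This is a minimal illustrative sketch, not the book's pseudocode: the function name `agglomerative` and the callback `merge_dist` (which stands in for the unspecified merging criterion applied to the mi · mj pairs) are assumptions for illustration.

```python
def agglomerative(points, merge_dist, k):
    """Generic agglomerative clustering down to k clusters.

    merge_dist(A, B) is a hypothetical callback standing in for the
    unspecified merging criterion; it returns the distance between two
    clusters, each given as a list of its constituent objects.
    """
    clusters = [[p] for p in points]      # start with singleton clusters
    while len(clusters) > k:
        # scan all pairs of current clusters for the closest pair
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = merge_dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]        # replace clusters i and j ...
        del clusters[j]                   # ... with their union
    return clusters
```

For instance, passing `merge_dist = lambda A, B: min(abs(a - b) for a in A for b in B)` instantiates the single-linkage criterion described next.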


6.4.1.1 Group-Based Statistics


The following discussion assumes that the indices of the two clusters to be merged are denoted by i and j, respectively. In group-based criteria, the distance between two groups of objects is computed as a function of the mi · mj pairs of distances among the constituent objects. The different ways of computing distances between two groups of objects are as follows:





  1. Best (single) linkage: In this case, the distance is equal to the minimum distance among all mi · mj pairs of objects. This corresponds to the closest pair of objects between the two groups. After performing the merge, the matrix M of pairwise distances needs to be updated. The ith and jth rows and columns are deleted and replaced with a single row and column representing the merged cluster. The new row (column) can be computed using the minimum of the values in the previously deleted pair of rows (columns) in M. This is because the distance of any other cluster to the merged cluster is the minimum of its distances to the two individual clusters in the best-linkage scenario. For any other cluster k ≠ i, j, this is equal to min{Mik, Mjk} (for rows) and min{Mki, Mkj} (for columns). The indices of the rows and columns are then updated to account for the deletion of the two clusters and their replacement with a new one. The best-linkage approach is one of the instantiations of agglomerative methods that is very good at discovering clusters of arbitrary shape. This is because the data points in clusters of arbitrary shape can be successively merged along chains of data-point pairs that lie at small pairwise distances from one another. On the other hand, such chaining may also inappropriately merge distinct clusters when it is caused by noisy points.
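The single-linkage update of the matrix M can be sketched as follows. This is a minimal NumPy sketch under one illustrative convention (reuse row/column i for the merged cluster and delete row/column j); the function name is an assumption, not from the text.

```python
import numpy as np

def merge_single_linkage(M, i, j):
    """Update the pairwise-distance matrix M after merging clusters i and j
    under best (single) linkage: the merged cluster's distance to any other
    cluster k is min(M[i, k], M[j, k])."""
    merged = np.minimum(M[i], M[j])   # elementwise min over the two old rows
    M[i] = merged                     # reuse row i for the merged cluster
    M[:, i] = merged                  # ... and column i, by symmetry
    M[i, i] = 0.0                     # self-distance of the merged cluster
    # drop row and column j, shrinking M from n x n to (n-1) x (n-1)
    return np.delete(np.delete(M, j, axis=0), j, axis=1)
```

For example, merging clusters 0 and 1 of a 3 × 3 matrix with M01 = 1, M02 = 4, M12 = 2 yields a 2 × 2 matrix in which the merged cluster is at distance min(4, 2) = 2 from cluster 2.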






  2. Worst (complete) linkage: In this case, the distance between two groups of objects is equal to the maximum distance among all mi · mj pairs of objects in the two groups. This corresponds to the farthest pair in the two groups. Correspondingly, the matrix M is updated using the maximum values of the rows (columns) in this case. For any cluster k ≠ i, j, this is equal to max{Mik, Mjk} (for rows) and max{Mki, Mkj} (for columns). The worst-linkage criterion implicitly attempts to minimize the maximum diameter of a cluster, as defined by the largest distance between any pair of points in the cluster. This method is also referred to as the complete-linkage method.
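The complete-linkage matrix update is the mirror image of the single-linkage one, replacing the minimum with a maximum. As before, this is an illustrative NumPy sketch (reusing row/column i for the merged cluster and deleting row/column j); the function name is an assumption.

```python
import numpy as np

def merge_complete_linkage(M, i, j):
    """Update the pairwise-distance matrix M after merging clusters i and j
    under worst (complete) linkage: the merged cluster's distance to any
    other cluster k is max(M[i, k], M[j, k])."""
    merged = np.maximum(M[i], M[j])   # elementwise max over the two old rows
    M[i] = merged                     # reuse row i for the merged cluster
    M[:, i] = merged                  # ... and column i, by symmetry
    M[i, i] = 0.0                     # restore the zero self-distance
    # drop row and column j, shrinking M from n x n to (n-1) x (n-1)
    return np.delete(np.delete(M, j, axis=0), j, axis=1)
```

On the same 3 × 3 example as before (M01 = 1, M02 = 4, M12 = 2), merging clusters 0 and 1 places the merged cluster at distance max(4, 2) = 4 from cluster 2.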




