Data Mining: The Textbook




6.4 Hierarchical Clustering Algorithms

Figure 6.6: Multigranularity insights from hierarchical clustering


This hierarchical organization of Web pages was created by a manual volunteer effort, but it nevertheless provides a good understanding of the multigranularity insights that may be obtained with such an approach. A small portion of the hierarchical organization is illustrated in Fig. 6.6. At the highest level, the Web pages are organized into topics such as arts, science, health, and so on. At the next level, the topic of science is organized into subtopics, such as biology and physics, whereas the topic of health is divided into subtopics such as fitness and medicine. This organization makes manual browsing very convenient for a user, especially when the content of the clusters can be described in a semantically comprehensible way. In other cases, such hierarchical organizations can be used by indexing algorithms. Furthermore, such methods can sometimes also be used for creating better "flat" clusters. Some agglomerative hierarchical methods and divisive methods, such as bisecting k-means, can provide better-quality clusters than partitioning methods such as k-means, albeit at a higher computational cost.

There are two types of hierarchical algorithms, depending on how the hierarchical tree of clusters is constructed:





  1. Bottom-up (agglomerative) methods: The individual data points are successively agglomerated into higher-level clusters. The main variation among the different methods is in the choice of objective function used to decide the merging of the clusters.




  2. Top-down (divisive) methods: A top-down approach is used to successively partition the data points into a tree-like structure. A flat clustering algorithm may be used for the partitioning in a given step. Such an approach provides tremendous flexibility in terms of choosing the trade-off between the balance in the tree structure and the balance in the number of data points in each node. For example, a tree-growth strategy that always splits the heaviest node will result in leaf nodes with a similar number of data points in them. On the other hand, a tree-growth strategy that constructs a balanced tree structure with the same number of children at each node will lead to leaf nodes with varying numbers of data points. A minimal sketch of such a bisecting strategy is given immediately after this list.
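As a brief illustration of the top-down strategy, the following Python sketch repeatedly bisects the heaviest leaf node with 2-means, in the spirit of bisecting k-means mentioned above. It is only a sketch under the assumption that k-means is used as the flat clustering algorithm; the function name divisive_bisecting and the num_leaves parameter are illustrative and not taken from the text.

import numpy as np
from sklearn.cluster import KMeans

def divisive_bisecting(X, num_leaves=4, random_state=0):
    # Start with a single node (cluster) containing all data points of X (n x d array).
    leaves = [np.arange(len(X))]
    while len(leaves) < num_leaves:
        # Tree-growth strategy: always split the heaviest (largest) leaf next.
        heaviest = max(range(len(leaves)), key=lambda i: len(leaves[i]))
        idx = leaves.pop(heaviest)
        if len(idx) < 2:
            # A singleton node cannot be split any further.
            leaves.append(idx)
            break
        # Use a flat clustering algorithm (here 2-means) to bisect the node.
        labels = KMeans(n_clusters=2, n_init=10,
                        random_state=random_state).fit_predict(X[idx])
        leaves.append(idx[labels == 0])
        leaves.append(idx[labels == 1])
    return leaves

# Example usage:
# X = np.random.rand(100, 2)
# print([len(leaf) for leaf in divisive_bisecting(X, num_leaves=4)])

Because the heaviest node is split at each step, the resulting leaves tend to contain a similar number of data points, matching the trade-off discussed in the list above.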

In the following sections, both types of hierarchical methods will be discussed.


6.4.1 Bottom-Up Agglomerative Methods





In bottom-up methods, the data points are successively agglomerated into higher-level clusters. The algorithm starts with individual data points in their own clusters and successively agglomerates them into higher-level clusters; the generic procedure is illustrated in Figure 6.7.




Algorithm AgglomerativeMerge(Data: D)
begin
  Initialize n × n distance matrix M using D;
  repeat
    Pick closest pair of clusters i and j using M;
    Merge clusters i and j;
    Delete rows/columns i and j from M and create a new row and
      column for the newly merged cluster;
    Update the entries of the new row and column of M;
  until termination criterion;
  return current merged cluster set;
end

Figure 6.7: Generic agglomerative merging algorithm with unspecified merging criterion


In each iteration, two clusters are selected that are deemed to be as close as possible. These clusters are merged and replaced with a newly created merged cluster. Thus, each merging step reduces the number of clusters by 1. Therefore, a method needs to be designed for measuring proximity between clusters containing multiple data points, so that they may be merged. It is in this choice of computing the distances between clusters that most of the variations among different methods arise.
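The specific criteria for computing the distance between two clusters are described later in this section. As an illustration only, the following Python sketch shows two commonly used choices, the minimum and the maximum pairwise point distance between the clusters; the helper pairwise_distances and the function names are illustrative, not the book's notation.

import numpy as np

def pairwise_distances(C1, C2):
    # All Euclidean distances between the points of cluster C1 (n1 x d)
    # and the points of cluster C2 (n2 x d); result has shape (n1, n2).
    diffs = C1[:, None, :] - C2[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2))

def single_linkage_distance(C1, C2):
    # Cluster-to-cluster distance = closest pair of points across the clusters.
    return pairwise_distances(C1, C2).min()

def complete_linkage_distance(C1, C2):
    # Cluster-to-cluster distance = farthest pair of points across the clusters.
    return pairwise_distances(C1, C2).max()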

Let n be the number of data points in the d-dimensional database D, and nt = n − t be the number of clusters after t agglomerations. At any given point, the method maintains an nt × nt distance matrix M between the current clusters in the data. The precise methodology for computing and maintaining this distance matrix will be described later. In any given iteration of the algorithm, the (nondiagonal) entry in the distance matrix with the least distance is selected, and the corresponding clusters are merged. This merging will require the distance matrix to be updated to a smaller (nt − 1) × (nt − 1) matrix. The dimensionality reduces by 1 because the rows and columns for the two merged clusters need to be deleted, and a new row and column of distances, corresponding to the newly created cluster, need to be added to the matrix. The algorithm for determining the values of this newly created row and column depends on the cluster-to-cluster distance computation in the merging procedure and will be described later. The incremental update of the distance matrix is a more efficient option than computing all distances from scratch. It is, of course, assumed that sufficient memory is available to maintain the distance matrix. If this is not the case, then the distance matrix will need to be fully recomputed in each iteration, and such agglomerative methods become less attractive.

For termination, either a maximum threshold can be used on the distance between two merged clusters, or a minimum threshold can be used on the number of clusters at termination. The former criterion is designed to automatically determine the natural number of clusters in the data, but it has the disadvantage of requiring the specification of a quality threshold that is hard to guess intuitively. The latter criterion has the advantage of being intuitively interpretable in terms of the number of clusters in the data. The order of merging naturally creates a hierarchical tree-like structure illustrating the relationships between different clusters, which is referred to as a dendrogram. An example of a dendrogram of successive merges on six data points, denoted by A, B, C, D, E, and F, is illustrated in Fig. 6.8a.
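The following Python sketch puts the steps of Figure 6.7 together for the illustrative case of single linkage, where the new row of the distance matrix can be obtained incrementally as the element-wise minimum of the two deleted rows. The function name agglomerative_merge and the use of a minimum cluster-count termination criterion are assumptions for the sketch, not the book's prescribed implementation.

import numpy as np

def agglomerative_merge(X, num_clusters=2):
    # Each data point of X (an n x d array) starts in its own cluster.
    clusters = [[i] for i in range(len(X))]
    # Initialize the n x n distance matrix M using the data.
    diffs = X[:, None, :] - X[None, :, :]
    M = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(M, np.inf)                      # ignore diagonal entries
    while len(clusters) > num_clusters:              # termination criterion
        # Pick the closest pair of clusters i and j using M.
        i, j = np.unravel_index(np.argmin(M), M.shape)
        i, j = min(i, j), max(i, j)
        merged = clusters[i] + clusters[j]           # merge clusters i and j
        # Single-linkage update: the distance of the merged cluster to every
        # other cluster is the minimum of the two deleted rows.
        new_row = np.minimum(M[i], M[j])
        # Delete rows/columns i and j, then append the new row and column.
        keep = [k for k in range(len(clusters)) if k not in (i, j)]
        M = M[np.ix_(keep, keep)]
        new_row = new_row[keep]
        M = np.vstack([M, new_row])
        M = np.column_stack([M, np.append(new_row, np.inf)])
        clusters = [clusters[k] for k in keep] + [merged]
    return clusters

# Example usage:
# X = np.random.rand(20, 2)
# print(agglomerative_merge(X, num_clusters=3))

In practice, library routines such as scipy.cluster.hierarchy.linkage and dendrogram can compute such merge sequences and plot the resulting dendrogram directly.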



