Data Mining: The Textbook



Group-average linkage: In this case, the distance between two groups of objects is equal to the average distance between all mi · mj pairs of objects in the groups. To compute the row (column) for the merged cluster in M, a weighted average of the ith and jth rows (columns) in the matrix M is used. For any value of k ≠ i, j, this is equal to

(mi · Mik + mj · Mjk)/(mi + mj)  (for rows), and  (mi · Mki + mj · Mkj)/(mi + mj)  (for columns).
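The weighted row update above can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the text: the matrix `M`, the cluster sizes, and the helper name `merged_row` are all hypothetical.

```python
import numpy as np

def merged_row(M, sizes, i, j):
    # Group-average linkage update: the row for the merged cluster i∪j
    # is the size-weighted average of rows i and j of the distance matrix M.
    mi, mj = sizes[i], sizes[j]
    return (mi * M[i, :] + mj * M[j, :]) / (mi + mj)

# Toy distance matrix over three clusters containing 1, 3, and 2 points:
M = np.array([[0.0, 2.0, 4.0],
              [2.0, 0.0, 6.0],
              [4.0, 6.0, 0.0]])
sizes = np.array([1, 3, 2])

row = merged_row(M, sizes, 0, 1)
# Distance from the merged cluster 0∪1 to cluster 2:
# (1 * 4.0 + 3 * 6.0) / (1 + 3) = 5.5
```

Because the distance matrix is symmetric, the same helper serves for the column update as well.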

Closest centroid: In this case, the closest centroids are merged in each iteration. This approach is not desirable, however, because the centroids lose information about the relative spreads of the different clusters. For example, such a method will not discriminate between merging pairs of clusters of varying sizes, as long as their centroid pairs are at the same distance. Typically, there is a bias toward merging pairs of larger clusters because centroids of larger clusters are statistically more likely to be closer to each other.




Variance-based criterion: This criterion minimizes the change in the objective function (such as cluster variance) as a result of the merging. Merging always results in a worsening of the clustering objective function value because of the loss of granularity. It is desired to merge clusters where the change (degradation) in the objective function as a result of merging is as little as possible. To achieve this goal, the zeroth-, first-, and second-order moment statistics are maintained with each cluster. The average squared error SEi of the ith cluster can be computed as a function of the number mi of points in the cluster (zeroth-order moment), the sum Fir of the data points in the cluster i along each dimension r (first-order moment), and the squared sum Sir of the data points in the cluster i along each dimension r (second-order moment), according to the following relationship:




SEi = Σ_{r=1}^{d} (Sir/mi − Fir²/mi²).   (6.8)
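As a sanity check on Eq. 6.8, the moment-based expression can be compared against a direct variance computation. This is a minimal sketch assuming NumPy; the toy cluster is arbitrary illustrative data.

```python
import numpy as np

# A toy cluster of mi = 3 points in d = 2 dimensions (illustrative data).
points = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 0.0]])

m = len(points)                  # zeroth-order moment: number of points mi
F = points.sum(axis=0)           # first-order moments Fir, one per dimension
S = (points ** 2).sum(axis=0)    # second-order moments Sir, one per dimension

# Eq. 6.8: SEi = sum over r of (Sir/mi - Fir^2/mi^2)
SE_moments = np.sum(S / m - (F / m) ** 2)

# Direct computation from the definition of variance, summed over dimensions.
SE_direct = points.var(axis=0).sum()
```

Both routes give the same value, which is exactly the identity Var(X) = E[X²] − E[X]² applied per dimension and summed.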




This relationship can be shown using the basic definition of variance and is used by many clustering algorithms such as BIRCH (cf. Chap. 7). Therefore, for each cluster, one only needs to maintain these cluster-specific statistics. Such statistics are easy to maintain across merges because the moment statistics of the merge of the two clusters i and j are simply the sums of the moment statistics of i and j. Let SEij denote the variance of a potential merge between the two clusters i and j. Therefore, the change in variance on executing a merge of clusters i and j is as follows:





ΔSEij = SEij − SEi − SEj.   (6.9)

This change can be shown to always be a positive quantity. The cluster pair with the smallest increase in variance because of the merge is selected as the relevant pair to be merged.
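The merge bookkeeping described above can be sketched as follows, again assuming NumPy; the helper names and toy clusters are hypothetical, not from the text.

```python
import numpy as np

def se_from_moments(m, F, S):
    # Average squared error of a cluster (Eq. 6.8) from its moment statistics.
    return np.sum(S / m - (F / m) ** 2)

def moments(X):
    # (Zeroth-, first-, second-)order moment statistics of a point set X.
    return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

def delta_se(stats_i, stats_j):
    # Change in variance (Eq. 6.9) if clusters i and j were merged.
    # The moment statistics of the merged cluster are just the sums.
    (mi, Fi, Si), (mj, Fj, Sj) = stats_i, stats_j
    se_merged = se_from_moments(mi + mj, Fi + Fj, Si + Sj)
    return se_merged - se_from_moments(mi, Fi, Si) - se_from_moments(mj, Fj, Sj)

# Two well-separated toy clusters: merging them degrades the objective.
A = np.array([[0.0, 0.0], [2.0, 0.0]])
B = np.array([[10.0, 0.0], [12.0, 0.0]])
d = delta_se(moments(A), moments(B))   # positive, as the text states
```

In each agglomerative iteration one would evaluate `delta_se` for every candidate pair, merge the pair with the smallest value, and form the new cluster's statistics by adding the two moment tuples.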



6.4. HIERARCHICAL CLUSTERING ALGORITHMS





