Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	105/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 101 102 103 104 105 106 107 108 ... 423

1-Data Mining tarjima

Group-average linkage: In this case, the distance between two groups of objects is equal to the average distance between all m_i · m_j pairs of objects in the groups. To compute the row (column) for the merged cluster in M , a weighted average of the ith and jth rows (columns) in the matrix M is used. For any value of k = i, j, this is

equal to	m_i·M_ik +m_j		·M_jk		(for rows), and		m_i·M_ki+m_j ·M_kj		(for columns).
		m_i+m_j						m_i+m_j

Closest centroid: In this case, the closest centroids are merged in each iteration. This approach is not desirable, however, because the centroids lose information about the relative spreads of the diﬀerent clusters. For example, such a method will not discrim-inate between merging pairs of clusters of varying sizes, as long as their centroid pairs are at the same distance. Typically, there is a bias toward merging pairs of larger clusters because centroids of larger clusters are statistically more likely to be closer to each other.

Variance-based criterion: This criterion minimizes the change in the objective function (such as cluster variance) as a result of the merging. Merging always results in a worsening of the clustering objective function value because of the loss of granularity. It is desired to merge clusters where the change (degradation) in the objective function as a result of merging is as little as possible. To achieve this goal, the zeroth, first, and second order moment statistics are maintained with each cluster. The average squared error SE_i of the ith cluster can be computed as a function of the number m_i of points in the cluster (zeroth-order moment), the sum F_ir of the data points in the cluster i along each dimension r (first-order moment), and the squared sum S_ir of the data points in the cluster i across each dimension r (second-order moment) according to the following relationship;

d
SE_i = (S_ir/m_i − F_ir² /m_i²).	(6.8)
r=1

This relationship can be shown using the basic definition of variance and is used by many clustering algorithms such as BIRCH (cf. Chap. 7). Therefore, for each cluster, one only needs to maintain these cluster-specific statistics. Such statistics are easy to maintain across merges because the moment statistics of a merge of the two clusters i and j can be computed easily as the sum of their moment statistics. Let SE_i_∪_j denote the variance of a potential merge between the two clusters i and j. Therefore, the change in variance on executing a merge of clusters i and j is as follows:

ΔSE_i_∪_j = SE_i_∪_j − SE_i − SE_j .

(6.9)

This change can be shown to always be a positive quantity. The cluster pair with the smallest increase in variance because of the merge is selected as the relevant pair to

6.4. HIERARCHICAL CLUSTERING ALGORITHMS		171

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 101 102 103 104 105 106 107 108 ... 423