Data Mining: The Textbook



      kc = max{k, kc · α}; lc = max{l, lc · β};
      Repeatedly merge clusters to reduce the number of clusters to the new value of kc;
   end;
   Perform final assignment pass of points to clusters;
   return cluster-subspace pairs (Ci, Ei) for each i ∈ {1 . . . k};
end

Figure 7.5: The ORCLUS algorithm


The current number of seeds, kc, is reduced over successive merging iterations. Methods from representative-based clustering are used to assign data points to these seeds, except that the distance of a data point to its seed is measured in its associated subspace Ei. Initially, the current dimensionality, lc, of each cluster is equal to the full data dimensionality. The value of lc is reduced gradually to the user-specified dimensionality l by successive reduction over different iterations. The idea behind this gradual reduction is that in the first few iterations, the clusters may not correspond very well to the natural lower-dimensional subspace clusters in the data; a larger subspace is therefore retained to avoid loss of information. In later iterations, the clusters are more refined, and subspaces of lower rank may therefore be extracted.
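
To make the projected assignment concrete, the distance of a point to a seed in the subspace Ei can be computed by projecting the difference vector onto the subspace basis. The following is a minimal sketch assuming NumPy; the function name and the (lc × d) row-basis layout are illustrative choices, not from the book:

    import numpy as np

    def projected_distance(point, seed, E):
        # E is an (lc x d) matrix whose rows are the lc orthonormal
        # basis vectors spanning the cluster subspace Ei.
        diff = point - seed
        # The norm of the coordinates of diff in the basis of Ei equals
        # the length of the projection of diff onto Ei.
        return np.linalg.norm(E @ diff)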


The overall algorithm consists of a number of iterations, in each of which a sequence of merging operations is alternated with a k-means style assignment based on projected distances. In a given iteration, the number of current clusters is reduced by a factor of α < 1, and the dimensionality of each current cluster Ci is reduced by a factor of β < 1. The first few iterations therefore correspond to a higher dimensionality, and each successive iteration peels off more of the noisy subspaces of the different clusters. The values of α and β are related in such a way that the reduction from k0 to k clusters occurs in the same number of iterations as the reduction from l0 = d to l dimensions. The value of α is 0.5, and the derived value of β is indicated in Fig. 7.5, which also contains the overall description of the algorithm.
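
The relationship between α and β follows from equating the two iteration counts: kc reaches k after t = log(k0/k)/log(1/α) iterations, so β must satisfy d · β^t = l. A minimal sketch of this computation (the function name and signature are illustrative):

    import math

    def orclus_schedule(k0, k, d, l, alpha=0.5):
        # Iterations needed to shrink k0 seeds down to k clusters when
        # the seed count is multiplied by alpha in each iteration.
        t = math.log(k0 / k) / math.log(1.0 / alpha)
        # Choose beta so that the dimensionality shrinks from d to l in
        # the same t iterations, i.e., d * beta**t = l.
        beta = (l / d) ** (1.0 / t)
        return t, beta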


The overall procedure uses the three alternating steps of assignment, subspace recomputation, and merging in each iteration. Therefore, the algorithm uses concepts from both hierarchical and k-means methods in conjunction with subspace refinement. The assignment step assigns each data point to its closest seed, by comparing the projected distance of the data point to the ith seed in S, using the subspace Ei. After the assignment, all the seeds in S are re-centered to the centroids of the corresponding clusters. At this point, the subspace Ei of dimensionality lc associated with each cluster Ci is computed. This is done by using PCA on cluster Ci. The subspace Ei is defined by the lc orthonormal eigenvectors of the covariance matrix of cluster Ci with the smallest eigenvalues. To perform the merging, the algorithm computes the projected energy (variance) of the union of each candidate pair of clusters in the corresponding least-spread subspace. The pair with the least energy is selected for merging. Note that this is a subspace generalization of the variance criterion for hierarchical merging algorithms (see Sect. 6.4.1 of Chap. 6).
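
The subspace computation and the merging criterion can be sketched as follows, again assuming NumPy; the function names are illustrative, and np.linalg.eigh is used because it returns eigenvalues in ascending order:

    import numpy as np

    def least_spread_subspace(points, lc):
        # Rows of points are the d-dimensional members of one cluster.
        cov = np.cov(points, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues
        E = eigvecs[:, :lc].T            # (lc x d) basis of Ei
        energy = eigvals[:lc].sum()      # projected energy (variance) in Ei
        return E, energy

    def merge_energy(cluster_a, cluster_b, lc):
        # The merging step evaluates this quantity for candidate pairs
        # and merges the pair whose union has the least energy.
        union = np.vstack([cluster_a, cluster_b])
        return least_spread_subspace(union, lc)[1]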


The algorithm terminates when the merging process over all the iterations has reduced the number of clusters to k. At this point, the dimensionality lc of the subspace Ei associated with each cluster Ci is also equal to l. The algorithm performs one final pass over the database to assign data points to their closest seed based on the projected distance. Outliers are handled during the final phase. A data point is considered an outlier when its projected distance to the closest seed i is greater than the projected distance of other seeds to seed i in subspace Ei.
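
The outlier test of the final pass can be sketched as below. Reading "other seeds" as the nearest other seed is an interpretation on my part, and the helper projected_distance from the earlier sketch is assumed; this is not the book's exact formulation:

    import numpy as np

    def is_outlier(point, seeds, subspaces):
        # Distance of the point to each seed, measured in that seed's
        # own subspace Ei.
        dists = [projected_distance(point, s, E)
                 for s, E in zip(seeds, subspaces)]
        i = int(np.argmin(dists))
        # Projected distances from the other seeds to the closest seed i,
        # measured in subspace Ei of seed i.
        other = [projected_distance(seeds[j], seeds[i], subspaces[i])
                 for j in range(len(seeds)) if j != i]
        return dists[i] > min(other)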


A major computational challenge is that the merging technique requires the computation of the eigenvectors of the union of clusters, which can be expensive. To efficiently perform the merging, the ORCLUS approach extends the concept of cluster feature vectors from BIRCH to covariance matrices. The idea is to store not only the moments in the cluster feature vector but also the sum of the products of attribute values for each pair of dimensions. The covariance matrix can be computed from this extended cluster feature vector. This approach can be viewed as a covariance- and subspace-based generalization of the variance-based merging implementation of Sect. 6.4.1 in Chap. 6. For details on this optimization, the reader is referred to the bibliographic section.
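
A minimal sketch of such an extended cluster feature vector follows; the class name and field names are illustrative, since the book describes only the idea of additionally storing pairwise sums of products:

    import numpy as np

    class ExtendedCF:
        # Keeps the point count, the per-dimension linear sums, and the
        # d x d sums of pairwise attribute products. All three fields
        # are additive, so two clusters can be merged without revisiting
        # their points, and the covariance matrix stays recoverable.
        def __init__(self, d):
            self.n = 0
            self.ls = np.zeros(d)
            self.sp = np.zeros((d, d))

        def add(self, x):
            self.n += 1
            self.ls += x
            self.sp += np.outer(x, x)

        def merge(self, other):
            self.n += other.n
            self.ls += other.ls
            self.sp += other.sp

        def covariance(self):
            mean = self.ls / self.n
            return self.sp / self.n - np.outer(mean, mean)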


Depending on the value of k0 chosen, the time complexity is dominated by either the merges or the assignments. The merges require eigenvector computation, which can be expensive. With an efficient implementation based on cluster feature vectors, the merges can be implemented in O(k0² · d · (k0 + d²)) time, whereas the assignment step always requires O(k0 · n · d) time. This can be made faster with the use of optimized eigenvector computations. For smaller values of k0, the computational complexity of the method is closer to k-means, whereas for larger values of k0, the complexity is closer to hierarchical methods. The ORCLUS algorithm does not assume the existence of an incrementally updatable similarity matrix, as is common with bottom-up hierarchical methods. At the expense of additional space, the maintenance of such a similarity matrix can reduce the O(k0³ · d) term to O(k0² · log(k0) · d).
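
To see which term dominates, a back-of-the-envelope comparison may help; the numbers below are illustrative, not from the book:

    def cost_terms(k0, n, d):
        merge = k0**2 * d * (k0 + d**2)  # merge phase with CF vectors
        assign = k0 * n * d              # one assignment pass
        return merge, assign

    # For k0 = 50, n = 100000, d = 20:
    # merge  = 50^2 * 20 * (50 + 400) = 22.5 million operations
    # assign = 50 * 100000 * 20       = 100 million operations
    # so for small k0 the assignments dominate, as in k-means.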


7.5 Semisupervised Clustering


One of the challenges of clustering is that a wide variety of alternative solutions may be found by different algorithms. The quality of these alternative clusterings may be ranked differently by different internal validation criteria, depending on the alignment between the clustering criterion and the validation criterion. This is a major problem with any unsupervised algorithm. Semisupervision therefore relies on external, application-specific criteria to guide the clustering process.


It is important to understand that different clusterings may not be equally useful from an application-specific perspective. The utility of a clustering result is, after all, based on the ability to use it effectively for a given application. One way of guiding the clustering results



