Data Mining: The Textbook




Determination of bad medoids: The determination of “bad” medoids from Sbest is performed as follows. The medoid of the cluster with the least number of points is bad. In addition, the medoid of any cluster with fewer than (n/k) · minDeviation points is bad, where minDeviation is a constant smaller than 1; a typical value is 0.1. The assumption here is that bad medoids have small clusters either because they are outliers or because they share points with another cluster. The bad medoids are replaced with random points from the candidate medoid set M. A minimal sketch of this rule appears below.
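The following Python sketch illustrates the rule (the function names and the medoid-to-points dictionary representation are illustrative assumptions, not part of the original description):

    import random

    def find_bad_medoids(clusters, n, k, min_deviation=0.1):
        # clusters: dict mapping each medoid to its list of assigned points.
        # A medoid is bad if its cluster is the smallest overall, or if the
        # cluster has fewer than (n / k) * min_deviation points.
        threshold = (n / k) * min_deviation
        sizes = {m: len(points) for m, points in clusters.items()}
        bad = {min(sizes, key=sizes.get)}       # smallest cluster is always bad
        bad.update(m for m, size in sizes.items() if size < threshold)
        return bad

    def replace_bad_medoids(medoids, bad, candidate_set_M):
        # Swap each bad medoid for a distinct, currently unused candidate
        # from M (assumes M contains enough unused candidates).
        unused = [c for c in candidate_set_M if c not in medoids]
        replacements = iter(random.sample(unused, len(bad)))
        return [next(replacements) if m in bad else m for m in medoids]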


Refinement phase: After the best set of medoids is found, a final pass is performed over the data to improve the quality of the clustering. The dimensions associated with each medoid are computed differently than in the iterative phase: instead of the localities of the medoids, the distribution of the points in the clusters at the end of the iterative phase is used. After the new dimensions have been computed, the points are reassigned to medoids based on the Manhattan segmental distance with respect to the new set of dimensions. Outliers are also handled during this final pass over the data. For each medoid i, the Manhattan segmental distance to its closest other medoid is computed in the relevant subspace of medoid i; this distance is referred to as the sphere of influence of medoid i. If the Manhattan segmental distance of a data point to every medoid is greater than the corresponding sphere of influence, then the data point is discarded as an outlier.
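A compact sketch of this outlier test, assuming NumPy arrays for the points and medoids, and a list relevant_dims in which relevant_dims[i] holds the dimensions selected for medoid i (all names here are illustrative):

    import numpy as np

    def manhattan_segmental_distance(x, y, dims):
        # Manhattan distance restricted to the given dimensions,
        # averaged over the number of those dimensions.
        dims = np.asarray(dims)
        return np.abs(x[dims] - y[dims]).sum() / len(dims)

    def find_outliers(points, medoids, relevant_dims):
        # Sphere of influence of medoid i: its Manhattan segmental distance
        # to the closest other medoid, measured in medoid i's own subspace.
        spheres = [
            min(manhattan_segmental_distance(m, other, relevant_dims[i])
                for j, other in enumerate(medoids) if j != i)
            for i, m in enumerate(medoids)
        ]
        # A point is an outlier if it lies outside every medoid's sphere.
        return [
            idx for idx, p in enumerate(points)
            if all(manhattan_segmental_distance(p, medoids[i], relevant_dims[i])
                   > spheres[i] for i in range(len(medoids)))
        ]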
7.4.3 ORCLUS

The arbitrarily ORiented projected CLUStering (ORCLUS) algorithm finds clusters in arbitrarily oriented subspaces, as illustrated in Fig. 7.2b. Clearly, such clusters cannot be found by axis-parallel projected clustering. Such clusters are also referred to as correlation clusters. The algorithm uses the number of clusters k and the dimensionality l of each subspace Ei as input parameters. Accordingly, the algorithm returns k different pairs (Ci, Ei), where the cluster Ci is defined in the arbitrarily oriented subspace Ei. In addition, the algorithm reports a set of outliers O. This method is also referred to as correlation clustering. Another difference between the PROCLUS and ORCLUS models is the simplifying assumption in the latter that the dimensionality of each subspace is fixed to the same value l. In the former case, the value of l is simply the average dimensionality of the cluster-specific subspaces.


The ORCLUS algorithm uses a combination of hierarchical and k-means clustering, in conjunction with subspace refinement. While hierarchical merging algorithms are generally more effective, they are expensive. Therefore, the algorithm uses a smaller set of hierarchical representatives (seeds) that are successively merged. The algorithm starts with kc = k0 initial seeds, denoted by S.
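In the pseudocode below, the seed count kc and the subspace dimensionality lc shrink in lock-step: kc decays by the factor α = 0.5 in each iteration, and β is chosen so that lc reaches its target l in the same number of iterations that kc needs to reach k. A quick numeric check of this schedule (the parameter values are illustrative):

    import math

    k, k0, l, d = 5, 75, 6, 50   # illustrative parameter choices
    alpha = 0.5
    beta = math.exp(-math.log(d / l) * math.log(1 / alpha) / math.log(k0 / k))

    kc, lc = float(k0), float(d)
    while kc > k:
        kc = max(k, kc * alpha)   # halve the number of seeds
        lc = max(l, lc * beta)    # shrink the subspace dimensionality
        print(f"kc = {kc:6.2f}, lc = {lc:6.2f}")
    # Both reach their targets (kc = 5, lc = 6) on the same final iteration.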




Algorithm ORCLUS(Data: D, Clusters: k, Dimensions: l)
begin
  Sample set S of k0 > k points from D;
  kc = k0; lc = d;
  Set each Ei to the full data dimensionality;
  α = 0.5; β = e^{−log(d/l)·log(1/α)/log(k0/k)};
  while (kc > k) do
  begin
    Assign each data point in D to the closest seed in S
      using the projected distance in Ei, to create Ci;
    Re-center each seed in S to the centroid of cluster Ci;
    Use PCA to determine the subspace Ei associated with Ci
      by selecting the smallest lc eigenvectors
      of the covariance matrix of Ci;
    kc = max{k, kc · α}; lc = max{l, lc · β};
    Merge closest pairs of clusters until only kc seeds remain;
  end
end
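The PCA step selects the directions of least variance: the eigenvectors of the cluster covariance matrix with the smallest eigenvalues span a subspace in which the cluster's points are most tightly packed. A minimal NumPy sketch of this step and of the projected-distance assignment (the function names are illustrative):

    import numpy as np

    def smallest_eigenvector_subspace(cluster_points, lc):
        # The lc eigenvectors with the smallest eigenvalues (columns of
        # the result) span the cluster's least-variance subspace Ei.
        cov = np.cov(cluster_points, rowvar=False)
        eigenvalues, eigenvectors = np.linalg.eigh(cov)   # sorted ascending
        return eigenvectors[:, :lc]                       # d x lc basis for Ei

    def projected_distance(x, seed, basis):
        # Euclidean distance between x and a seed after projecting both
        # onto the seed's subspace basis (a d x lc matrix).
        return np.linalg.norm((x - seed) @ basis)

    def assign_points(points, seeds, bases):
        # Assign each point to the seed with the smallest projected distance.
        return np.array([
            np.argmin([projected_distance(p, s, B)
                       for s, B in zip(seeds, bases)])
            for p in points
        ])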


