Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə135/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   131   132   133   134   135   136   137   138   ...   423
1-Data Mining tarjima

7.4. HIGH-DIMENSIONAL CLUSTERING

221

Algorithm PROCLUS(Database: D, Clusters: k, Dimensions: l)


begin
Select candidate medoids M ⊆ D with a farthest distance approach; S = Random subset of M of size k;


BestObjective = ;

repeat
Compute dimensions (subspace) associated with each medoid in S;


Assign points in D to closest medoids in S using projected distance;




CurrentObjective = Mean projected distance of points to cluster centroids;

if (CurrentObjective < BestObjective) then begin


Sbest = S;


BestObjective = CurrentObjective;

end;

Recompute S by replacing bad medoids in Sbest with random points from M ; until termination criterion;

Assign data points to medoids in Sbest using refined subspace computations; return all cluster-subspace pairs;


end



Figure 7.4: The PROCLUS algorithm
data points to medoids by computing the distance of the data point to each medoid i in the subspace Ei relevant to the ith medoid. First, the locality of each medoid in S is defined. The locality of the medoid is defined as the set of data points that lies in a sphere of radius equal to the distance to the closest medoid. The (statistically normalized) average distance along each dimension from the medoid to the points in its locality is computed. Let rij be the average distance of the data points in the locality of medoid i to medoid i along



dimension j. The mean μi =

d

rij /d and standard deviation σ i =




d

(rij −μi)

2




of







j=1










j=1







d −1










these distance values rij are computed, specific to each locality. This can then be converted




into a statistically normalized value zij :


































zij =

rij − μi

.







(7.10)













σi






















The reason for this locality -specific normalization is that different data localities have differ-ent natural sizes, and it is difficult to compare dimensions from different localities without normalization. Negative values of zij are particularly desirable because they suggest smaller average distances than expectation for a medoid-dimension pair. The basic idea is to select the smallest (most negative) k · l values of zij to determine the relevant cluster- specific dimensions. Note that this may result in the assignment of a different number of dimen-sions to the different clusters. The sum of the total number of dimensions associated with the different medoids must be equal to k · l . An additional constraint is that the number of dimensions associated with a medoid must be at least 2. To achieve this, all the zij values are sorted in increasing order, and the two smallest ones are selected for each medoid i. Then, the remaining k · (l − 2) medoid-dimension pairs are greedily selected as the smallest ones from the remaining values of zij .


222 CHAPTER 7. CLUSTER ANALYSIS: ADVANCED CONCEPTS




Assignment of data points to clusters and cluster evaluation: Given the medoids and their associated sets of dimensions, the data points are assigned to the medoids using a single pass over the database. The distance of the data points to the medoids is computed using the Manhattan segmental distance. The Manhattan segmental distance is the same as the Manhattan distance, except that it is normalized for the varying number of dimensions associated with each medoid. To compute this distance, the Manhattan distance is com-puted using only the relevant set of dimensions, and then divided by the number of relevant dimensions. A data point is assigned to the medoid with which it has the least Manhattan segmental distance. After determining the clusters, the objective function of the clustering is evaluated as the average Manhattan segmental distance of data points to the centroids of their respective clusters. If the clustering objective improves, then Sbest is updated.



Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   131   132   133   134   135   136   137   138   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin