
criterion in agglomerative hierarchical algorithms. The merging can be performed until the number of remaining clusters is equal to k. The value of k is an input parameter specified by the user. CURE can handle outliers by periodically eliminating small clusters during the merging process. The idea here is that the clusters remain small because they contain mostly outliers.


To further improve the complexity, the CURE algorithm draws a random sample from the underlying data, and performs the clustering on this random sample. In a final phase of the algorithm, all the data points are assigned to one of the remaining clusters by choosing the cluster with the closest representative data point.


Larger sample sizes can be efficiently used with a partitioning trick. In this case, the sample is further divided into a set of p partitions. Each partition is hierarchically clustered until a desired number of clusters is reached, or some merging quality criterion is met. These intermediate clusters (across all partitions) are then reclustered together hierarchically to create the final set of k clusters from the sampled data. The final assignment phase is applied to the representatives of the resulting clusters. Therefore, the overall process may be described by the following steps:





  1. Sample s points from the database D of size n.

  2. Divide the sample of s points into p partitions of size s/p each.

  3. Cluster each partition independently using hierarchical merging down to k clusters in each partition. The overall number k · p of clusters across all partitions is still larger than the user-desired target k.

  4. Perform hierarchical clustering over the k · p clusters derived across all partitions down to the user-desired target k.

  5. Assign each of the (n − s) nonsample data points to the cluster containing the closest representative.
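
The following sketch (an illustration, not the book's pseudocode) wires these five steps together in Python. It assumes scikit-learn's AgglomerativeClustering as a stand-in for CURE's representative-based merging, and uses cluster centroids in place of CURE's multiple shrunk representative points; the parameter values for s, p, and k are purely illustrative.

    # Hedged sketch of the sampled, partitioned CURE-style pipeline.
    # AgglomerativeClustering replaces CURE's representative-based merging,
    # and cluster centroids stand in for its shrunk representative points.
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def cure_style_clustering(D, s=2000, p=4, k=5, seed=0):
        rng = np.random.default_rng(seed)
        n = len(D)

        # Step 1: sample s points from the database D of size n.
        S = D[rng.choice(n, size=min(s, n), replace=False)]

        # Step 2: divide the sample into p partitions of size roughly s/p each.
        partitions = np.array_split(rng.permutation(S), p)

        # Step 3: cluster each partition down to k clusters, keeping one
        # centroid per intermediate cluster (k * p centroids in total).
        reps = []
        for part in partitions:
            labels = AgglomerativeClustering(n_clusters=k).fit_predict(part)
            reps.extend(part[labels == c].mean(axis=0) for c in range(k))
        reps = np.vstack(reps)

        # Step 4: hierarchically recluster the k * p intermediate clusters to k.
        final_labels = AgglomerativeClustering(n_clusters=k).fit_predict(reps)

        # Step 5: assign every point of D (including the n - s nonsample
        # points) to the cluster of its closest representative.
        dists = np.linalg.norm(D[:, None, :] - reps[None, :, :], axis=2)
        return final_labels[dists.argmin(axis=1)]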

The CURE algorithm is able to discover clusters of arbitrary shape, unlike other scalable methods such as BIRCH and CLARANS. Experimental results have shown that CURE is also faster than these methods.


7.4 High-Dimensional Clustering


High-dimensional data contain many irrelevant features that cause noise in the clustering process. The feature selection section of the previous chapter discussed how the irrelevant features may be removed to improve the quality of clustering. When a large number of features are irrelevant, the data cannot be separated into meaningful and cohesive clusters. This scenario is especially likely to occur when features are uncorrelated with one another. In such cases, the distances between all pairs of data points become very similar. The corresponding phenomenon is referred to as the concentration of distances.
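
This effect is easy to observe empirically. The short sketch below (an illustration, not taken from the book) draws points uniformly at random and prints the ratio of the largest to the smallest pairwise distance; as the dimensionality grows, this ratio approaches 1, which is precisely the concentration of distances.

    # Empirical illustration of distance concentration for uniform random data.
    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    for d in (2, 10, 100, 1000):
        X = rng.random((500, d))            # 500 points uniform in [0, 1]^d
        dists = pdist(X)                    # all pairwise Euclidean distances
        print(f"d={d:4d}  max/min distance ratio = {dists.max() / dists.min():.2f}")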


The feature selection methods discussed in the previous chapter can reduce the detrimental impact of the irrelevant features. However, it may often not be possible to remove any particular set of features a priori when the optimum choice of features depends on the underlying data locality. Consider the case illustrated in Fig. 7.2a. In this case, cluster A exists in the XY-plane, whereas cluster B exists in the YZ-plane. Therefore, the feature relevance is local, and it is no longer possible to remove any feature globally without losing relevant features for some of the data localities. The concept of projected clustering was introduced to address this issue.

Figure 7.2: Illustration of axis-parallel and arbitrarily oriented (correlated) projected clusters

In conventional clustering methods, each cluster is a set of points. In projected clustering, each cluster is defined as a set of points together with a set of dimensions (or subspace). For example, the projected cluster A in Fig. 7.2a would be defined as its relevant set of points, together with the subspace corresponding to the X and Y dimensions. Similarly, the projected cluster B in Fig. 7.2a is defined as its relevant set of points, together with the subspace corresponding to the Y and Z dimensions. Therefore, a projected cluster is defined as the pair (Ci, Ei), where Ci is a set of points and Ei is the subspace defined by a set of dimensions.
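
As a concrete illustration of this pair-based representation (a hypothetical sketch with invented names, not an algorithm from the book), a projected cluster can be stored as the point set Ci together with the index set Ei of its relevant dimensions, and distances are then measured only within that subspace:

    # Hypothetical representation of a projected cluster (Ci, Ei): a point set
    # plus the axis-parallel dimensions that define its subspace.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class ProjectedCluster:
        points: np.ndarray      # Ci: points assigned to the cluster (rows)
        dimensions: tuple       # Ei: indices of the relevant dimensions

        def projected_distance(self, x):
            """Distance from x to the cluster centroid, measured only along
            the cluster's relevant dimensions."""
            dims = list(self.dimensions)
            centroid = self.points[:, dims].mean(axis=0)
            return float(np.linalg.norm(np.asarray(x)[dims] - centroid))

    # Cluster A of Fig. 7.2a would use dimensions (0, 1) for the X-Y plane,
    # and cluster B would use dimensions (1, 2) for the Y-Z plane.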


An even more challenging situation is illustrated in Fig. 7.2b, in which the clusters do not exist in axis-parallel subspaces but in arbitrarily oriented subspaces of the data. This problem is also a generalization of the principal component analysis (PCA) method discussed in Chap. 2, where a single global projection with the largest variance is found to retain the greatest information about the data. In this case, it is desired to retain the best local projections with the least variance to determine the subspaces in which each set of data points is tightly clustered. These types of clusters are referred to as arbitrarily oriented projected clusters or correlation clusters.
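
One simple way to estimate such an arbitrarily oriented local subspace (a generic sketch of the idea, not a specific algorithm from the book) is to eigendecompose the covariance matrix of a candidate cluster and keep the eigenvectors with the smallest eigenvalues, that is, the directions of least local variance along which the points are most tightly packed:

    # Local PCA sketch: directions of least variance for a single cluster.
    import numpy as np

    def least_variance_directions(cluster_points, num_directions):
        """Return the eigenvectors (as columns) of the local covariance matrix
        that correspond to the smallest eigenvalues."""
        centered = cluster_points - cluster_points.mean(axis=0)
        cov = np.cov(centered, rowvar=False)
        eigenvalues, eigenvectors = np.linalg.eigh(cov)   # ascending order
        return eigenvectors[:, :num_directions]

    # Example: points spread along the direction (1, 1, 0) but tight elsewhere;
    # the two least-variance directions span the plane orthogonal to (1, 1, 0).
    rng = np.random.default_rng(0)
    t = rng.normal(size=(300, 1))
    points = t * np.array([1.0, 1.0, 0.0]) + 0.05 * rng.normal(size=(300, 3))
    print(least_variance_directions(points, 2))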




