Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	134/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 130 131 132 133 134 135 136 137 ... 423

1-Data Mining tarjima

arbitrarily oriented projected clusters, generalized projected clusters, or correlation clusters. Thus, the subspace E_i for each cluster C_i cannot be described in terms of the original set of dimensions. Furthermore, the orthogonal subspace to E_i provides the subspace for performing local dimensionality reduction. This is an interesting problem in its own right. Local dimensionality reduction provides enhanced reduction of data dimensionality because of the local selection of the subspaces for dimensionality reduction.

This problem has two diﬀerent variations, which are referred to as subspace clustering and projected clustering, respectively.

Subspace clustering: In this case, overlaps are allowed among the points drawn from the diﬀerent clusters. This problem is much closer to pattern mining, wherein the asso-ciation patterns are mined from the numeric data after discretization. Each pattern therefore corresponds to a hypercube within a subspace of the numeric data, and the data points within this cube represent the subspace cluster. Typically, the number of subspace clusters mined can be very large, depending upon a user-defined parameter, known as the density threshold.

7.4. HIGH-DIMENSIONAL CLUSTERING

219

Algorithm CLIQUE(Data: D, Ranges: p, Density: τ )

begin
Discretize each dimension of data set D into p ranges;

Determine dense combinations of grid cells at minimum support τ using any frequent pattern mining algorithm;

Create graph in which dense grid combinations are connected if they are adjacent;

Determine connected components of graph;

return (point set, subspace) pair for each connected component;

end

Figure 7.3: The CLIQUE algorithm

Projected clustering: In this case, no overlaps are allowed among the points drawn from the diﬀerent clusters. This definition provides a concise summary of the data. Therefore, this model is much closer, in principle, to the original goals of the clustering framework of data summarization.

In this section, three diﬀerent clustering algorithms will be described. The first of these is CLIQUE, which is a subspace clustering method. The other two are PROCLUS and ORCLUS, which are the first projected clustering methods proposed for the axis-parallel and the correlated versions of the problem, respectively.

7.4.1 CLIQUE

The CLustering In QUEst (CLIQUE) technique is a generalization of grid-based methods discussed in the previous chapter. The input to the method is the number of grid ranges p for each dimension, and the density τ . This density τ represents the minimum number of data points in a dense grid cell and can also be viewed as a minimum support requirement of the grid cell. As in all grid-based methods, the first phase of discretization is used to create a grid structure. In full-dimensional grid-based methods, the relevant dense regions are based on the intersection of the discretization ranges across all dimensions. The main diﬀerence of CLIQUE from these methods is that it is desired to determine the ranges only over a relevant subset of dimensions with density greater than τ . This is the same as the frequent pattern mining problem, where each discretized range is treated as an “item,” and the support is set to τ . In the original CLIQUE algorithm, the Apriori method was used, though any other frequent pattern mining method could be used in principle. As in generic grid-based methods, the adjacent grid cells (defined on the same subspace) are put together. This process is also identical to the generic grid-based methods, except that two grids have to be defined on the same subspace for them to even be considered for adjacency. All the found patterns are returned together with the data points in them. The CLIQUE algorithm is illustrated in Fig. 7.3. An easily understandable description can also be generated for each set of k-dimensional connected grid regions by decomposing it into a minimal set of k-dimensional hypercubes. This problem is NP-hard. Refer to the bibliographic notes for eﬃcient heuristics.

Strictly speaking, CLIQUE is a quantitative frequent pattern mining method rather than a clustering method. The output of CLIQUE can be very large and can sometimes be greater than the size of the data set, as is common in frequent pattern mining. Clustering

220 CHAPTER 7. CLUSTER ANALYSIS: ADVANCED CONCEPTS

and frequent pattern mining are related but diﬀerent problems with diﬀerent objectives. The primary goal of frequent pattern mining is that of finding dimension correlation, whereas the primary goal of clustering is summarization. From this semantic point of view, the approach does not seem to achieve the primary application-specific goal of data summariza-tion. The worst-case complexity of the approach and the number of discovered patterns can be exponentially related to the number of dimensions. The approach may not terminate at low values of the density (support) threshold τ .

7.4.2 PROCLUS

The PROjected CLUStering (PROCLUS) algorithm uses a medoid-based approach to clus-tering. The algorithm proceeds in three phases: an initialization phase, an iterative phase, and a cluster refinement phase. The initialization phase selects a small candidate set M of medoids, which restricts the search space for hill climbing. In other words, the final medoid set will be a subset of the candidate set M . The iterative phase uses a medoid-based tech-nique for hill climbing to better solutions until convergence. The final refinement phase assigns data points to the optimal medoids and removes outliers.

A small candidate set M of medoids is selected during initialization as follows:

A random sample M of data points of size proportional to the number of clusters k is picked. Let the size of this subset be denoted by A · k, where A is a constant greater than 1.

2. A greedy method is used to further reduce the size of the set M to B · k, where A > B > 1. Specifically, a farthest distance approach is applied, where points are iteratively selected by selecting the data point with the farthest distance to the closest of the previously selected points.

Although the selection of a small candidate medoid set M greatly reduces the complexity of the search space, it also tends to include many outliers because of its farthest distance approach. Nevertheless, the farthest distance approach ensures well-separated seeds, which also tend to separate out the clusters well.

The algorithm starts by choosing a random subset S of k medoids from M , and it pro-gressively improves the quality of medoids by iteratively replacing the “bad” medoids in the current set with new points from M . The best set of medoids found so far is always stored in S_best. Each medoid in S is associated with a set of dimensions based on the statistical distribution of data points in its locality. This set of dimensions represents the subspace specific to the corresponding cluster. The algorithm determines a set of “bad” medoids in S_best, using an approach described later. These bad medoids are replaced with randomly selected replacement points from M and the impact on the objective function is measured. If the objective function improves, then the current best set of medoids S_best is updated to S. Otherwise, another randomly selected replacement set is tried for exchanging with the bad medoids in S_best in the next iteration. If the medoids in S_best do not improve for a predefined number of successive replacement attempts, then the algorithm terminates. All computations, such as the assignment and objective function computation, are executed in the subspace associated with each medoid. The overall algorithm is illustrated in Fig. 7.4. Next, we provide a detailed description of each of the aforementioned steps.

Determining projected dimensions for a medoid: The aforementioned approach requires the determination of the quality of a particular set of medoids. This requires the assignment of

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 130 131 132 133 134 135 136 137 ... 423