Data Mining: The Textbook




7.6. HUMAN AND VISUALLY SUPERVISED CLUSTERING


Algorithm IPCLUS(Data Set: D, Polarization Points: k)
begin
  while not (termination criterion) do
  begin
    Randomly sample k points Y1 . . . Yk from D;
    Compute 2-dimensional subspace E polarized around Y1 . . . Yk;
    Generate density profile in E and present to user;
    Record membership statistics of clusters based on
      user-specified density-based feedback;
  end;
  return consensus clusters from membership statistics;
end

Figure 7.6: The IPCLUS algorithm
Algorithms such as PROCLUS address these issues by making hard decisions about how the data should be most appropriately summarized. Such decisions can clearly be made more effectively by letting the user interactively explore these alternative views and create a final consensus from them. The advantage of involving the user is the greater intuition available in the feedback provided to the clustering process. The result of this cooperative technique is a system that can perform the clustering task better than either a human or a computer could alone.

The idea behind the Interactive Projected CLUStering algorithm (IPCLUS) is to provide the user with a set of meaningful visualizations in lower dimensional projections together with the ability to decide how to separate the clusters. The overall algorithm is illustrated in Fig. 7.6. The interactive projected clustering algorithm works in a series of iterations; in each, a projection is determined in which there are distinct sets of points that can be clearly distinguished from one another. Such projections are referred to as well polarized. In a well-polarized projection, it is easier for the user to clearly distinguish a set of clusters from the rest of the data. Examples of the data density distribution of a well-polarized projection and a poorly polarized projection are illustrated in Fig. 7.7a and b, respectively.
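As a rough illustration, the loop of Fig. 7.6 might be organized as in the following Python skeleton. The helper callables find_polarized_subspace, estimate_density_profile, and get_user_feedback are hypothetical placeholders for the steps described in this section, not functions defined by the book, and the termination criterion is simplified to a fixed iteration count.

```python
import numpy as np

def ipclus(D, k, n_iterations, find_polarized_subspace,
           estimate_density_profile, get_user_feedback):
    """Skeleton of the interactive IPCLUS loop (an illustrative sketch).

    D : (n, d) data matrix; k : number of polarization anchors per iteration.
    The three callables stand in for the subspace search, the density surface
    shown to the user, and the user's density-based cluster feedback.
    """
    n = D.shape[0]
    memberships = []                                 # one user-labelled partition per view
    for _ in range(n_iterations):                    # termination criterion simplified to a count
        anchor_idx = np.random.choice(n, size=k, replace=False)
        anchors = D[anchor_idx]                      # randomly sampled polarization anchors
        dims = find_polarized_subspace(D, anchors)   # two dimensions polarized around the anchors
        profile = estimate_density_profile(D[:, dims])    # 2-d density surface for the user
        labels = get_user_feedback(profile, D[:, dims])   # cluster id per point, -1 for noise
        memberships.append(labels)
    # A consensus step then combines these partitions; one possible scheme is
    # sketched after the next paragraph.
    return memberships
```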


These polarized projections are determined by randomly selecting a set of k records from the database that are referred to as the polarization anchors. The number of polarization anchors k is one of the inputs to the algorithm. A 2-dimensional subspace of the data is determined such that the data are clustered around each of these polarization anchors. Specifically, the 2-dimensional subspace is selected so that the mean square radius of the assignment of data points to their closest polarization anchors is minimized. Different projections are repeatedly determined with different sampled anchors, and the user provides feedback on each. A consensus clustering is then determined from the different clusterings generated by the user over multiple subspace views of the data.
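As one purely illustrative realization of this consensus step, the sketch below uses a co-association scheme: two points are linked when the user placed them in the same cluster in most of the views in which both were labelled, and connected groups of linked points become the final consensus clusters. The function name and the convention that -1 marks noise points are assumptions, not part of the book's description.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def consensus_from_memberships(memberships, n, min_fraction=0.5):
    """Co-association consensus over the user-labelled views (one possible scheme).

    memberships : list of length-n integer label arrays; -1 marks noise/unlabelled points.
    Two points are linked if they shared a cluster in at least min_fraction of the
    views in which both were labelled; connected groups of linked points form the
    consensus clusters.
    """
    together = np.zeros((n, n))        # how often two points shared a cluster
    both_labelled = np.zeros((n, n))   # how often two points were both labelled
    for labels in memberships:
        labels = np.asarray(labels)
        valid = labels >= 0
        both = np.outer(valid, valid)
        together += ((labels[:, None] == labels[None, :]) & both)
        both_labelled += both
    frac = np.where(both_labelled > 0, together / np.maximum(both_labelled, 1), 0.0)
    adjacency = csr_matrix(frac >= min_fraction)
    _, consensus = connected_components(adjacency, directed=False)
    return consensus                   # one consensus cluster id per point
```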


Figure 7.7: Polarizations of different quality and flexibility of user interaction

The polarization subspaces can be determined either as axis-parallel subspaces or as arbitrary subspaces, although the former provide greater interpretability. The overall approach for determining polarization subspaces starts with the full dimensionality and iteratively reduces the dimensionality of the current subspace until a 2-dimensional subspace is obtained. This is achieved by repeatedly assigning data points to their closest subspace-specific anchor points and discarding the noisiest (highest variance) dimensions about the polarization points in each iteration. The dimensionality is reduced by a constant factor of 2 in each iteration. Thus, this is a k-medoids type approach, which reduces the dimensionality of the subspaces used for distance computation in each iteration, but not the seeds. This typically results in the discovery of a 2-dimensional subspace that is tightly clustered around the polarization anchors. Of course, if the polarization anchors are poorly sampled, the resulting clusters will be poorly separated. Nevertheless, repeated sampling of polarization points ensures that good subspaces will be selected in at least a few iterations.
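A minimal sketch of this subspace search follows, assuming that the per-dimension noise is measured as the mean squared deviation of points about their assigned anchors and that the dimension set is (roughly) halved until two dimensions remain. The function name and these details are illustrative choices rather than the book's exact procedure.

```python
import numpy as np

def find_polarized_subspace(D, anchors, target_dim=2):
    """Iteratively shrink the dimension set, keeping the dimensions in which the
    data are most tightly clustered around the polarization anchors (a sketch)."""
    dims = np.arange(D.shape[1])
    while len(dims) > target_dim:
        sub = D[:, dims]                         # data restricted to the current subspace
        sub_anchors = anchors[:, dims]           # anchors restricted to the current subspace

        # k-medoids-style assignment: each point goes to its closest anchor;
        # the anchors themselves are never updated.
        dists = np.linalg.norm(sub[:, None, :] - sub_anchors[None, :, :], axis=2)
        assignment = np.argmin(dists, axis=1)

        # Noise of a dimension = mean squared deviation of the points about the
        # anchors they were assigned to; high-variance dimensions are discarded.
        residuals = sub - sub_anchors[assignment]
        noise = np.mean(residuals ** 2, axis=0)

        # Keep the least noisy half of the dimensions (never fewer than target_dim).
        keep = max(target_dim, len(dims) // 2)
        dims = dims[np.argsort(noise)[:keep]]
    return dims                                  # indices of the retained dimensions
```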

After the projection subspace has been found, kernel density estimation techniques can be used to determine the data density at each point of a 2-dimensional grid of values in the relevant subspace. The density values at these grid points are used to create a surface plot. Examples of such plots are illustrated in Fig. 7.7. Because clusters correspond to dense regions in the data, they are represented by peaks in the density profile. To actually separate out the clusters, the user can visually specify density value thresholds that correspond to noise levels at which clusters can be separated from one another. Specifically, a cluster may be defined to be a connected region in the space with density above a noise threshold that is specified by the user. This cluster may be of arbitrary shape, and the points inside it can be determined. Note that when the density distribution varies significantly with locality, different numbers, shapes, and sizes of clusters will be discovered at different density thresholds. Examples of density thresholding are illustrated in Fig. 7.7c and 7.7d.
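The sketch below shows one way this density-based separation could be realized, assuming a Gaussian kernel density estimate evaluated on a regular grid and connected-component labelling of the grid cells above the user-chosen threshold. The function name, grid resolution, and default bandwidth are illustrative assumptions rather than the book's prescription.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.ndimage import label

def clusters_from_density(points_2d, noise_threshold, grid_size=100):
    """Separate clusters in a 2-d projection by thresholding a kernel density estimate.

    points_2d       : (n, 2) coordinates of the data in the polarized subspace
    noise_threshold : density level chosen visually by the user
    Returns a cluster id per point (-1 for points falling below the threshold).
    """
    kde = gaussian_kde(points_2d.T)                       # kernel density estimator

    # Evaluate the density on a regular grid covering the projection.
    (xmin, ymin), (xmax, ymax) = points_2d.min(0), points_2d.max(0)
    xs = np.linspace(xmin, xmax, grid_size)
    ys = np.linspace(ymin, ymax, grid_size)
    xx, yy = np.meshgrid(xs, ys, indexing="ij")
    density = kde(np.vstack([xx.ravel(), yy.ravel()])).reshape(grid_size, grid_size)

    # Connected regions of grid cells above the user-specified threshold form
    # clusters of arbitrary shape; cells below the threshold are treated as noise.
    regions, _ = label(density > noise_threshold)

    # Map each data point to the region of its (approximately) nearest grid cell.
    ix = np.clip(np.searchsorted(xs, points_2d[:, 0]), 0, grid_size - 1)
    iy = np.clip(np.searchsorted(ys, points_2d[:, 1]), 0, grid_size - 1)
    point_regions = regions[ix, iy]
    return np.where(point_regions > 0, point_regions - 1, -1)   # -1 marks noise points
```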

