Data Mining: The Textbook





where clusters of different numbers and shapes are discovered at different thresholds. It is in this step that user intuition is very helpful, both for deciding which polarized projections are most relevant and for deciding what density thresholds to specify. If desired, the user may discard a projection altogether, or specify multiple thresholds in the same projection to discover clusters of different density in different localities. The density threshold τ need not be specified directly by value. The density separator hyperplane can be visually superposed on the density profile with the help of a graphical interface.


Each feedback of the user results in the generation of connected sets of points within the density contours. These sets of points can be viewed as one or more binary “transactions” drawn on the “item” space of data points. The key is to determine the consensus clusters from these newly created transactions that encode user feedback. While the problem of finding consensus clusters from multiple clusterings will be discussed in detail in the next section, a very simple way of doing this is to use either frequent pattern mining (to find overlapping clusters) or a second level of clustering on the transactions to generate nonoverlapping clusters. Because this new set of transactions encodes the user preferences, the quality of the clusters found with such an approach will typically be quite high.
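As a rough sketch of the second-level clustering option, one can encode each user-selected density contour as a binary transaction over point indices and then group points by their membership profiles. The transactions below and the exact grouping rule (exact-profile matching rather than a full clustering algorithm) are illustrative assumptions, not the book's prescribed implementation:

```python
import numpy as np
from collections import defaultdict

# Hypothetical user feedback: each "transaction" is the set of point
# indices enclosed by one user-selected density contour.
transactions = [
    {0, 1, 2, 3},   # contour from one polarized projection
    {0, 1, 2},      # tighter threshold on the same region
    {4, 5, 6},      # contour from a different projection
    {4, 5, 6, 7},
]
n_points = 8

# Encode each point by its binary membership profile across transactions.
profiles = np.zeros((n_points, len(transactions)))
for t_idx, t in enumerate(transactions):
    for p in t:
        profiles[p, t_idx] = 1.0

# Second-level grouping: points with identical profiles form one
# nonoverlapping consensus cluster (a stand-in for clustering the profiles).
groups = defaultdict(list)
for p in range(n_points):
    groups[tuple(profiles[p])].append(p)
consensus = [sorted(g) for g in groups.values()]
```

Points that every contour agrees on (e.g., 0, 1, 2) end up together, while boundary points selected only at looser thresholds (3 and 7) are split off.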


7.7 Cluster Ensembles


The previous section illustrated how different views of the data can lead to different solutions to the clustering problem. This notion is closely related to the concept of multiview clustering or ensemble clustering, which studies this issue from a broader perspective. It is evident from the discussion in this chapter and the previous one that clustering is an unsupervised problem with many alternative solutions. In spite of the availability of a large number of validation criteria, the ability to truly test the quality of a clustering algorithm remains elusive. The goal in ensemble clustering is to combine the results of many clustering models to create a more robust clustering. The idea is that no single model or criterion truly captures the optimal clustering, but an ensemble of models will provide a more robust solution.


Most ensemble models use the following two steps to generate the clustering solution:





  1. Generate k different clusterings with the use of different models or data selection mechanisms. These represent the different ensemble components.




  2. Combine the different results into a single and more robust clustering.

The following section provides a brief overview of the different ways in which the alternative clusterings can be constructed.


7.7.1 Selecting Different Ensemble Components


The different ensemble components can be selected in a wide variety of ways. They can be either model-based or data-selection based. In model-based ensembles, the different components of the ensemble reflect different models, such as the use of different clustering models, different settings of the same model, or different clusterings provided by different runs of the same randomized algorithm. Some examples follow:





  1. The different components can be a variety of models such as partitioning methods, hierarchical methods, and density-based methods. The qualitative differences between the models will be data set-specific.






  2. The different components can correspond to different settings of the same algorithm. An example is the use of different initializations for algorithms such as k-means or EM, the use of different mixture models for EM, or the use of different parameter settings of the same algorithm, such as the choice of the density threshold in DBSCAN. An ensemble approach is useful because the optimum choice of parameter settings is also hard to determine in an unsupervised problem such as clustering.




  3. The different components could be obtained from a single algorithm. For example, a 2-means clustering applied to the 1-dimensional embedding obtained from spectral clustering will yield a different clustering solution for each eigenvector. Therefore, the smallest k nontrivial eigenvectors will provide k different solutions that are often quite different as a result of the orthogonality of the eigenvectors.
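The spectral construction in the last example can be sketched as follows: build a similarity graph, take eigenvectors of its Laplacian, and run a 2-means split on each nontrivial 1-dimensional embedding. The toy data, the Gaussian similarity with unit bandwidth, and the simple threshold-based 2-means routine are all illustrative assumptions:

```python
import numpy as np

# Toy data: three well-separated groups (an assumed layout).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.2, (10, 2)),
               rng.normal([4, 0], 0.2, (10, 2)),
               rng.normal([2, 4], 0.2, (10, 2))])

# Gaussian similarity and unnormalized Laplacian L = D - W.
d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
W = np.exp(-d2)                    # bandwidth 1, an arbitrary choice
L = np.diag(W.sum(axis=1)) - W

# eigh returns eigenvectors in ascending eigenvalue order;
# the first (constant) eigenvector is trivial and skipped.
_, vecs = np.linalg.eigh(L)

def two_means_1d(v):
    """2-means on a 1-D embedding, initialized at the extremes."""
    c_lo, c_hi = v.min(), v.max()
    for _ in range(20):
        labels = (np.abs(v - c_hi) < np.abs(v - c_lo)).astype(int)
        c_lo, c_hi = v[labels == 0].mean(), v[labels == 1].mean()
    return labels

# Each nontrivial eigenvector yields a different binary clustering,
# giving k distinct ensemble components from one algorithm.
components = [two_means_1d(vecs[:, j]) for j in (1, 2)]
```

Because the eigenvectors are mutually orthogonal, the binary splits they induce tend to cut the data along different directions, which is what makes them useful as diverse ensemble components.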

A second way of selecting the different components of the ensemble is with the use of data selection. Data selection can be performed in two different ways:





