Data Mining: The Textbook





where clusters of different numbers and shapes are discovered at different thresholds. It is in this step that user intuition is very helpful, both for deciding which polarized projections are most relevant and for deciding what density thresholds to specify. If desired, the user may discard a projection altogether, or specify multiple thresholds in the same projection to discover clusters of different density in different localities. The density threshold τ need not be specified directly by value. The density separator hyperplane can be visually superposed on the density profile with the help of a graphical interface.


Each feedback of the user results in the generation of connected sets of points within the density contours. These sets of points can be viewed as one or more binary “transactions” drawn on the “item” space of data points. The key is to determine the consensus clusters from these newly created transactions that encode user feedback. While the problem of finding consensus clusters from multiple clusterings will be discussed in detail in the next section, a very simple way of doing this is to use either frequent pattern mining (to find overlapping clusters) or a second level of clustering on the transactions to generate nonoverlapping clusters. Because this new set of transactions encodes the user preferences, the quality of the clusters found with such an approach will typically be quite high.
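As a rough sketch of the second-level clustering option, one can encode each user-selected density contour as a binary transaction over point indices and then group points by their membership profiles. The transactions below and the exact grouping rule (exact-profile matching rather than a full clustering algorithm) are illustrative assumptions, not the book's prescribed implementation:

```python
import numpy as np
from collections import defaultdict

# Hypothetical user feedback: each "transaction" is the set of point
# indices enclosed by one user-selected density contour.
transactions = [
    {0, 1, 2, 3},   # contour from one polarized projection
    {0, 1, 2},      # tighter threshold on the same region
    {4, 5, 6},      # contour from a different projection
    {4, 5, 6, 7},
]
n_points = 8

# Encode each point by its binary membership profile across transactions.
profiles = np.zeros((n_points, len(transactions)))
for t_idx, t in enumerate(transactions):
    for p in t:
        profiles[p, t_idx] = 1.0

# Second-level grouping: points with identical profiles form one
# nonoverlapping consensus cluster (a stand-in for clustering the profiles).
groups = defaultdict(list)
for p in range(n_points):
    groups[tuple(profiles[p])].append(p)
consensus = [sorted(g) for g in groups.values()]
```

Points that every contour agrees on (e.g., 0, 1, 2) end up together, while boundary points selected only at looser thresholds (3 and 7) are split off.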


7.7 Cluster Ensembles


The previous section illustrated how different views of the data can lead to different solutions to the clustering problem. This notion is closely related to the concept of multiview clustering or ensemble clustering, which studies this issue from a broader perspective. It is evident from the discussion in this chapter and the previous one that clustering is an unsupervised problem with many alternative solutions. In spite of the availability of a large number of validation criteria, the ability to truly test the quality of a clustering algorithm remains elusive. The goal in ensemble clustering is to combine the results of many clustering models to create a more robust clustering. The idea is that no single model or criterion truly captures the optimal clustering, but an ensemble of models will provide a more robust solution.


Most ensemble models use the following two steps to generate the clustering solution:





  1. Generate k different clusterings with the use of different models or data selection mechanisms. These represent the different ensemble components.




  2. Combine the different results into a single and more robust clustering.

The following section provides a brief overview of the different ways in which the alternative clusterings can be constructed.


7.7.1 Selecting Different Ensemble Components


The different ensemble components can be selected in a wide variety of ways. They can be either model-based or data-selection based. In model-based ensembles, the different components of the ensemble reflect different models, such as the use of different clustering models, different settings of the same model, or different clusterings provided by different runs of the same randomized algorithm. Some examples follow:





  1. The different components can be a variety of models such as partitioning methods, hierarchical methods, and density-based methods. The qualitative differences between the models will be data set-specific.






  2. The different components can correspond to different settings of the same algorithm. An example is the use of different initializations for algorithms such as k-means or EM, the use of different mixture models for EM, or the use of different parameter settings of the same algorithm, such as the choice of the density threshold in DBSCAN. An ensemble approach is useful because the optimum choice of parameter settings is also hard to determine in an unsupervised problem such as clustering.




  3. The different components could be obtained from a single algorithm. For example, a 2-means clustering applied to the 1-dimensional embedding obtained from spectral clustering will yield a different clustering solution for each eigenvector. Therefore, the smallest k nontrivial eigenvectors will provide k different solutions that are often quite different as a result of the orthogonality of the eigenvectors.
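The spectral construction in the last example can be sketched as follows: build a similarity graph, take eigenvectors of its Laplacian, and run a 2-means split on each nontrivial 1-dimensional embedding. The toy data, the Gaussian similarity with unit bandwidth, and the simple threshold-based 2-means routine are all illustrative assumptions:

```python
import numpy as np

# Toy data: three well-separated groups (an assumed layout).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.2, (10, 2)),
               rng.normal([4, 0], 0.2, (10, 2)),
               rng.normal([2, 4], 0.2, (10, 2))])

# Gaussian similarity and unnormalized Laplacian L = D - W.
d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
W = np.exp(-d2)                    # bandwidth 1, an arbitrary choice
L = np.diag(W.sum(axis=1)) - W

# eigh returns eigenvectors in ascending eigenvalue order;
# the first (constant) eigenvector is trivial and skipped.
_, vecs = np.linalg.eigh(L)

def two_means_1d(v):
    """2-means on a 1-D embedding, initialized at the extremes."""
    c_lo, c_hi = v.min(), v.max()
    for _ in range(20):
        labels = (np.abs(v - c_hi) < np.abs(v - c_lo)).astype(int)
        c_lo, c_hi = v[labels == 0].mean(), v[labels == 1].mean()
    return labels

# Each nontrivial eigenvector yields a different binary clustering,
# giving k distinct ensemble components from one algorithm.
components = [two_means_1d(vecs[:, j]) for j in (1, 2)]
```

Because the eigenvectors are mutually orthogonal, the binary splits they induce tend to cut the data along different directions, which is what makes them useful as diverse ensemble components.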

A second way of selecting the different components of the ensemble is with the use of data selection. Data selection can be performed in two different ways:





