Data Mining: The Textbook




7.5. SEMISUPERVISED CLUSTERING


toward an application-specific goal is with the use of supervision. For example, consider the case where an analyst wishes to segment a set of documents approximately along the lines of the Open Directory Project (ODP)³, where users have already manually labeled documents into a set of predefined categories. One may want to use this directory only as a soft guiding principle because the number of clusters and their topics in the analyst’s collection may not always be exactly the same as in the ODP clusters. One way of incorporating supervision is to download example documents from each category of ODP and mix them with the documents that need to be clustered. This newly downloaded set of documents is labeled by category and provides information about how the features are related to the different clusters (categories). The added set of labeled documents, therefore, provides supervision to the clustering process in the same way that a teacher guides his or her students toward a specific goal.


A different scenario is one in which it is known from background knowledge that certain documents should belong to the same class, and others should not. Correspondingly, two types of semisupervision are commonly used in clustering:





  1. Pointwise supervision: Labels are associated with individual data points and provide information about the category (or cluster) of the object. This version of the problem is closely related to that of data classification.




  2. Pairwise supervision: “Must-link” and “cannot-link” constraints are provided for pairs of data points. A must-link constraint specifies that two objects are required to be in the same cluster, and a cannot-link constraint specifies that two objects are forbidden from being in the same cluster. This form of supervision is also sometimes referred to as constrained clustering.

For each of these variations, a number of simple semisupervised clustering methods are described in the following sections.
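
As a small illustration of pairwise supervision, the following Python sketch represents must-link and cannot-link constraints as pairs of point indices and reports which of them a given cluster assignment violates. The function name `constraint_violations` and the index-pair representation are assumptions made for this example; they do not come from the text.

```python
def constraint_violations(assignment, must_link, cannot_link):
    """Return the pairwise constraints violated by a cluster assignment.

    assignment:  list where assignment[i] is the cluster index of point i.
    must_link:   list of (i, j) pairs that should share a cluster.
    cannot_link: list of (i, j) pairs that should be in different clusters.
    (Hypothetical helper for illustration only.)
    """
    # A must-link pair is violated when the two points are separated.
    broken_must = [(i, j) for i, j in must_link
                   if assignment[i] != assignment[j]]
    # A cannot-link pair is violated when the two points are together.
    broken_cannot = [(i, j) for i, j in cannot_link
                     if assignment[i] == assignment[j]]
    return broken_must, broken_cannot


# Four points assigned to two clusters.
assignment = [0, 0, 1, 1]
must_link = [(0, 1), (1, 2)]
cannot_link = [(0, 3), (2, 3)]
print(constraint_violations(assignment, must_link, cannot_link))
```

A constrained clustering method could use such a check either to reject assignments that violate constraints or to penalize them in its objective; which of the two is appropriate depends on how reliable the constraints are.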


7.5.1 Pointwise Supervision


Pointwise supervision is significantly easier to address than pairwise supervision because the labels associated with the data points can be used more naturally in conjunction with existing clustering algorithms. In soft supervision, the labels are used as guidance, but data points with different labels are allowed to mix. In hard supervision, data points with different labels are not allowed to mix. Some examples of different ways of modifying existing clustering algorithms are as follows:





  1. Semisupervised clustering by seeding: In this case, the initial seeds for a k-means algorithm are chosen as data points of different labels. These are used to execute a standard k-means algorithm. The biased initialization has a significant impact on the final results, even when labeled data points are allowed to be assigned to a cluster whose initial seed had a different label (soft supervision). In hard supervision, clusters are explicitly associated with labels corresponding to their initial seeds. The assignment of labeled data points is constrained so that such points can be assigned only to a cluster with the same label. In some cases, the weights of the unlabeled points are discounted while computing cluster centers to increase the impact of supervision. The second form of semisupervision (hard supervision) is closely related to semisupervised classification, which is discussed in Chap. 11. An EM algorithm, which performs semisupervised classification with labeled and unlabeled data, uses a similar approach. Refer to Sect. 11.6 of Chap. 11 for a discussion of this algorithm. For more robust initialization, an unsupervised clustering can be separately applied to each labeled data segment to create the seeds.

³ http://www.dmoz.org/
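
The seeding approach can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `seeded_kmeans`, the use of `None` to mark unlabeled points, the choice of the first labeled point of each label as its seed, and the fixed iteration count are all choices made for this example. The `hard` flag switches between soft supervision (labeled points may move to any cluster) and hard supervision (labeled points stay in the cluster associated with their label).

```python
import math


def seeded_kmeans(points, labels, n_iter=20, hard=False):
    """Sketch of k-means with label-based seeding (hypothetical helper).

    points: list of equal-length tuples of floats.
    labels: list aligned with points; a label, or None if unlabeled.
    hard:   if True, a labeled point is always kept in its label's cluster.
    Returns (centers, assignment); clusters are indexed in sorted label order.
    """
    # Biased initialization: one seed per label (assumed here to be the
    # first labeled point encountered for that label).
    seeds = {}
    for p, lab in zip(points, labels):
        if lab is not None and lab not in seeds:
            seeds[lab] = p
    classes = sorted(seeds)
    index_of = {lab: j for j, lab in enumerate(classes)}
    centers = [seeds[c] for c in classes]

    assignment = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step.
        for i, (p, lab) in enumerate(zip(points, labels)):
            if hard and lab is not None:
                # Hard supervision: a label fixes the cluster.
                assignment[i] = index_of[lab]
            else:
                # Soft supervision or unlabeled point: nearest center wins.
                assignment[i] = min(
                    range(len(centers)),
                    key=lambda j: math.dist(p, centers[j]),
                )
        # Update step: recompute each center as the mean of its members.
        # A cluster with no members keeps its previous center.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return centers, assignment


# Two well-separated groups; the last point carries label "a" even though
# it lies in the group seeded by label "b".
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
          (10.0, 10.0), (9.0, 10.0), (9.0, 9.0)]
labels = ["a", None, None, "b", None, "a"]

print(seeded_kmeans(points, labels, hard=False)[1])  # [0, 0, 0, 1, 1, 1]
print(seeded_kmeans(points, labels, hard=True)[1])   # [0, 0, 0, 1, 1, 0]
```

In the soft run the conflicting labeled point ends up in the cluster seeded by the other label, whereas the hard run keeps it with its own label and pulls that cluster's center toward it. The sketch weights all points equally; discounting unlabeled points, as mentioned above, would require a weighted mean in the update step.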





