Data Mining: The Textbook




CHAPTER 11. DATA CLASSIFICATION: ADVANCED CONCEPTS




[Figure 11.2: Impact of unlabeled data on classification. Both axes run from 0 to 1. Panel (a), labeled data: a single labeled training example for class A, a single labeled training example for class B, and the decision boundary based on this training pair. Panel (b), labeled and unlabeled data: the same two labeled examples together with many unlabeled examples, and the new decision boundary.]


In Fig. 11.2a, if a test instance were provided near the coordinates (1, 0.7) with only the original training data, then almost any classifier, such as the nearest-neighbor classifier, would assign it to class A. This prediction is not reliable, however, because few labeled examples have been seen in the locality of the test instance. The unlabeled examples can be used to expand the labeled set appropriately, by incrementally labeling the unlabeled examples in each cluster of Fig. 11.2b with the appropriate class. At this point, it becomes evident that test instances near the coordinates (1, 0.7) really belong to class B.
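The effect described above can be reproduced on a toy data set. The following is a minimal sketch, not the book's algorithm: the coordinates are invented to resemble the situation in Fig. 11.2, and the greedy rule (always label the unlabeled point that lies closest to an already labeled one) is only one assumed way of "incrementally labeling" the points.

```python
import math

def nearest_label(point, labeled):
    """1-nearest-neighbor: label of the labeled example closest to `point`."""
    return min(labeled, key=lambda item: math.dist(point, item[0]))[1]

def propagate_labels(labeled, unlabeled):
    """Greedy incremental labeling (an assumed rule, for illustration only).

    Repeatedly pick the unlabeled point that is closest to any labeled
    point and give it the label of its nearest labeled neighbor.
    """
    labeled = list(labeled)
    remaining = list(unlabeled)
    while remaining:
        closest = min(
            remaining,
            key=lambda u: min(math.dist(u, p) for p, _ in labeled),
        )
        labeled.append((closest, nearest_label(closest, labeled)))
        remaining.remove(closest)
    return labeled

# Invented toy data: one labeled example per class.
labeled = [((0.5, 0.9), "A"), ((0.5, 0.1), "B")]

# Unlabeled points: a short horizontal run around the class A example and
# a curved chain that starts at the class B example and bends upward.
unlabeled = [
    (0.3, 0.9), (0.4, 0.9), (0.6, 0.9), (0.7, 0.9),
    (0.6, 0.1), (0.7, 0.12), (0.8, 0.17), (0.9, 0.25),
    (0.95, 0.35), (1.0, 0.45), (1.0, 0.55), (1.0, 0.65),
]

test_point = (1.0, 0.7)

before = nearest_label(test_point, labeled)
after = nearest_label(test_point, propagate_labels(labeled, unlabeled))
print(before, after)
```

With only the two labeled examples, the test point is closer to the class A example, so it is assigned to class A. Once the labels have spread along the two groups of unlabeled points, its nearest labeled neighbor lies on the chain that started at the class B example, so it is assigned to class B.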

A different way of understanding the impact of feature correlation estimation is by examining the intuitively interpretable text domain. Consider a scenario where one is trying to determine whether documents belong to the “Science” category. It is possible that not enough labeled documents contain the word “Einstein.” However, the word “Einstein” may often co-occur with other (more common) words such as “Physics” in unlabeled documents. At the same time, these more common words may already have been associated with the “Science” category because of their presence in labeled documents. Thus, the unlabeled documents provide the insight that the word “Einstein” is also relevant to the “Science” category. This example shows that unlabeled data can be used to learn joint feature distributions that are very relevant to the classification process.
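The “Einstein” argument can be made concrete with a small sketch. The documents, the second category “Sports,” and the two scoring functions below are all hypothetical; the averaging rule is a deliberately crude stand-in for a proper estimate of joint feature distributions.

```python
# Hypothetical toy corpus: each document is a set of lowercase words.
labeled_docs = [
    ({"physics", "experiment", "energy"}, "Science"),
    ({"physics", "quantum", "theory"}, "Science"),
    ({"football", "league", "goal"}, "Sports"),
    ({"goal", "match", "coach"}, "Sports"),
]
unlabeled_docs = [
    {"einstein", "physics", "relativity"},
    {"einstein", "physics", "theory"},
    {"einstein", "energy", "physics"},
    {"coach", "league", "transfer"},
]

def science_score_from_labels(word):
    """Fraction of labeled documents containing `word` that are Science.

    Returns None when the word never appears in a labeled document.
    """
    hits = [label for words, label in labeled_docs if word in words]
    if not hits:
        return None
    return hits.count("Science") / len(hits)

def science_score_via_cooccurrence(word):
    """Average labeled-data score of the words that co-occur with `word`
    in unlabeled documents (an assumed, simplistic rule)."""
    scores = []
    for words in unlabeled_docs:
        if word in words:
            for other in words - {word}:
                score = science_score_from_labels(other)
                if score is not None:
                    scores.append(score)
    return sum(scores) / len(scores) if scores else None

print(science_score_from_labels("einstein"))
print(science_score_via_cooccurrence("einstein"))
```

In this toy corpus the labeled documents alone say nothing about “einstein,” because the word never occurs in them. The unlabeled documents link it to “physics,” “theory,” and “energy,” which the labeled documents already associate with “Science.”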


Many semisupervised methods are termed transductive because they cannot handle out-of-sample test instances. In other words, all test instances need to be specified at the time the training model is constructed, and new out-of-sample instances cannot be classified afterwards. This is different from most of the inductive classifiers discussed in the previous chapter, in which the training and testing phases are cleanly separated.


There are two primary types of techniques that are used for semisupervised learning. The first type consists of meta-algorithms that can use any existing classification algorithm as a subroutine, and leverage it to incorporate the impact of unlabeled data. The second type consists of methods in which specific classifiers are modified to account for the impact of unlabeled data. Two examples of the second type are semisupervised Bayes classifiers and transductive support vector machines. This section will discuss both classes of techniques.
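To illustrate what a meta-algorithm of the first type might look like, the sketch below wraps an arbitrary base learner in a loop of the kind commonly called self-training. It is an assumed example, not a method defined in this passage: the function names, the `batch_size` parameter, and the nearest-centroid base learner with its margin-based confidence are all hypothetical choices.

```python
import math

def train_centroid_classifier(labeled):
    """Hypothetical base learner: nearest class centroid in 2D.

    Returns a function mapping a point to (label, confidence), where the
    confidence is the gap between the two closest centroid distances.
    """
    sums = {}
    for (x, y), label in labeled:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    centroids = {lab: (sx / n, sy / n) for lab, (sx, sy, n) in sums.items()}

    def predict(point):
        dists = sorted((math.dist(point, c), lab) for lab, c in centroids.items())
        best_dist, best_label = dists[0]
        confidence = dists[1][0] - best_dist if len(dists) > 1 else 0.0
        return best_label, confidence

    return predict

def self_train(train, labeled, unlabeled, batch_size=2):
    """Meta-algorithm sketch: `train` can be any function that maps a
    labeled set to a predictor returning (label, confidence)."""
    labeled, pool = list(labeled), list(unlabeled)
    while pool:
        predict = train(labeled)
        # Move the most confidently predicted points into the labeled set.
        ranked = sorted(pool, key=lambda p: predict(p)[1], reverse=True)
        for point in ranked[:batch_size]:
            labeled.append((point, predict(point)[0]))
            pool.remove(point)
    return train(labeled)

# Invented toy data: one labeled example per class, four unlabeled points.
seed = [((0.0, 0.0), "A"), ((1.0, 1.0), "B")]
pool = [(0.1, 0.0), (0.2, 0.1), (0.9, 1.0), (0.8, 0.9)]

model = self_train(train_centroid_classifier, seed, pool)
print(model((0.15, 0.05))[0], model((0.85, 0.95))[0])
```

Because `self_train` only calls `train` and the predictor it returns, the base learner can be swapped for any other classifier that exposes the same interface, which is the sense in which such a wrapper treats the classifier as a subroutine.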





