Data Mining: The Textbook



yi(W · Xi + b) ≥ 1 − ξi   ∀i.                                        (11.22)

In addition, the nonnegativity constraint ξi ≥ 0 on the slack variables is observed. Note that the value of yi is known, because the training examples are labeled. For the case of unlabeled examples, binary decision variables zi ∈ {−1, +1} (with corresponding slack penalties) are incorporated for each unlabeled training example Xi ∈ U. These decision variables correspond to the assignment of the unlabeled examples to a particular class. The following constraint is added to the optimization problem:





zi(W · Xi + b) ≥ 1 − ξi   ∀i : Xi ∈ U.                               (11.23)

The slack penalties for the unlabeled examples can also be included in the optimization objective function. Note that, unlike yi, the value of zi is not known, and it is a binary integer variable that becomes a part of the optimization problem. Furthermore, the modified optimization formulation is an integer program, which is far more difficult than the original convex optimization problem for support vector machines.
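Putting these pieces together, the modified problem can be written out in full. The following statement is one common way of writing it; the separate tradeoff parameters C and C′ for the labeled set L and the unlabeled set U are an assumed notational choice rather than something fixed by the discussion above:

Minimize  ||W||²/2 + C · Σi∈L ξi + C′ · Σj∈U ξj

subject to Eqs. 11.22 and 11.23, ξi ≥ 0 for all i, and zj ∈ {−1, +1} for all Xj ∈ U. It is the integrality of the variables zj that turns the convex quadratic program into an integer program.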

A number of techniques have therefore been designed to solve this problem approximately with iterative mechanisms. One of these methods starts by labeling the most confidently predicted examples and iteratively expanding the labeled set. The number of positive examples initially labeled from the unlabeled instances is based on the required trade-off between precision and recall. This ratio of positive to negative examples is maintained throughout the iterative algorithm. In each iteration, one positive example is changed to negative, and one negative example to positive, so as to improve the soft margin of the classifier as much as possible. The bibliographic notes contain a discussion of the methods commonly used in this context.
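The iterative mechanism above lends itself to a compact sketch. The following Python fragment is a minimal illustration of this family of heuristics, not a reference implementation: scikit-learn's LinearSVC stands in as an assumed base learner, and the function name transductive_svm, the parameters positive_fraction and n_iter, and the specific swap criterion are all hypothetical choices.

```python
# A minimal sketch of the iterative label-switching heuristic for
# transductive SVMs. The base learner, the margin-based swap rule, and
# the parameter names are illustrative assumptions.
import numpy as np
from sklearn.svm import LinearSVC

def transductive_svm(X_lab, y_lab, X_unlab, positive_fraction=0.5, n_iter=10):
    """y_lab uses labels in {-1, +1}."""
    # Train on labeled data only and assign tentative labels to the
    # unlabeled examples, respecting the desired positive fraction.
    clf = LinearSVC().fit(X_lab, y_lab)
    scores = clf.decision_function(X_unlab)
    n_pos = int(positive_fraction * len(X_unlab))
    z = -np.ones(len(X_unlab))
    z[np.argsort(-scores)[:n_pos]] = 1.0       # most confident positives

    for _ in range(n_iter):
        # Retrain on the union of labeled and tentatively labeled data.
        X_all = np.vstack([X_lab, X_unlab])
        y_all = np.concatenate([y_lab, z])
        clf = LinearSVC().fit(X_all, y_all)
        scores = clf.decision_function(X_unlab)

        # Swap one positive/negative pair: the class ratio is preserved,
        # while the least consistent tentative labels are corrected.
        pos, neg = np.where(z > 0)[0], np.where(z < 0)[0]
        if len(pos) == 0 or len(neg) == 0:
            break
        worst_pos = pos[np.argmin(scores[pos])]
        worst_neg = neg[np.argmax(scores[neg])]
        if scores[worst_pos] < scores[worst_neg]:   # a swap improves fit
            z[worst_pos], z[worst_neg] = -1.0, 1.0
        else:
            break
    return clf, z
```

Note that the class ratio among the tentative labels is preserved in every pass, mirroring the requirement stated above.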




11.6.3 Graph-Based Semisupervised Learning


The conversion of arbitrary data types to graphs is discussed in Sect. 2.2.2.9 of Chap. 2. Therefore, one advantage of this approach is that it can be used for semisupervised classification of arbitrary data types, as long as a distance function is available for quantifying the proximity between data objects. This is a property that graph-based methods inherit from their origins in nearest-neighbor classification. The steps in graph-based semisupervised learning are as follows:





  1. Construct a similarity graph on both the labeled and the unlabeled data records. Each data object Oi is associated with a node in the similarity graph. Each object is connected to its k-nearest neighbors.




  2. The weight wij of the edge (i, j) is equal to a kernelized function of the distance d(Oi, Oj) between the objects Oi and Oj, so that larger weights indicate greater similarity. A typical example of the weight is based on the heat kernel [90]:



wij = e^(−d(Oi,Oj)²/t²).                                             (11.24)

Here, t is a user-defined parameter.
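As a concrete illustration of the two steps above, the following sketch builds the weighted k-nearest-neighbor graph with the heat-kernel weights of Eq. 11.24. Euclidean distance is assumed for illustration; the neighbor search uses scikit-learn's NearestNeighbors, and the function name build_similarity_graph is hypothetical.

```python
# A minimal sketch of graph construction: connect every object to its
# k nearest neighbors and weight each edge by the heat kernel (Eq. 11.24).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_similarity_graph(X, k=10, t=1.0):
    """Return a sparse dict-of-dicts {i: {j: w_ij}} over all objects in X."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 to skip self-match
    dist, idx = nn.kneighbors(X)
    graph = {i: {} for i in range(len(X))}
    for i in range(len(X)):
        for d, j in zip(dist[i][1:], idx[i][1:]):     # drop self (index 0)
            w = np.exp(-(d ** 2) / (t ** 2))          # heat kernel weight
            graph[i][j] = w
            graph[j][i] = w                           # symmetrize the graph
    return graph
```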


The result is a graph containing both labeled and unlabeled nodes. It is now desired to infer the labels of the unlabeled nodes with the use of these proximity relationships. This problem is identical to the collective classification problem introduced in Sect. 19.4 of Chap. 19. Readers are advised to refer to the methods discussed in that section.


Graph-based semisupervised learning may be viewed as a semisupervised extension of nearest-neighbor classifiers. The only difference between graph-based semisupervised methods and nearest-neighbor classifiers lies in the way the similarity graphs are constructed. Nearest-neighbor methods can be conceptually viewed as collective classification methods on similarity graphs in which edges are added only between pairs of labeled and unlabeled instances. Nearest-neighbor classification simply selects the dominant label from the labeled nodes incident on an unlabeled node. In the semisupervised case, edges can be added between any pair of nodes, whether labeled or unlabeled. The addition of these extra edges is necessary in semisupervised learning because of the scarcity of labeled nodes in the similarity graph. Such edges are able to associate unlabeled clusters of arbitrary shape with their closest labeled instances more effectively. The reader is referred to Sect. 19.4 of Chap. 19 for discussions of collective classification.
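For completeness, the following sketch infers the labels of the unlabeled nodes by iterative label propagation, which is one simple member of the collective classification family discussed in Sect. 19.4; it is not the only option described there. It consumes the {i: {j: w_ij}} graph produced by the earlier sketch, and the function name and iteration count are illustrative.

```python
# A minimal sketch of collective classification via label propagation:
# labeled nodes are clamped, and every unlabeled node repeatedly takes
# the weighted average of its neighbors' class scores.
import numpy as np

def propagate_labels(graph, labels, n_classes, n_iter=50):
    """labels[i] is the class index of node i, or -1 if unlabeled."""
    n = len(graph)
    # One score row per node; labeled nodes are one-hot and stay clamped.
    F = np.zeros((n, n_classes))
    for i, y in enumerate(labels):
        if y >= 0:
            F[i, y] = 1.0
    for _ in range(n_iter):
        F_new = F.copy()
        for i in range(n):
            if labels[i] >= 0:
                continue                      # clamp the labeled nodes
            total = sum(graph[i].values())
            if total > 0:                     # weighted neighbor average
                F_new[i] = sum(w * F[j] for j, w in graph[i].items()) / total
        F = F_new
    return F.argmax(axis=1)                   # predicted class per node
```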


11.6.4 Discussion of Semisupervised Learning


An important question in semisupervised learning is whether unlabeled data always helps in improving classification accuracy. Semisupervised learning depends on the inherent class structure of the underlying data. For semisupervised learning to be effective, the class structure of the data should approximately match its clustering structure. This assumption is obvious in the case of the semisupervised EM algorithm. The assumption is, however, implicitly used by other methods as well.


In practice, semisupervised learning is most effective when the number of labeled examples is extremely small, and there is no realistic way of making confident predictions about sparsely populated regions of the space. In some domains, such as node classification of graphs, this is almost always true. Therefore, in such domains, the transductive setting is the only way in which classification can be performed. These methods will be discussed in detail in Sect. 19.4 of Chap. 19. On the other hand, when a lot of labeled data is already available, the unlabeled examples do not provide much advantage to the learner, and can, in fact, be harmful in some cases.


11.7 Active Learning


From the discussion in the previous section on semisupervised classification, it is evident that labeled data are often scarce in real applications. While labeled data are expensive to obtain, the cost of procuring them can often be quantified. Some examples of costly labeling mechanisms are as follows:



