y_i \left( \overline{W} \cdot \overline{X_i} + b \right) \;\ge\; 1 - \xi_i \quad \forall i. \qquad (11.22)
In addition, the nonnegativity constraint ξi ≥ 0 on the slack variables is observed. Note that the value of yi is known, because the training examples are labeled. For the case of unlabeled examples, binary decision variables zi ∈ {−1, +1} (with corresponding slack penalties) are incorporated for each unlabeled training example Xi ∈ U. These decision variables correspond to the assignment of the unlabeled examples to a particular class. The following constraint is added to the optimization problem:
z_i \left( \overline{W} \cdot \overline{X_i} + b \right) \;\ge\; 1 - \xi_i \quad \forall i : \overline{X_i} \in U. \qquad (11.23)
The slack penalties for the unlabeled examples can also be included in the optimization objective function. Note that, unlike yi, the value of zi is not known, and it becomes a binary integer variable that is part of the optimization problem. Therefore, the modified formulation is an integer program, which is far more difficult to solve than the original convex optimization problem for support vector machines.
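For concreteness, the combined formulation can be sketched as follows. Here L denotes the labeled set, U the unlabeled set, and C and C' are assumed regularization parameters weighting the slack penalties of the labeled and unlabeled examples, respectively; the exact weighting scheme varies across specific formulations in the literature.

```latex
\begin{aligned}
\min_{\overline{W},\, b,\, \xi,\, z}\quad
  & \frac{1}{2}\,\|\overline{W}\|^{2}
    \;+\; C \sum_{i \in L} \xi_i
    \;+\; C' \sum_{i \in U} \xi_i \\
\text{subject to}\quad
  & y_i\bigl(\overline{W} \cdot \overline{X_i} + b\bigr) \;\ge\; 1 - \xi_i
    && \forall i \in L, \\
  & z_i\bigl(\overline{W} \cdot \overline{X_i} + b\bigr) \;\ge\; 1 - \xi_i
    && \forall i \in U, \\
  & \xi_i \;\ge\; 0 \;\; \forall i,
    \qquad z_i \in \{-1, +1\}
    && \forall i \in U.
\end{aligned}
```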
A number of techniques have therefore been designed to approximately solve this problem with iterative mechanisms. One of these methods starts by labeling the most confidently predicted examples and iteratively expanding them. The number of positive examples initially labeled from among the unlabeled instances is based on the required trade-off between precision and recall. This ratio of positive to negative examples is maintained throughout the iterative algorithm. In each iteration, the label of one positive example is changed to negative, and that of one negative example to positive, so as to improve the soft margin of the classifier as much as possible. The bibliographic notes contain a discussion of the methods commonly used in this context.
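A minimal sketch of such an iterative heuristic is shown below, using scikit-learn's SVC as the base soft-margin classifier. The function name iterative_s3vm, the parameter pos_fraction, and the swap criterion (a simplified variant of the pairwise switching rule used in transductive SVMs) are illustrative assumptions rather than a reference implementation of any specific published method.

```python
import numpy as np
from sklearn.svm import SVC

def iterative_s3vm(X_lab, y_lab, X_unlab, pos_fraction=0.5, max_iter=50, C=1.0):
    """Approximate semisupervised SVM via iterative label switching.

    Labels are assumed to be in {-1, +1}. The fraction pos_fraction of the
    unlabeled examples receiving an initial positive pseudo-label encodes the
    desired precision/recall trade-off and is preserved throughout, because
    every iteration swaps exactly one positive and one negative pseudo-label.
    """
    # Step 1: train an initial soft-margin classifier on the labeled data only.
    clf = SVC(kernel="linear", C=C)
    clf.fit(X_lab, y_lab)

    # Step 2: pseudo-label the unlabeled examples, making the most
    # confidently predicted ones positive until the desired ratio is met.
    scores = clf.decision_function(X_unlab)
    n_pos = int(round(pos_fraction * len(X_unlab)))
    z = -np.ones(len(X_unlab))
    z[np.argsort(-scores)[:n_pos]] = 1.0

    # Step 3: retrain on all data and swap one positive/negative pair per
    # iteration, as long as the swap reduces the total slack.
    for _ in range(max_iter):
        clf.fit(np.vstack([X_lab, X_unlab]), np.concatenate([y_lab, z]))
        slack = np.maximum(0.0, 1.0 - z * clf.decision_function(X_unlab))

        pos_idx = np.where(z > 0)[0]
        neg_idx = np.where(z < 0)[0]
        if len(pos_idx) == 0 or len(neg_idx) == 0:
            break
        i = pos_idx[np.argmax(slack[pos_idx])]   # worst-violating positive
        j = neg_idx[np.argmax(slack[neg_idx])]   # worst-violating negative
        if slack[i] + slack[j] <= 2.0:           # swapping cannot reduce slack
            break
        z[i], z[j] = -1.0, 1.0
    return clf, z
```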
11.6.3 Graph-Based Semisupervised Learning
The conversion of arbitrary data types to graphs is discussed in Sect. 2.2.2.9 of Chap. 2. Therefore, one advantage of this approach is that it can be used for semisupervised classification of arbitrary data types, as long as a distance function is available for quantifying proximity between data objects. This is a property that graph-based methods inherit from their origins in nearest-neighbor classification. The steps in graph-based semisupervised learning are as follows:
1. Construct a similarity graph on both the labeled and the unlabeled data records. Each data object Oi is associated with a node in the similarity graph. Each object is connected to its k-nearest neighbors.
2. The weight wij of the edge (i, j) is equal to a kernelized function of the distance d(Oi, Oj) between the objects Oi and Oj, so that larger weights indicate greater similarity. A typical example of the weight is based on the heat kernel [90]:
w_{ij} = e^{-d(O_i, O_j)^2 / t^2}. \qquad (11.24)
Here, t is a user-defined parameter.
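A minimal sketch of this graph-construction step is shown below, assuming numeric data and the Euclidean distance as d(·, ·). The function name, the default choice of k, and the symmetrization rule are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_similarity_graph(X, k=10, t=1.0):
    """Construct a k-nearest-neighbor similarity graph with heat-kernel weights.

    Returns an (n x n) weight matrix in which w_ij = exp(-d(O_i, O_j)^2 / t^2)
    whenever O_j is one of the k nearest neighbors of O_i (Eq. 11.24), and 0
    otherwise. The matrix is symmetrized so that an edge is retained if either
    endpoint selects the other as a neighbor.
    """
    n = X.shape[0]
    # k + 1 neighbors are requested because each point is its own nearest neighbor.
    dist, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    W = np.zeros((n, n))
    for i in range(n):
        for d_ij, j in zip(dist[i, 1:], idx[i, 1:]):   # skip the self-neighbor
            W[i, j] = np.exp(-(d_ij ** 2) / (t ** 2))  # heat-kernel weight
    return np.maximum(W, W.T)
```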
This problem is one where we have a graph containing both labeled and unlabeled nodes. It is now desired to infer the labels of the unlabeled nodes with the use of these proximity relationships. This problem is identical to the collective classification problem introduced in Sect. 19.4 of Chap. 19. Readers are advised to refer to the methods discussed in that section.
Graph-based semisupervised learning may be viewed as a semisupervised extension of nearest-neighbor classifiers. The only difference between graph-based semisupervised methods and nearest-neighbor classifiers lies in the way in which similarity graphs are constructed. Nearest-neighbor methods can be conceptually viewed as collective classification methods on similarity graphs in which edges are added only between pairs of labeled and unlabeled instances. Nearest-neighbor classification simply selects the dominant label from the labeled nodes incident on an unlabeled node. In the semisupervised case, edges can be added between any pair of nodes, whether they are labeled or unlabeled. The addition of these extra edges is necessary in semisupervised learning because of the scarcity of the labeled nodes in the similarity graph. Such edges are able to associate unlabeled clusters of arbitrary shape to their closest labeled instances more effectively. The reader is referred to Sect. 19.4 of Chap. 19 for discussions on collective classification.
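Once the weighted graph has been constructed, a simple instance of collective classification is iterative label propagation, sketched below under the assumption of binary labels in {-1, +1} with 0 marking unlabeled nodes. Each unlabeled node repeatedly takes the weighted average of its neighbors' scores while labeled nodes are clamped to their known values; this is only one of the collective classification methods covered in Sect. 19.4, chosen here for brevity.

```python
import numpy as np

def propagate_labels(W, y, n_iter=100):
    """Infer labels of unlabeled nodes by iterative label propagation.

    W : (n x n) symmetric weight matrix, e.g., from build_similarity_graph.
    y : length-n array with labels in {-1, +1} and 0 for unlabeled nodes.
    Returns a length-n array of predicted labels in {-1, +1}.
    """
    labeled = (y != 0)
    f = y.astype(float)             # soft scores; labeled nodes stay clamped
    row_sum = W.sum(axis=1)
    row_sum[row_sum == 0] = 1.0     # isolated nodes keep a score of zero
    for _ in range(n_iter):
        f = (W @ f) / row_sum       # weighted average of neighbor scores
        f[labeled] = y[labeled]     # clamp the known labels
    preds = np.where(f >= 0.0, 1, -1)
    preds[labeled] = y[labeled]
    return preds
```

For example, `propagate_labels(build_similarity_graph(X), y)` classifies all nodes of the graph at once, which corresponds to the transductive setting discussed below.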
11.6.4 Discussion of Semisupervised Learning
An important question in semisupervised learning is whether unlabeled data always helps in improving classification accuracy. Semisupervised learning depends on the inherent class structure of the underlying data. For semisupervised learning to be effective, the class structure of the data should approximately match its clustering structure. This assumption is obvious in the case of the semisupervised EM algorithm. The assumption is, however, implicitly used by other methods as well.
In practice, semisupervised learning is most effective when the number of labeled examples is extremely small, and there is no realistic way of making confident predictions about scarcely populated regions of the space. In some domains, such as node classification of graphs, this is almost always true. Therefore, in such domains, the transductive setting is
the only way in which classification can be performed. These methods will be discussed in detail in Sect. 19.4 of Chap. 19. On the other hand, when a lot of labeled data is already available, the unlabeled examples do not provide much advantage to the learner, and can, in fact, be harmful in some cases.
11.7 Active Learning
From the discussion in the previous section on semisupervised classification, it is evident that labeled data are often scarce in real applications. While labeled data are often expensive to obtain, the cost of procuring them can usually be quantified. Some examples of costly labeling mechanisms are as follows: