Document collections: Large amounts of document data, which are usually unlabeled, are available on the Web. A common approach is to manually label the documents, which is a slow, painstaking, and laborious process. Alternatively, crowdsourcing mechanisms, such as Amazon Mechanical Turk, may be used. However, such mechanisms typically incur a dollar-cost on a per-instance basis.
Privacy-constrained data sets: In many scenarios, the labels on records may be sensitive information that can be acquired at a significant query cost (e.g., obtaining permission from the relevant entity). In such cases, costs are harder to quantify explicitly, but can nevertheless be estimated through modeling.
Social networks: In social networks, it may be desirable to identify nodes with specific properties. For example, an advertising company may desire to identify social network nodes that are interested in “cosmetics.” However, it is rare that labels will be explicitly associated with the nodes. Identification of relevant nodes may require either manual examination of social network posts or user surveys. Both processes are time-consuming and costly.
It is clear from the aforementioned examples that the acquisition of labels should be viewed as a cost-centric process that helps improve modeling accuracy. The goal in active learning is to maximize the accuracy of classification at a specific cost of label acquisition. Therefore, active learning integrates label acquisition and model construction. This is different from all the other algorithms discussed in this book, where it is assumed that training data labels are already available.
Not all training examples are equally informative. To illustrate this point, consider the two-class problem illustrated in Fig. 11.3. The two classes, labeled A and B, respectively, have a vertical decision boundary separating them. Suppose that the acquisition of labels is so costly that one is only allowed to acquire the labels of four examples from the entire data set and use this set of four examples to train a model. Clearly, this is a very small number of training examples, and the wrong choice of training examples may lead to significant overfitting. For example, in the case of Fig. 11.3a, the four examples have been randomly sampled from the data set. A typical linear classifier, such as logistic regression, may determine a decision boundary corresponding to the dashed line in Fig. 11.3a. It is evident that this decision boundary is a poor representation of the true (vertical) decision boundary. On the other hand, in the case of Fig. 11.3b, the sampled examples are chosen more carefully to align along the true decision boundary. This set of labeled examples will result in a much better classification model for the entire data set. The goal in active learning is to integrate the labeling and classification process in a single framework to create robust models.
[Figure 11.3: Impact of active sampling on decision boundary. (a) Random sampling: randomly sampled points from classes A and B, with the decision boundary found by the model. (b) Active sampling: actively sampled points from classes A and B, with the decision boundary found by the model.]
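The contrast in Fig. 11.3 can be reproduced on synthetic data. The following sketch is purely illustrative: the data set, random seed, the 0.05-wide band around the boundary, and the rule for picking the four "actively" chosen points are all assumptions made for this example, and NumPy and scikit-learn are assumed to be available. It fits a logistic regression classifier on four labeled points, first chosen (almost) at random and then chosen to straddle the true vertical boundary at x = 0.5, and compares accuracy on the full data set.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic two-class data with a true vertical decision boundary at x = 0.5.
X = rng.uniform(0.0, 1.0, size=(1000, 2))
y = (X[:, 0] > 0.5).astype(int)          # 0 = class A, 1 = class B

def accuracy_with(train_idx):
    """Fit logistic regression on the chosen labeled points; score on all points."""
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    return model.score(X, y)

# (a) Random-style sampling: two random points per class (stratified only so that
# both classes are guaranteed to appear in the tiny training set).
random_idx = np.concatenate([
    rng.choice(np.where(y == 0)[0], size=2, replace=False),
    rng.choice(np.where(y == 1)[0], size=2, replace=False),
])

# (b) Active-style sampling: for each class, the lowest and highest points inside
# a narrow band around the true boundary, so the four points align along it.
near = np.abs(X[:, 0] - 0.5) < 0.05
def straddling_pair(cls):
    idx = np.where(near & (y == cls))[0]
    ys = X[idx, 1]
    return np.array([idx[np.argmin(ys)], idx[np.argmax(ys)]])

boundary_idx = np.concatenate([straddling_pair(0), straddling_pair(1)])

print("accuracy with 4 random points:  ", accuracy_with(random_idx))
print("accuracy with 4 boundary points:", accuracy_with(boundary_idx))

Across repeated runs, the boundary-aligned sample tends to recover a nearly vertical decision boundary and therefore a higher accuracy, mirroring the difference between Fig. 11.3a and Fig. 11.3b.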
In practice, the determination of the correct choice of query instances is a very challenging problem. The key is to use the knowledge gained from the labels already acquired to “guess” the most informative regions in which to query the labels. Such an approach can help discover the true shape of the decision boundary as quickly as possible. Therefore, the key question in active learning is as follows:
How do we select instances to label to create the most accurate model at a given cost?
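One common way to act on this question, though by no means the only one, is pool-based uncertainty sampling: repeatedly refit the model on the labels acquired so far and request the label of the unlabeled instance about which the current model is least certain. The sketch below is a minimal illustration under that assumption; X_pool, oracle, seed_idx, and budget are hypothetical names, the oracle is abstracted as a black-box callable that returns a label on request, and logistic regression is used only as a convenient base classifier (NumPy and scikit-learn are assumed).

import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learn(X_pool, oracle, seed_idx, budget):
    """Pool-based active learning with uncertainty sampling (illustrative sketch)."""
    # seed_idx is assumed to contain at least one instance of each class,
    # so that the first model can be fit.
    labels = {i: oracle(X_pool[i]) for i in seed_idx}   # labels acquired so far
    model = LogisticRegression()

    for _ in range(budget):
        known = list(labels)
        model.fit(X_pool[known], np.array([labels[i] for i in known]))

        # Query the unlabeled instance the current model is least certain about:
        # predicted positive-class probability closest to 0.5.
        unlabeled = [i for i in range(len(X_pool)) if i not in labels]
        proba = model.predict_proba(X_pool[unlabeled])[:, 1]
        query = unlabeled[int(np.argmin(np.abs(proba - 0.5)))]

        labels[query] = oracle(X_pool[query])           # pay one unit of labeling cost

    known = list(labels)
    return model.fit(X_pool[known], np.array([labels[i] for i in known]))

If labeling costs differ across instances, the same loop applies with the selection step ranking instances by uncertainty per unit cost rather than by uncertainty alone.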
In some scenarios, the labeling cost may be instance-specific, although most models use the simplifying assumption of equal costs over all instances. Every active learning system has two primary components, one of which is already given:
Oracle: The oracle provides the responses to the underlying query in the form of labels of specified test instances. The oracle may be a human labeler or a cost-driven data-acquisition system, such as Amazon Mechanical Turk. In general, for modeling purposes, the oracle is viewed as a black box that is part of the input to the process.