Data Mining: The Textbook




  • Document collections: Large amounts of document data, usually unlabeled, are available on the Web. A common approach is to label the documents manually, which is a slow, painstaking, and laborious process. Alternatively, crowdsourcing mechanisms, such as Amazon Mechanical Turk, may be used. However, such mechanisms typically incur a dollar cost on a per-instance basis.




  • Privacy-constrained data sets: In many scenarios, the labels on records may be sensitive information that can be acquired only at a significant query cost (e.g., obtaining permission from the relevant entity). In such cases, costs are harder to quantify explicitly, but they can nevertheless be estimated through modeling.




  • Social networks: In social networks, it may be desirable to identify nodes with specific properties. For example, an advertising company may wish to identify social network nodes that are interested in “cosmetics.” However, it is rare for labels to be explicitly associated with the nodes. Identifying relevant nodes may require either manual examination of social network posts or user surveys. Both processes are time-consuming and costly.

It is clear from the aforementioned examples that the acquisition of labels should be viewed as a cost-centric process that helps improve modeling accuracy. The goal in active learning is to maximize classification accuracy at a given cost of label acquisition. Therefore, active learning integrates label acquisition and model construction. This differs from all the other algorithms discussed in this book, which assume that training data labels are already available.
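One way to state this objective formally (an illustrative formalization; the symbols below are introduced here and do not appear in the text) is

$$\max_{S \subseteq \mathcal{D}} \; \mathcal{A}\big(\mathcal{M}(S)\big) \quad \text{subject to} \quad \sum_{i \in S} c_i \le C,$$

where $\mathcal{D}$ is the pool of unlabeled instances, $S$ is the subset whose labels are acquired, $c_i$ is the (possibly instance-specific) cost of labeling instance $i$, $C$ is the acquisition budget, and $\mathcal{A}(\mathcal{M}(S))$ is the accuracy of the model $\mathcal{M}$ trained on $S$.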


Not all training examples are equally informative. To illustrate this point, consider the two-class problem illustrated in Fig. 11.3. The two classes, labeled A and B, respectively, are separated by a vertical decision boundary. Suppose that the acquisition of labels is so costly that one is allowed to acquire the labels of only four examples from the entire data set and must use this set of four examples to train a model. Clearly, this is a very small number of training examples, and the wrong choice of them may lead to significant overfitting. For example, in the case of Fig. 11.3a, the four examples have been randomly sampled from the data set. A typical linear classifier, such as logistic regression, may determine the decision boundary corresponding to the dashed line in Fig. 11.3a. It is evident that this boundary is a poor representation of the true (vertical) decision boundary. On the other hand, in the case of Fig. 11.3b, the sampled examples are chosen more carefully to align along the true decision boundary. This set of labeled examples results in a much better classification model for the entire data set.



[Figure 11.3: Impact of active sampling on the decision boundary. Two panels on unit-square axes: (a) random sampling, with randomly sampled points, and (b) active sampling, with actively sampled points; each panel shows Class A, Class B, and the decision boundary found by the model.]

The goal in active learning is to integrate the labeling and classification processes in a single framework to create robust models. In practice, determining the correct choice of query instances is a very challenging problem. The key is to use the knowledge gained from the labels already acquired to “guess” the most informative regions in which to query the labels. Such an approach can help discover the true shape of the decision boundary as quickly as possible. Therefore, the key question in active learning is as follows:
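To make the contrast in Fig. 11.3 concrete, the following sketch (illustrative code, not from the text; the synthetic pool, the oracle function, and the budget of 20 labels are all assumptions introduced here) compares random sampling with uncertainty sampling, one common query heuristic, using logistic regression on a two-class problem with a vertical true boundary:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Unlabeled pool: 1000 points in the unit square; the true boundary is the
# vertical line x = 0.5, as in Fig. 11.3.
X_pool = rng.uniform(0.0, 1.0, size=(1000, 2))
y_hidden = (X_pool[:, 0] > 0.5).astype(int)   # labels, revealed only on query

def oracle(i):
    """Black-box labeler: returns the label of instance i (at a cost)."""
    return y_hidden[i]

budget = 20   # total number of labels we may acquire

# Strategy 1: random sampling of the labeled set.
rand_idx = list(rng.choice(len(X_pool), size=budget, replace=False))
rand_model = LogisticRegression().fit(X_pool[rand_idx],
                                      [oracle(i) for i in rand_idx])

# Strategy 2: uncertainty sampling. Seed with a few random labels
# (re-seeding until both classes appear), then repeatedly query the
# instance about which the current model is least certain.
labeled = list(rng.choice(len(X_pool), size=4, replace=False))
while len({oracle(i) for i in labeled}) < 2:
    labeled = list(rng.choice(len(X_pool), size=4, replace=False))
while len(labeled) < budget:
    model = LogisticRegression().fit(X_pool[labeled],
                                     [oracle(i) for i in labeled])
    p = model.predict_proba(X_pool)[:, 1]
    uncertainty = -np.abs(p - 0.5)        # largest when p is closest to 0.5
    uncertainty[labeled] = -np.inf        # never re-query an acquired label
    labeled.append(int(np.argmax(uncertainty)))
active_model = LogisticRegression().fit(X_pool[labeled],
                                        [oracle(i) for i in labeled])

for name, m in (("random", rand_model), ("active", active_model)):
    print(name, "accuracy on the full pool:", m.score(X_pool, y_hidden))
```

Under this setup, the actively queried points tend to concentrate near x = 0.5, so the actively trained model typically recovers the true vertical boundary more closely than the randomly trained one, mirroring the behavior illustrated in Fig. 11.3.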




How do we select instances to label to create the most accurate model at a given cost?

In some scenarios, the labeling cost may be instance-specific, although most models use the simplifying assumption of equal costs across all instances. Every active learning system has two primary components, one of which is already given:





  1. Oracle: The oracle provides the responses to the underlying query in the form of labels of specified test instances. The oracle may be a human labeler or a cost-driven data-acquisition system, such as Amazon Mechanical Turk. In general, for modeling purposes, the oracle is viewed as a black box that is part of the input to the process.
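The oracle's black-box role can be made concrete with a minimal sketch of the overall active-learning loop (illustrative code, not from the text; the names query_strategy, cost, train, and budget are assumptions introduced here). It also accommodates the instance-specific costs mentioned above:

```python
from typing import Callable, List, Optional, Tuple
import numpy as np

def active_learning_loop(
    X_pool: np.ndarray,
    oracle: Callable[[int], int],   # black-box labeler (human, crowdsourcing, ...)
    cost: Callable[[int], float],   # per-instance labeling cost (often constant)
    query_strategy: Callable[[Optional[object], np.ndarray, List[int]], int],
    train: Callable[[np.ndarray, List[int]], object],
    budget: float,
) -> Tuple[Optional[object], List[int]]:
    """Alternate between querying the oracle and rebuilding the model until
    the label-acquisition budget is exhausted."""
    labeled: List[int] = []   # indices whose labels have been acquired
    labels: List[int] = []
    spent = 0.0
    model: Optional[object] = None
    while True:
        # The strategy picks the next instance to query; before any model
        # exists, it would typically fall back to random selection.
        idx = query_strategy(model, X_pool, labeled)
        if spent + cost(idx) > budget:    # cannot afford another label
            break
        labels.append(oracle(idx))        # the only interaction with the oracle
        labeled.append(idx)
        spent += cost(idx)
        model = train(X_pool[labeled], labels)
    return model, labeled
```

In this framing, active-learning methods differ essentially in the query strategy plugged into the loop; the oracle itself remains an opaque input to the process.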




