In multiclass learning methods, binary classifiers are combined within a meta-algorithm framework. Typically, either a one-against-rest or a one-against-one approach is used, and the votes of the different classifiers are combined to provide a final result. In many scenarios, the one-against-one approach is more efficient than the one-against-rest approach. Many scalable methods have been designed for data classification. For decision trees, two scalable methods are RainForest and BOAT. Numerous fast variants of SVM classifiers have also been designed.
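Both meta-algorithm schemes can be stated very compactly. The following is a minimal sketch, assuming a logistic-regression base learner and a small number of classes; these choices are illustrative and any binary classifier could be substituted.

```python
# Minimal sketch of one-against-rest and one-against-one meta-classification.
# The logistic-regression base learner is an illustrative assumption.
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def one_against_rest(X_train, y_train, X_test):
    classes = np.unique(y_train)
    scores = np.zeros((X_test.shape[0], classes.size))
    for j, c in enumerate(classes):
        # One binary problem per class: class c versus all remaining classes.
        clf = LogisticRegression(max_iter=1000).fit(X_train, (y_train == c).astype(int))
        scores[:, j] = clf.decision_function(X_test)
    return classes[scores.argmax(axis=1)]

def one_against_one(X_train, y_train, X_test):
    classes = np.unique(y_train)
    votes = np.zeros((X_test.shape[0], classes.size), dtype=int)
    for i, j in combinations(range(classes.size), 2):
        # One binary problem per pair of classes; each test point casts one vote per pair.
        mask = np.isin(y_train, [classes[i], classes[j]])
        clf = LogisticRegression(max_iter=1000).fit(
            X_train[mask], (y_train[mask] == classes[j]).astype(int))
        pred = clf.predict(X_test)
        votes[pred == 0, i] += 1
        votes[pred == 1, j] += 1
    return classes[votes.argmax(axis=1)]
```

The one-against-one variant trains k(k-1)/2 models, but each model is trained on a much smaller subset of the data, which is the source of its efficiency advantage when the training cost grows superlinearly with the number of examples.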
The rare class learning problem is very common, especially because class distributions are very imbalanced in many real scenarios. Typically, the objective function for the classification problem is changed with the use of cost-weighting. Cost-weighting can be achieved either with example weighting or with example resampling. Typically, the normal class is undersampled in example resampling, which results in better training efficiency.
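Both mechanisms are easy to sketch. The fragment below assumes a binary setting with a designated rare-class label, a cost ratio of 10, and a logistic-regression learner; these are illustrative assumptions rather than choices prescribed by the text.

```python
# Minimal sketch of rare-class handling: example weighting versus resampling.
# The 1:1 undersampling ratio and the cost ratio of 10 are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_with_example_weights(X, y, rare_label, cost_ratio=10.0):
    # Keep every example, but penalize errors on the rare class more heavily.
    weights = np.where(y == rare_label, cost_ratio, 1.0)
    return LogisticRegression(max_iter=1000).fit(X, y, sample_weight=weights)

def undersample_normal_class(X, y, rare_label, seed=0):
    # Randomly discard normal-class examples until both classes are equally
    # represented, which also shrinks the training set and the training time.
    rng = np.random.default_rng(seed)
    rare_idx = np.flatnonzero(y == rare_label)
    normal_idx = np.flatnonzero(y != rare_label)
    kept = rng.choice(normal_idx, size=rare_idx.size, replace=False)
    idx = np.concatenate([rare_idx, kept])
    return X[idx], y[idx]
```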
The paucity of training data is common in real domains. Semisupervised learning is one way of addressing this paucity. In these methods, the copiously available unlabeled data is used to make estimations about class distributions in regions where little labeled data is available. A second way of leveraging the unlabeled data is active learning, in which labels are acquired for the examples that are most informative for classification.
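The paragraph above does not commit to a particular algorithm. As one concrete instance, the sketch below uses simple self-training, in which a classifier trained on the labeled data repeatedly absorbs its most confident predictions on the unlabeled data; the confidence threshold and the logistic-regression learner are illustrative assumptions.

```python
# Minimal self-training sketch: one simple way of exploiting unlabeled data.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.95, max_rounds=10):
    for _ in range(max_rounds):
        clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
        if X_unlab.shape[0] == 0:
            break
        proba = clf.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break
        # Move confidently predicted examples from the unlabeled pool into the
        # training set, using the model's own predictions as pseudo-labels.
        pseudo_labels = clf.classes_[proba[confident].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, pseudo_labels])
        X_unlab = X_unlab[~confident]
    return clf
```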
Ensemble methods improve classifier accuracy by reducing their bias and variance. Some ensemble methods, such as bagging and random forests, are designed only to reduce the variance, whereas other ensemble methods, such as boosting and stacking, can help reduce both the bias and variance. In some cases, such as boosting, it is possible for the ensemble to overfit the noise in the data and thereby lead to lower accuracy.
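As a small illustration of the variance-reduction idea, the sketch below implements plain bagging over decision trees. The number of bootstrap rounds, the tree base learner, and the assumption of integer class labels are all illustrative choices.

```python
# Minimal bagging sketch: bootstrap sampling plus majority voting.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_rounds=25, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_rounds):
        # Draw a bootstrap sample (with replacement) and train an unpruned tree on it.
        idx = rng.integers(0, X.shape[0], size=X.shape[0])
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # Majority vote across the ensemble; averaging many high-variance trees
    # reduces the variance component of the error (integer labels assumed).
    votes = np.stack([m.predict(X) for m in models]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```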
11.10 Bibliographic Notes
Multiclass strategies are used for those classifiers that are designed to be binary. A typical example of such a classifier is the support vector machine. The one-against-rest strategy is introduced in [106]. The one-against-one strategy is discussed in [318].
Some examples of scalable decision tree methods include SLIQ [381], BOAT [227], and RainForest [228]. Some early parallel implementations of decision trees include the SPRINT method [458]. An example of a scalable SVM method is SVMLight [291]. Other methods, such as SVMPerf [292], reformulate the SVM optimization to reduce the number of slack variables and increase the number of constraints. A cutting plane approach that works with a small subset of constraints at a time is used to make the SVM classifier scalable. This approach is discussed in detail in Chap. 13 on text mining.
Detailed discussions on imbalance and cost-sensitive learning may be found in [136, 139, 193]. A variety of general methods have been proposed for cost-sensitive learning, such as MetaCost [174], weighting [531], and sampling [136, 531]. The SMOTE method is discussed in [137]. Boosting algorithms have also been studied for the problem of cost-sensitive learning. The AdaCost algorithm was proposed in [203]. Boosting techniques can also be combined with sampling methods, as in the case of the SMOTEBoost algorithm [138]. An evaluation of boosting algorithms for rare class detection is provided in [296]. Discussions of linear regression models and regression trees may be found in [110, 256, 391].
Recently, the semisupervised and active learning problems have been studied as ways of using external information for better supervision. The co-training method was discussed in [100]. The EM algorithm for combining labeled and unlabeled data was proposed in [410]. Transductive SVM methods are proposed in [293, 496]. The method in [293] is a scalable SVM method that uses an iterative approach. Graph-based methods for semisupervised learning are discussed in [101, 294]. Surveys on semisupervised classification may be found in [33, 555].
A detailed survey on active learning may be found in [13, 454]. Methods for uncertainty sampling [345], query-by-committee [457], greatest model change [157], greatest error reduction [158], and greatest variance reduction [158] have been proposed. Representativeness-based models have been discussed in [455]. Another form of active learning queries the data vertically. In other words, instead of querying examples, the learner decides which attributes to collect in order to minimize the error at a given cost level [382].
The problem of meta-algorithm analysis has recently become very important because of its significant impact on improving the accuracy of classification algorithms. The bagging and random forests methods were proposed in [111, 112]. The boosting method was proposed in [215]. Bayesian model averaging and combination methods are proposed in [175]. The stacking method is discussed in [491, 513], and the bucket-of-models approach is explained in [541].
11.11 Exercises
1. Suppose that a classification training algorithm requires O(n^r) time for training on a data set of size n. Here, r is assumed to be larger than 1. Consider a data set D with an exactly even distribution across k different classes. Compare the running time of the one-against-rest approach with that of the one-against-one approach.
2. Discuss some general meta-strategies for speeding up classifiers. Discuss some strategies that you might use to scale up (a) nearest-neighbor classifiers, and (b) associative classifiers.
3. Describe the changes required to the dual formulation of the soft SVM classifier with hinge loss to make it a weighted classifier for rare-class learning.
4. Describe the changes required to the primal formulation of the soft SVM classifier with hinge loss to make it a weighted classifier for rare-class learning.
5. Implement the one-against-rest and one-against-one multiclass approaches. Use the nearest-neighbor algorithm as the base classifier.
6. Design a semisupervised classification algorithm with the use of a supervised modification of the k-means algorithm. How is this algorithm related to the EM-based semisupervised approach?
7. Suppose that your data was distributed into two thin and separated concentric rings, each of which belonged to one of the two classes. Suppose that you had only a small number of labeled examples from each of the rings but a lot of unlabeled examples. Would you rather use (a) the EM algorithm, (b) the transductive SVM algorithm, or (c) the graph-based approach for semisupervised classification? Why?
8. Write the optimization formulation for least-squares regression of the form y = W · X + b with a bias term b. Provide a closed-form solution for the optimal values of W and b in terms of the data matrix D and response variable vector y. Show that the optimal value of the bias term b always evaluates to 0 when the data matrix D and response variable vector y are both mean-centered.