Difficult classification scenarios: Many scenarios of the classification problem are much more challenging. These include multiclass scenarios, rare-class scenarios, and cases where the size of the training data is large.
Enhancing classification: Classification methods can be enhanced with additional data-centric input, user-centric input, or multiple models.
The difficult classification scenarios that are addressed in this chapter are as follows:
Multiclass learning: Although many classifiers, such as decision trees, Bayesian methods, and rule-based classifiers, can be used directly for multiclass learning, some models, such as support-vector machines, are naturally designed for binary classification. Therefore, numerous meta-algorithms have been designed for adapting binary classifiers to multiclass learning.
Rare class learning: The positive and negative examples may be imbalanced. In other words, the data set contains only a small number of positive examples. Applying traditional learning models directly often results in the classifier assigning all examples to the negative class. Such a classifier may achieve high accuracy, yet it is not very informative in imbalanced scenarios, where misclassification of the rare class incurs a much higher cost than misclassification of the normal class.
C. C. Aggarwal, Data Mining: The Textbook, DOI 10.1007/978-3-319-14142-8_11
© Springer International Publishing Switzerland 2015
Scalable learning: The sizes of typical training data sets have increased significantly in recent years. Therefore, it is important to design models that can perform the learning in a scalable way. In cases where the data is not memory resident, it is important to design algorithms that can minimize disk accesses.
Numeric class variables: Most of the discussion in this book assumes that the class variables are categorical. Classification algorithms require suitable modifications when the class variables are numeric. This problem is also referred to as regression modeling.
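To make the multiclass item above more concrete, the following is a minimal sketch of one common meta-algorithm, the one-against-rest approach, in which one binary model is trained per class and the class with the highest score is predicted. The binary learner used here (a nearest-centroid scorer) is only an illustrative stand-in for any binary classifier, and all function names and the toy data are assumptions made for this example rather than part of the original text.

```python
from typing import Callable, Dict, Hashable, List, Sequence

Point = Sequence[float]
Scorer = Callable[[Point], float]


def centroid(points: List[Point]) -> List[float]:
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def train_binary(X: List[Point], y: List[int]) -> Scorer:
    """Toy binary learner (hypothetical stand-in for any binary classifier).

    The returned score is larger when a point lies closer to the positive
    centroid than to the negative centroid.
    """
    pos = centroid([x for x, label in zip(X, y) if label == 1])
    neg = centroid([x for x, label in zip(X, y) if label == 0])
    return lambda x: sq_dist(x, neg) - sq_dist(x, pos)


def train_one_vs_rest(X: List[Point], labels: List[Hashable]) -> Dict[Hashable, Scorer]:
    """Train one binary model per class: that class against all others."""
    models: Dict[Hashable, Scorer] = {}
    for c in sorted(set(labels)):
        y = [1 if label == c else 0 for label in labels]
        models[c] = train_binary(X, y)
    return models


def predict_one_vs_rest(models: Dict[Hashable, Scorer], x: Point) -> Hashable:
    """Predict the class whose binary model assigns the highest score."""
    return max(models, key=lambda c: models[c](x))


# Toy three-class data set (assumed for illustration only).
X = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (0.0, 5.0), (0.1, 5.2)]
labels = ["a", "a", "b", "b", "c", "c"]

models = train_one_vs_rest(X, labels)
queries = [(0.1, 0.0), (4.9, 5.1), (0.0, 4.8)]
predictions = [predict_one_vs_rest(models, q) for q in queries]
print(predictions)
```

Because each binary model only needs to produce a comparable score, the same wrapper could in principle be placed around other binary learners, provided their scores are on a comparable scale across classes.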
The addition of more training data or the simultaneous use of a larger number of classification models can improve the learning accuracy. A number of methods have been proposed to enhance classification in this way. Examples include the following:
Semisupervised learning: In these cases, unlabeled examples are used to improve the effectiveness of classifiers. Although unlabeled data does not contain any information about the label distribution, it does contain a significant amount of information about the manifold and clustering structure of the underlying data. Because the classification problem is a supervised version of the clustering problem, this connection can be leveraged to improve the classification accuracy. The core idea is that in most real data sets, labels vary in a smooth way over dense regions of the data. The determination of dense regions in the data only requires unlabeled information.
Active learning: In real life, it is often expensive to acquire labels. In active learning, the user (or an oracle) is actively involved in determining the most informative examples for which the labels need to be acquired. Typically, these are examples that give the user more accurate knowledge about the uncertain regions in the data, where the distribution of the class label is unknown.
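As a rough illustration of how "most informative" examples might be chosen in active learning, the sketch below uses uncertainty sampling, one simple query heuristic: the unlabeled points whose predicted probability of the positive class is closest to 0.5 are selected for labeling. The text above does not prescribe this particular heuristic. The function `most_uncertain`, the model `toy_model`, and the pool of points are all hypothetical and assumed purely for the example.

```python
import math
from typing import Callable, List, Sequence

Point = Sequence[float]


def most_uncertain(
    pool: List[Point],
    prob_positive: Callable[[Point], float],
    k: int = 1,
) -> List[int]:
    """Return indices of the k pool points whose P(positive) is closest to 0.5."""
    ranked = sorted(
        range(len(pool)),
        key=lambda i: abs(prob_positive(pool[i]) - 0.5),
    )
    return ranked[:k]


def toy_model(x: Point) -> float:
    """Hypothetical current model: logistic curve with its boundary at x[0] = 2."""
    return 1.0 / (1.0 + math.exp(-(x[0] - 2.0)))


# Unlabeled pool (assumed for illustration only).
pool = [(0.0,), (1.9,), (4.0,), (2.5,)]

# Indices of the two points that would be sent to the user (or oracle) for labeling.
query = most_uncertain(pool, toy_model, k=2)
print(query)
```

In a full active learning loop, the newly labeled points would be added to the training data, the model would be retrained, and the selection step would be repeated until the labeling budget is exhausted.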