Label prediction: In this case, a label is predicted for each test instance.
Numerical score: In most cases, the learner assigns a score to each instance–label combination that measures the propensity of the instance to belong to a particular class. Such a score is easily converted to a label prediction by taking either the maximum, or a cost-weighted maximum, of the scores across the different classes. One advantage of a score is that different test instances can be compared and ranked by their propensity to belong to a particular class. Such scores are especially useful when one of the classes is very rare, because a numerical score provides a way to identify the top-ranked candidates belonging to that class.
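To make these conversions concrete, the following minimal sketch (in Python with NumPy) turns a matrix of per-class scores into label predictions, cost-weighted predictions, and a ranking of test instances by their propensity toward a rare class. The score matrix and cost vector here are purely hypothetical illustrations, not outputs of any particular learner.

import numpy as np

# Hypothetical score matrix: rows are test instances, columns are classes.
# scores[i, j] measures the propensity of instance i to belong to class j.
scores = np.array([[0.9, 0.1],
                   [0.4, 0.6],
                   [0.7, 0.3]])

# Label prediction: pick the class with the maximum score per instance.
labels = np.argmax(scores, axis=1)

# Cost-weighted prediction: weight each class's score by an (assumed)
# misclassification-cost vector before taking the maximum.
costs = np.array([1.0, 5.0])  # the rare class (column 1) is weighted higher
weighted_labels = np.argmax(scores * costs, axis=1)

# Ranking: order the test instances by their propensity toward the rare
# class (column 1), so the top-ranked candidates can be examined first.
rare_class_ranking = np.argsort(-scores[:, 1])

print(labels)              # [0 1 0]
print(weighted_labels)     # [0 1 1] -- weighting flips the third instance
print(rare_class_ranking)  # [1 2 0]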
A subtle but important distinction exists in the design process of these two types of models, especially when numerical scores are used for ranking different test instances. In the first case, the model does not need to account for the relative classification propensity across different test instances; it only needs to worry about the relative propensity toward different labels for a specific instance. In the second case, the model also needs to normalize the classification scores across different test instances so that they can be meaningfully compared for ranking. Minor variations of most classification models can handle either the labeling or the ranking scenario.
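As an illustration of why cross-instance normalization matters, the sketch below applies a per-instance softmax, which is one common way (by no means the only one) of converting raw scores into probabilities that can be meaningfully compared across test instances. The raw score values are hypothetical.

import numpy as np

def softmax(z):
    # Per-instance softmax: normalizes each row of raw scores into a
    # probability distribution over the classes.
    z = z - z.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical raw scores on arbitrary, instance-dependent scales.
raw = np.array([[ 2.0,  1.0],    # instance A
                [20.0, 10.0]])   # instance B (same argmax, larger scale)

# For labeling, the raw within-instance argmax already suffices:
print(np.argmax(raw, axis=1))  # [0 0] -- both instances predict class 0

# For ranking by class-1 propensity, the raw column is misleading:
# instance B's raw class-1 score (10.0) dwarfs A's (1.0) purely because
# of scale. Normalizing each row first makes the values comparable.
probs = softmax(raw)
print(raw[:, 1])    # [ 1. 10.] -- would rank B far above A
print(probs[:, 1])  # [0.269 0.00005] -- A's normalized propensity exceeds B's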
When the training data set is small, the performance of classification models is sometimes poor. In such cases, the model may describe the specific random characteristics of the training data set, and it may not generalize to the group structure of previously unseen test instances. In other words, such models might accurately predict the labels of instances used to construct them, but they perform poorly on unseen test instances. This phenomenon is referred to as overfitting. This issue will be revisited several times in this chapter and the next.
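The sketch below illustrates overfitting in an extreme form: a 1-nearest-neighbor classifier memorizes a tiny training set whose labels are purely random, so it predicts the training labels perfectly while doing no better than chance on unseen instances. The synthetic data are, of course, entirely hypothetical.

import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic training set with purely random labels: there is no
# real structure to learn, so any apparent fit reflects only noise.
X_train = rng.normal(size=(20, 5))
y_train = rng.integers(0, 2, size=20)
X_test = rng.normal(size=(1000, 5))
y_test = rng.integers(0, 2, size=1000)

def nn_predict(X_tr, y_tr, X):
    # 1-nearest-neighbor prediction: copy the label of the closest
    # training instance (a model that memorizes its training data).
    dists = ((X[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=2)
    return y_tr[np.argmin(dists, axis=1)]

train_acc = (nn_predict(X_train, y_train, X_train) == y_train).mean()
test_acc = (nn_predict(X_train, y_train, X_test) == y_test).mean()
print(train_acc)  # 1.0 -- the training data are "explained" perfectly
print(test_acc)   # close to 0.5 -- no better than random guessing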
Various models have been designed for data classification. The most well-known ones include decision trees, rule-based classifiers, probabilistic models, instance-based classifiers, support vector machines, and neural networks. The modeling phase is often preceded by a feature selection phase to identify the most informative features for classification. Each of these methods will be addressed in this chapter.
This chapter is organized as follows. Section 10.2 introduces some of the common models used for feature selection. Decision trees are introduced in Sect. 10.3. Rule-based classifiers are introduced in Sect. 10.4. Section 10.5 discusses probabilistic models for data classification. Section 10.6 introduces support vector machines. Neural network classifiers are discussed in Sect. 10.7. Instance-based learning methods are explained in Sect. 10.8. Evaluation methods are discussed in Sect. 10.9. The summary is presented in Sect. 10.10.
10.2 Feature Selection for Classification
Feature selection is the first stage in the classification process. Real data may contain features of varying relevance for predicting class labels. For example, the gender of a person is less relevant for predicting a disease label such as “diabetes,” as compared to his or
her age. Irrelevant features will typically harm the accuracy of the classification model in addition to being a source of computational inefficiency. Therefore, the goal of feature selection algorithms is to select the most informative features with respect to the class label. Three primary types of methods are used for feature selection in classification.
Filter models: A crisp mathematical criterion is available to evaluate the quality of a feature or a subset of features. This criterion is then used to filter out irrelevant features, as in the sketch below.
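One example of such a crisp criterion is the Fisher score, which compares the between-class scatter of a feature's class-wise means to its within-class variance; features with higher scores discriminate better between the classes. The sketch below computes this score per feature on hypothetical synthetic data and retains the top-ranked feature.

import numpy as np

def fisher_score(X, y):
    # Fisher score of each feature: ratio of the between-class scatter
    # of the class means to the within-class variance. Higher values
    # suggest more discriminative features.
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / within

# Hypothetical data: feature 0 separates the classes; feature 1 is noise.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = np.column_stack([y + 0.3 * rng.normal(size=100),  # informative feature
                     rng.normal(size=100)])           # irrelevant feature

scores = fisher_score(X, y)
keep = np.argsort(-scores)[:1]  # filter step: retain the top-ranked feature(s)
print(scores)  # feature 0 scores far higher than feature 1
print(keep)    # [0]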