Data Mining: The Textbook




10.9 CLASSIFIER EVALUATION







Table 10.2: Rank of ground-truth positive instances

Algorithm           Rank of positive class instances
----------------    --------------------------------
Algorithm A         1, 5, 8, 15, 20
Algorithm B         3, 7, 11, 13, 15
Random Algorithm    17, 36, 45, 59, 66
Perfect Oracle      1, 2, 3, 4, 5


[Plot data omitted: both panels show curves for Algorithm A, Algorithm B, the random algorithm, and the perfect oracle. Panel (a) plots the true positive rate (recall) against the false positive rate; panel (b) plots precision against recall. Both axes range from 0 to 100.]
(a) ROC (b) Precision-Recall

Figure 10.13: ROC curve and precision–recall curves


The ranks of the five ground-truth positive instances are shown in Table 10.2 for the different algorithms. These include the ranks produced by a random algorithm, which outputs a random score for each data point, and by a "perfect oracle" algorithm that ranks the correct top five points in the positive class. The resulting ROC curves are illustrated in Fig. 10.13a, and the corresponding precision-recall curves in Fig. 10.13b. While precision-recall curves are not quite as easy to interpret as ROC curves, the relative trends between the different algorithms are the same in both cases. In general, ROC curves are used more frequently because the quality of an algorithm is easily interpreted relative to a random classifier.
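The ROC curves of Fig. 10.13a can be reconstructed from the ranks alone by sweeping a rank threshold t and counting true and false positives among the top-t instances. The following is a minimal sketch; the test-set size of 100 instances is an assumption for illustration (the example does not state the total size).

```python
def roc_points(positive_ranks, n_total):
    """Compute (FPR, TPR) points by sweeping the rank threshold t = 0..n_total.

    positive_ranks: 1-based ranks of the ground-truth positives in the
    scored list, with the most confident prediction ranked first.
    """
    pos = set(positive_ranks)
    n_pos = len(pos)
    n_neg = n_total - n_pos
    points = []
    tp = fp = 0
    for t in range(n_total + 1):
        if t > 0:
            if t in pos:
                tp += 1   # the instance at rank t is a true positive
            else:
                fp += 1   # the instance at rank t is a false positive
        points.append((fp / n_neg, tp / n_pos))
    return points

# Ranks taken from Table 10.2, assuming a hypothetical test set of 100 points.
curve_a = roc_points([1, 5, 8, 15, 20], 100)
curve_oracle = roc_points([1, 2, 3, 4, 5], 100)
```

The oracle's curve jumps straight to a true positive rate of 1.0 after the first five thresholds, which is why it traces the upper-left boundary of the plot.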

What do these curves really tell us? For cases in which one curve strictly dominates another, the algorithm for the former curve is clearly superior. For example, it is immediately evident that the oracle is superior to all the other algorithms, and the random algorithm is inferior to all the other algorithms. On the other hand, algorithms A and B dominate in different parts of the ROC curve. In such cases, it is hard to say that one algorithm is strictly superior. From Table 10.2, it is clear that Algorithm A ranks three of the correct positive instances very highly, but the remaining two positive instances are ranked poorly. In the case of Algorithm B, the highest-ranked positive instances are not ranked as well as in Algorithm A, though all five positive instances are found at a much earlier rank threshold. Correspondingly, Algorithm A dominates on the earlier part of the ROC curve, whereas Algorithm B dominates on the later part. It is possible to use the area under the ROC curve (AUC) as a proxy for the overall effectiveness of an algorithm.
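The area under the ROC curve can be computed directly from the ranks via its pairwise interpretation: AUC is the fraction of positive-negative pairs in which the positive instance is ranked above the negative one. A minimal sketch, again assuming a hypothetical test set of 100 instances:

```python
def auc_from_ranks(positive_ranks, n_total):
    """AUC = fraction of (positive, negative) pairs in which the positive
    instance receives a better (smaller) rank than the negative instance."""
    pos = set(positive_ranks)
    negatives = [r for r in range(1, n_total + 1) if r not in pos]
    better = sum(1 for p in positive_ranks for q in negatives if p < q)
    return better / (len(positive_ranks) * len(negatives))

# Ranks from Table 10.2; the test-set size of 100 is an assumption.
auc_a = auc_from_ranks([1, 5, 8, 15, 20], 100)        # ~0.928
auc_b = auc_from_ranks([3, 7, 11, 13, 15], 100)       # ~0.928 (a tie)
auc_random = auc_from_ranks([17, 36, 45, 59, 66], 100)
auc_oracle = auc_from_ranks([1, 2, 3, 4, 5], 100)     # 1.0
```

Notably, with these particular ranks Algorithms A and B tie on AUC (441/475 in both cases), even though their curves dominate in different regions. This illustrates why a single summary number can hide trade-offs that are visible in the full curve.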



CHAPTER 10. DATA CLASSIFICATION

10.10 Summary


The problem of data classification can be considered a supervised version of data clustering, in which a predefined set of groups is provided to a learner. This predefined set of groups is used for training the classifier to categorize unseen test examples into groups. A wide variety of models have been proposed for data classification.


Decision trees create a hierarchical model of the training data. The path taken by a test instance through the tree determines its predicted class, and each path in the tree can be viewed as a classification rule. Rule-based classifiers can be viewed as a generalization of decision trees, in which the classifier is not necessarily restricted to characterizing the data hierarchically; therefore, multiple conflicting rules may cover the same training or test instance. Probabilistic classifiers estimate the probability of each class label for unseen test instances from their feature values. The naive Bayes rule or a logistic function may be used for effective estimation of these probabilities. SVMs and neural networks are two forms of linear classifiers that differ in the objective functions they optimize: SVMs use the maximum margin principle, whereas neural networks approximately optimize the least-squares error of prediction. Instance-based learning methods are classifiers that delay learning to classification time, as opposed to eager learners that construct classification models up front. The simplest form of instance-based learning is the nearest-neighbor classifier. Many complex variations are possible by using different types of distance functions and locality-centric models.
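The lazy-learning idea behind the nearest-neighbor classifier can be illustrated with a minimal 1-nearest-neighbor sketch. The Euclidean distance and the toy data points here are hypothetical choices for illustration:

```python
import math

def nn_classify(train, query):
    """1-nearest-neighbor classification: there is no training phase, and
    all the work is deferred to classification time (a 'lazy' learner).

    train: list of (feature_vector, label) pairs.
    """
    # Return the label of the training point closest to the query.
    return min(train, key=lambda pair: math.dist(pair[0], query))[1]

# Hypothetical two-dimensional training data.
train = [((0.0, 0.0), "neg"), ((1.0, 1.0), "pos"), ((0.9, 1.2), "pos")]
nn_classify(train, (1.0, 0.9))  # nearest point is (1.0, 1.0) -> "pos"
```

Replacing `math.dist` with another distance function, or taking a majority vote over the k closest points, yields the usual k-nearest-neighbor variations.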


Classifier evaluation is important for testing the relative effectiveness of different training models. Numerous methodologies, such as holdout, stratified sampling, bootstrap, and cross-validation, have been proposed in the literature. Classifier evaluation can be performed either in the context of label assignment or in that of numerical scoring. For label assignment, either the accuracy or the cost-sensitive accuracy may be used. For numerical scoring, the ROC curve is used to quantify the trade-off between the true positive and false positive rates.
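The stratification idea mentioned above (preserving class proportions in each evaluation split) can be sketched with a minimal standard-library implementation of stratified k-fold cross-validation; the function name and interface are illustrative, not from the text:

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation,
    where each fold approximately preserves the class proportions of `labels`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    # Deal the shuffled indices of each class round-robin across the folds,
    # so every fold receives roughly 1/k of each class.
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    for t in range(k):
        test = sorted(folds[t])
        train = sorted(i for f in range(k) if f != t for i in folds[f])
        yield train, test

# Hypothetical imbalanced label list: 80 negatives, 20 positives.
labels = [0] * 80 + [1] * 20
folds = list(stratified_kfold(labels, k=5))
```

With 5 folds, each test fold here contains exactly 16 negatives and 4 positives, matching the 80/20 class ratio of the full data.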


10.11 Bibliographic Notes


The problem of data classification has been studied extensively by the data mining, machine learning, and pattern recognition communities. A number of books on these topics are available from these different communities [33, 95, 189, 256, 389]. Two surveys on the topic of data classification may be found in [286, 330]. A recent book [33] contains surveys on various aspects of data classification.


Feature selection is an important problem in data classification, to ensure that the modeling algorithm is not confused by noise in the training data. Two books on feature selection may be found in [360, 366]. Fisher's discriminant analysis was first proposed in [207], although a slightly different variant, with the assumption of normally distributed data, is used in linear discriminant analysis [379]. The most well-known decision tree algorithms include ID3 [431], C4.5 [430], and CART [110]. Decision tree methods are also used in the context of multivariate splits [116], though these methods are computationally more challenging. Surveys on decision tree algorithms may be found in [121, 393, 398]. Decision trees can be converted into rule-based classifiers in which the rules are mutually exclusive. For example, the C4.5 method has been extended to the C4.5rules algorithm [430]. Other popular rule-based systems include AQ [386], CN2 [177], and RIPPER [178]. Much of the discussion in this chapter was based on these algorithms. Popular associative classification algorithms include CBA [358], CPAR [529], and CMAR [349]. Methods for classification with discriminative



