Data Mining: The Textbook




10.9 CLASSIFIER EVALUATION







Table 10.2: Rank of ground-truth positive instances

Algorithm           Rank of positive class instances
----------------    --------------------------------
Algorithm A         1, 5, 8, 15, 20
Algorithm B         3, 7, 11, 13, 15
Random Algorithm    17, 36, 45, 59, 66
Perfect Oracle      1, 2, 3, 4, 5


[Plot data omitted: both panels show curves for Algorithm A, Algorithm B, the random algorithm, and the perfect oracle. Panel (a) plots the true positive rate (recall) against the false positive rate; panel (b) plots precision against recall. Both axes range from 0 to 100.]
(a) ROC (b) Precision-Recall

Figure 10.13: ROC curve and precision–recall curves


The ranks of the five ground-truth positive instances are shown in Table 10.2 for the different algorithms. These include the ranks produced by a random algorithm, which outputs a random score for each data point, and by a "perfect oracle" algorithm that ranks the correct top five points in the positive class. The resulting ROC curves are illustrated in Fig. 10.13a, and the corresponding precision-recall curves in Fig. 10.13b. While precision-recall curves are not quite as easy to interpret as ROC curves, the relative trends between the different algorithms are the same in both cases. In general, ROC curves are used more frequently because the quality of an algorithm is easily interpreted relative to a random classifier.
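The ROC curves of Fig. 10.13a can be reconstructed from the ranks alone by sweeping a rank threshold t and counting true and false positives among the top-t instances. The following is a minimal sketch; the test-set size of 100 instances is an assumption for illustration (the example does not state the total size).

```python
def roc_points(positive_ranks, n_total):
    """Compute (FPR, TPR) points by sweeping the rank threshold t = 0..n_total.

    positive_ranks: 1-based ranks of the ground-truth positives in the
    scored list, with the most confident prediction ranked first.
    """
    pos = set(positive_ranks)
    n_pos = len(pos)
    n_neg = n_total - n_pos
    points = []
    tp = fp = 0
    for t in range(n_total + 1):
        if t > 0:
            if t in pos:
                tp += 1   # the instance at rank t is a true positive
            else:
                fp += 1   # the instance at rank t is a false positive
        points.append((fp / n_neg, tp / n_pos))
    return points

# Ranks taken from Table 10.2, assuming a hypothetical test set of 100 points.
curve_a = roc_points([1, 5, 8, 15, 20], 100)
curve_oracle = roc_points([1, 2, 3, 4, 5], 100)
```

The oracle's curve jumps straight to a true positive rate of 1.0 after the first five thresholds, which is why it traces the upper-left boundary of the plot.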

What do these curves really tell us? For cases in which one curve strictly dominates another, the algorithm for the former curve is clearly superior. For example, it is immediately evident that the oracle is superior to all the other algorithms, and the random algorithm is inferior to all the other algorithms. On the other hand, algorithms A and B dominate in different parts of the ROC curve. In such cases, it is hard to say that one algorithm is strictly superior. From Table 10.2, it is clear that Algorithm A ranks three of the correct positive instances very highly, but the remaining two positive instances are ranked poorly. In the case of Algorithm B, the highest-ranked positive instances are not ranked as well as in Algorithm A, though all five positive instances are found at a much earlier rank threshold. Correspondingly, Algorithm A dominates on the earlier part of the ROC curve, whereas Algorithm B dominates on the later part. It is possible to use the area under the ROC curve (AUC) as a proxy for the overall effectiveness of an algorithm.
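The area under the ROC curve can be computed directly from the ranks via its pairwise interpretation: AUC is the fraction of positive-negative pairs in which the positive instance is ranked above the negative one. A minimal sketch, again assuming a hypothetical test set of 100 instances:

```python
def auc_from_ranks(positive_ranks, n_total):
    """AUC = fraction of (positive, negative) pairs in which the positive
    instance receives a better (smaller) rank than the negative instance."""
    pos = set(positive_ranks)
    negatives = [r for r in range(1, n_total + 1) if r not in pos]
    better = sum(1 for p in positive_ranks for q in negatives if p < q)
    return better / (len(positive_ranks) * len(negatives))

# Ranks from Table 10.2; the test-set size of 100 is an assumption.
auc_a = auc_from_ranks([1, 5, 8, 15, 20], 100)        # ~0.928
auc_b = auc_from_ranks([3, 7, 11, 13, 15], 100)       # ~0.928 (a tie)
auc_random = auc_from_ranks([17, 36, 45, 59, 66], 100)
auc_oracle = auc_from_ranks([1, 2, 3, 4, 5], 100)     # 1.0
```

Notably, with these particular ranks Algorithms A and B tie on AUC (441/475 in both cases), even though their curves dominate in different regions. This illustrates why a single summary number can hide trade-offs that are visible in the full curve.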



CHAPTER 10. DATA CLASSIFICATION

10.10 Summary


The problem of data classification can be considered a supervised version of data clustering, in which a predefined set of groups is provided to a learner. This predefined set of groups is used for training the classifier to categorize unseen test examples into groups. A wide variety of models have been proposed for data classification.


Decision trees create a hierarchical model of the training data. The path taken by a test instance through the tree determines its predicted class, and each path in the tree can be viewed as a classification rule. Rule-based classifiers can be viewed as a generalization of decision trees, in which the classifier is not necessarily restricted to characterizing the data hierarchically; therefore, multiple conflicting rules may cover the same training or test instance. Probabilistic classifiers estimate the probability of each class label for unseen test instances from their feature values. The naive Bayes rule or a logistic function may be used for effective estimation of these probabilities. SVMs and neural networks are two forms of linear classifiers that differ in the objective functions they optimize: SVMs use the maximum margin principle, whereas neural networks approximately optimize the least-squares error of prediction. Instance-based learning methods are classifiers that delay learning to classification time, as opposed to eager learners that construct classification models up front. The simplest form of instance-based learning is the nearest-neighbor classifier. Many complex variations are possible by using different types of distance functions and locality-centric models.
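The lazy-learning idea behind the nearest-neighbor classifier can be illustrated with a minimal 1-nearest-neighbor sketch. The Euclidean distance and the toy data points here are hypothetical choices for illustration:

```python
import math

def nn_classify(train, query):
    """1-nearest-neighbor classification: there is no training phase, and
    all the work is deferred to classification time (a 'lazy' learner).

    train: list of (feature_vector, label) pairs.
    """
    # Return the label of the training point closest to the query.
    return min(train, key=lambda pair: math.dist(pair[0], query))[1]

# Hypothetical two-dimensional training data.
train = [((0.0, 0.0), "neg"), ((1.0, 1.0), "pos"), ((0.9, 1.2), "pos")]
nn_classify(train, (1.0, 0.9))  # nearest point is (1.0, 1.0) -> "pos"
```

Replacing `math.dist` with another distance function, or taking a majority vote over the k closest points, yields the usual k-nearest-neighbor variations.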


Classifier evaluation is important for testing the relative effectiveness of different training models. Numerous methodologies, such as holdout, stratified sampling, bootstrap, and cross-validation, have been proposed in the literature. Classifier evaluation can be performed either in the context of label assignment or in that of numerical scoring. For label assignment, either the accuracy or the cost-sensitive accuracy may be used. For numerical scoring, the ROC curve is used to quantify the trade-off between the true positive and false positive rates.
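The stratification idea mentioned above (preserving class proportions in each evaluation split) can be sketched with a minimal standard-library implementation of stratified k-fold cross-validation; the function name and interface are illustrative, not from the text:

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation,
    where each fold approximately preserves the class proportions of `labels`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    # Deal the shuffled indices of each class round-robin across the folds,
    # so every fold receives roughly 1/k of each class.
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    for t in range(k):
        test = sorted(folds[t])
        train = sorted(i for f in range(k) if f != t for i in folds[f])
        yield train, test

# Hypothetical imbalanced label list: 80 negatives, 20 positives.
labels = [0] * 80 + [1] * 20
folds = list(stratified_kfold(labels, k=5))
```

With 5 folds, each test fold here contains exactly 16 negatives and 4 positives, matching the 80/20 class ratio of the full data.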


10.11 Bibliographic Notes


The problem of data classification has been studied extensively by the data mining, machine learning, and pattern recognition communities. A number of books on these topics are available from these different communities [33, 95, 189, 256, 389]. Two surveys on the topic of data classification may be found in [286, 330]. A recent book [33] contains surveys on various aspects of data classification.


Feature selection is an important problem in data classification, to ensure that the modeling algorithm is not confused by noise in the training data. Two books on feature selection may be found in [360, 366]. Fisher's discriminant analysis was first proposed in [207], although a slightly different variant, with the assumption of normally distributed data, is used in linear discriminant analysis [379]. The most well-known decision tree algorithms include ID3 [431], C4.5 [430], and CART [110]. Decision tree methods are also used in the context of multivariate splits [116], though these methods are computationally more challenging. Surveys on decision tree algorithms may be found in [121, 393, 398]. Decision trees can be converted into rule-based classifiers in which the rules are mutually exclusive. For example, the C4.5 method has been extended to the C4.5rules algorithm [430]. Other popular rule-based systems include AQ [386], CN2 [177], and RIPPER [178]. Much of the discussion in this chapter was based on these algorithms. Popular associative classification algorithms include CBA [358], CPAR [529], and CMAR [349]. Methods for classification with discriminative



