Data Mining: The Textbook

$$\sigma = \sqrt{\frac{\sum_{i=1}^{b} (\delta a_i - \Delta A)^2}{b - 1}} \qquad (10.79)$$
Note that the sign of ΔA tells us which classifier is better than the other. For example, if ΔA > 0, then model M1 has higher average accuracy than M2. In such a case, it is desired to determine a statistical measure of the confidence (or, a probability value) that M1 is truly better than M2.
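For concreteness, a minimal Python sketch of this computation is shown below. The per-fold accuracies acc_m1 and acc_m2 are hypothetical values, not from the text; the standard deviation follows Eq. 10.79.

```python
import numpy as np

# Hypothetical per-sample accuracies of models M1 and M2 over b = 5
# paired evaluations (e.g., the same 5 cross-validation folds).
acc_m1 = np.array([0.84, 0.81, 0.86, 0.83, 0.85])
acc_m2 = np.array([0.82, 0.80, 0.83, 0.84, 0.81])

delta_a = acc_m1 - acc_m2      # paired differences delta_a_1 ... delta_a_b
b = len(delta_a)

mean_diff = delta_a.mean()     # Delta A: the mean accuracy difference
# Sample standard deviation with (b - 1) in the denominator, as in Eq. 10.79.
sigma = np.sqrt(((delta_a - mean_diff) ** 2).sum() / (b - 1))

print(mean_diff, sigma)        # Delta A > 0 suggests M1 is more accurate
```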


The idea here is to assume that the different samples δa1 . . . δab are sampled from a normal distribution. Therefore, the estimated mean and standard deviation of this distribution are given by ΔA and σ, respectively. According to the central limit theorem, the standard deviation of the estimated mean ΔA of b samples is σ/√b. Then, the number of standard deviations s by which ΔA is different from the break-even accuracy difference of 0 is as follows:













$$s = \frac{\sqrt{b}\,|\Delta A - 0|}{\sigma} \qquad (10.80)$$

When b is large, the standard normal distribution with zero mean and unit variance can be used to quantify the probability that one classifier is truly better than the other. The probability in any one of the symmetric tails of the standard normal distribution, more than s standard deviations away from the mean, provides the probability that this variation is not significant, and it might be a result of chance. This probability is subtracted from 1 to determine the confidence that one classifier is truly better than the other.
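A sketch of this confidence computation is shown below; the input values are illustrative stand-ins for the quantities ΔA, σ, and b obtained in the earlier sketch.

```python
import numpy as np
from scipy.stats import norm

# Illustrative values for Delta A, sigma, and b from the previous sketch.
mean_diff, sigma, b = 0.018, 0.019, 5

s = np.sqrt(b) * abs(mean_diff - 0.0) / sigma    # Eq. 10.80

# One-tailed probability beyond s standard deviations: the probability that
# the observed accuracy difference is a result of chance.
tail_prob = norm.sf(s)
confidence = 1.0 - tail_prob
print(s, confidence)
```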


It is often computationally expensive to use large values of b. In such cases, it is no longer possible to estimate the standard deviation σ robustly with the use of a small number b of samples. To adjust for this, the Student’s t-distribution with (b − 1) degrees of freedom is used instead of the normal distribution. This distribution is very similar to the normal distribution, except that it has a heavier tail to account for the greater estimation uncertainty. In fact, for large values of b, the t-distribution with (b − 1) degrees of freedom converges to the normal distribution.
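For small b, the same confidence computation can be carried out with the heavier-tailed t-distribution; a sketch with illustrative values for s and b follows.

```python
from scipy.stats import t

# Same statistic s and sample count b as above (illustrative values).
s, b = 2.1, 5

# One-tailed probability under the Student's t-distribution with
# (b - 1) degrees of freedom, which penalizes the small sample size.
tail_prob_t = t.sf(s, df=b - 1)
confidence_t = 1.0 - tail_prob_t
print(confidence_t)
```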


10.9.2.2 Output as Numerical Score


In many scenarios, the output of the classification algorithm is reported as a numerical score associated with each test instance and label value. In cases where the numerical score can be reasonably compared across test instances (e.g., the probability values returned by a Bayes classifier), it is possible to compare the different test instances in terms of their relative propensity to belong to a specific class. Such scenarios are more common when one of the classes of interest is rare. Therefore, for this scenario, it is more meaningful to use the binary class scenario where one of the classes is the positive class, and the other class is the negative class. The discussion below is similar to the discussion in Sect. 8.8.2 of Chap. 8 on external validity measures for outlier analysis. This similarity arises from the fact that outlier validation with class labels is identical to classifier evaluation.
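As one illustration of such scores, many libraries expose per-instance class probabilities directly. The sketch below uses scikit-learn's Gaussian naive Bayes classifier on hypothetical data (an assumed setup, not prescribed by the text) to obtain a positive-class score for each test instance.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical two-class training data; class 1 is the rare positive class.
X_train = np.array([[0.10], [0.20], [0.90], [1.00], [0.15], [0.85]])
y_train = np.array([0, 0, 1, 1, 0, 1])
X_test = np.array([[0.30], [0.80]])

model = GaussianNB().fit(X_train, y_train)
# Column 1 holds P(positive class | instance): a numerical score that is
# comparable across test instances.
scores = model.predict_proba(X_test)[:, 1]
print(scores)
```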


The advantage of a numerical score is that it provides more flexibility in evaluating the overall trade-off between labeling a varying number of data points as positives. This is achieved by using a threshold on the numerical score for the positive class to define the binary label. If the threshold is selected too aggressively to minimize the number of declared positive class instances, then the algorithm will miss true-positive class instances (false negatives). On the other hand, if the threshold is chosen in a more relaxed way, this will lead to too many false positives. This leads to a trade-off between the false positives and false negatives. The problem is that the “correct” threshold to use is never known exactly in a real scenario. However, the entire trade-off curve can be quantified using a variety of measures, and two algorithms can be compared over the entire trade-off curve. Two examples of such curves are the precision–recall curve and the receiver operating characteristic (ROC) curve.

For any given threshold t on the predicted positive-class score, the declared positive class set is denoted by S(t). As t changes, the size of S(t) changes as well. Let G represent the true set (ground-truth set) of positive instances in the data set. Then, for any given threshold t, the precision is defined as the percentage of reported positives that truly turn out to be positive.
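A worked formalization of this definition, using the sets S(t) and G introduced above (the percentage form is an assumption consistent with the sentence, not quoted from the text):

$$\text{Precision}(t) = 100 \cdot \frac{|\mathcal{S}(t) \cap \mathcal{G}|}{|\mathcal{S}(t)|}$$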




