\[
\sigma = \sqrt{\frac{\sum_{i=1}^{b} (\delta a_i - \Delta A)^2}{b-1}}. \qquad (10.79)
\]
Note that the sign of ΔA tells us which classifier is better than the other. For example, if ΔA > 0 then model M1 has higher average accuracy than M2. In such a case, it is desired
to determine a statistical measure of the confidence (or, a probability value) that M1 is truly better than M2.
The idea here is to assume that the different samples δa_1 . . . δa_b are drawn from a normal distribution. Therefore, the estimated mean and standard deviation of this distribution are given by ΔA and σ, respectively. The standard deviation of the estimated mean ΔA of b samples is therefore σ/√b according to the central limit theorem. Then, the number of standard deviations s by which ΔA differs from the break-even accuracy difference of 0 is as follows:
\[
s = \frac{\sqrt{b}\,|\Delta A - 0|}{\sigma}. \qquad (10.80)
\]
When b is large, the standard normal distribution with zero mean and unit variance can be used to quantify the probability that one classifier is truly better than the other. The probability in one of the symmetric tails of the standard normal distribution, more than s standard deviations away from the mean, gives the probability that the observed variation is not significant and might simply be a result of chance. Subtracting this probability from 1 yields the confidence that one classifier is truly better than the other.
It is often computationally expensive to use large values of b. In such cases, it is no longer possible to estimate the standard deviation σ robustly with the use of a small number b of samples. To adjust for this, the Student’s t-distribution with (b − 1) degrees of freedom is used instead of the normal distribution. This distribution is very similar to the normal distribution, except that it has a heavier tail to account for the greater estimation uncertainty. In fact, for large values of b, the t-distribution with (b − 1) degrees of freedom converges to the normal distribution.
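As a concrete illustration of this test, the sketch below computes ΔA, the value of σ from Eq. 10.79, the statistic s from Eq. 10.80, and the resulting confidence using SciPy's normal and Student's t-distributions. The function name compare_classifiers, its array inputs, and the cutoff of b < 30 for switching to the t-distribution are illustrative assumptions rather than prescriptions from the text; the per-sample accuracies themselves must be produced by your own cross-validation or bootstrap loop.

```python
import numpy as np
from scipy import stats

def compare_classifiers(acc_m1, acc_m2):
    """Significance test for the accuracy difference of two classifiers.

    acc_m1, acc_m2: per-sample accuracies of models M1 and M2 measured on
    the same b resampled data sets (illustrative inputs; produce them with
    your own cross-validation or bootstrap procedure).
    """
    delta = np.asarray(acc_m1, dtype=float) - np.asarray(acc_m2, dtype=float)  # δa_i
    b = len(delta)
    delta_A = delta.mean()                        # ΔA
    sigma = delta.std(ddof=1)                     # σ as in Eq. 10.79 (divisor b - 1)
    s = np.sqrt(b) * abs(delta_A - 0.0) / sigma   # Eq. 10.80

    # Tail probability beyond s standard deviations: the chance that the
    # observed difference is not significant. For large b the standard
    # normal suffices; for small b the Student's t-distribution with
    # (b - 1) degrees of freedom accounts for the extra uncertainty in σ.
    p_chance = stats.t.sf(s, df=b - 1) if b < 30 else stats.norm.sf(s)
    confidence = 1.0 - p_chance                   # confidence the winner is truly better
    better = "M1" if delta_A > 0 else "M2"
    return better, s, confidence
```

For example, with acc_m1 and acc_m2 holding accuracies from ten identical folds, a returned confidence close to 1 indicates that the sign of ΔA is unlikely to be a chance artifact.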
10.9.2.2 Output as Numerical Score
In many scenarios, the output of the classification algorithm is reported as a numerical score associated with each test instance and label value. In cases where the numerical score can be reasonably compared across test instances (e.g., the probability values returned by a Bayes classifier), it is possible to compare the different test instances in terms of their relative propensity to belong to a specific class. Such scenarios are more common when one of the classes of interest is rare. Therefore, in this case, it is more meaningful to use the binary class scenario, in which one of the classes is the positive class and the other is the negative class. The discussion below is similar to that in Sect. 8.8.2 of Chap. 8 on external validity measures for outlier analysis. This similarity arises from the fact that outlier validation with class labels is identical to classifier evaluation.
The advantage of a numerical score is that it provides more flexibility in evaluating the trade-off that results from labeling a varying number of data points as positives. This is achieved by using a threshold on the numerical score for the positive class to define the binary label. If the threshold is selected too aggressively in order to minimize the number of declared positive instances, then the algorithm will miss true positive instances (false negatives). On the other hand, if the threshold is chosen too loosely, it will lead to too many false positives. Thus, there is a trade-off between false positives and false negatives. The problem is that the “correct” threshold to use is never known exactly in a real scenario. However, the entire trade-off curve can be quantified using a variety of measures, and two algorithms can be compared over the entire trade-off curve. Two examples of such curves are the precision–recall curve and the receiver operating characteristic (ROC) curve.
For any given threshold t on the predicted positive-class score, the declared positive class set is denoted by S(t). As t changes, the size of S(t) changes as well. Let G represent
the true set (ground-truth set) of positive instances in the data set. Then, for any given threshold t, the precision is defined as the percentage of reported positives that truly turn out to be positive.
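To make the threshold mechanism concrete, the following sketch sweeps t over the observed scores and reports the precision of S(t) with respect to G at each value, together with recall (the percentage of ground-truth positives captured), which is the standard companion measure implied by the precision–recall curve. The function name and input arrays are hypothetical; only the definitions of S(t), G, and precision follow the text.

```python
import numpy as np

def precision_recall_points(scores, labels):
    """Trace the trade-off curve by sweeping the threshold t over all scores.

    scores: numerical positive-class scores of the test instances
            (e.g., Bayes posterior probabilities).
    labels: 1 if the instance belongs to the ground-truth positive set G,
            0 otherwise. Assumes at least one positive instance exists.
    Returns a list of (threshold, precision, recall) triples in percent.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    size_G = labels.sum()                          # |G|
    points = []
    for t in np.sort(np.unique(scores))[::-1]:     # relax the threshold step by step
        declared = scores >= t                     # S(t): declared positive set
        true_pos = int(np.sum(declared & (labels == 1)))
        precision = 100.0 * true_pos / int(declared.sum())
        recall = 100.0 * true_pos / size_G
        points.append((float(t), precision, recall))
    return points
```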