[Figure 10.5: Rule growth is analogous to decision tree construction. The figure depicts a decision tree that splits first on Age (Age < 50 versus Age > 50) and then on Salary (Salary < 50,000 versus Salary > 50,000); the path along Age > 50 followed by Salary > 50,000 corresponds to the rule (Age > 50) AND (Salary > 50,000) ⇒ Donor.]
The similarity between rule growth and decision tree construction suggests the use of analogous measures, such as the accuracy, entropy, or the Gini index, as used for split criteria in decision trees.
These criteria do, however, need to be modified, because a rule is relevant only to the training examples covered by its antecedent and to the single class in its consequent, whereas decision tree splits are evaluated with respect to all training examples at a given node and all classes. Furthermore, decision tree split measures do not need to account for issues such as the coverage of the rule. One would like to determine rules with high coverage in order to avoid overfitting. For example, a rule that covers only a single training instance always has 100% accuracy, but it usually does not generalize well to unseen test instances. Therefore, one strategy is to combine the accuracy and coverage criteria into a single integrated measure.
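To make the discussion concrete, the following sketch (the function name and the example counts are hypothetical, not from the text) computes the accuracy, entropy, and Gini index of the class distribution among the training examples covered by a rule's antecedent:

```python
import math

def rule_quality(covered_class_counts, target_class):
    """Accuracy, entropy, and Gini index of the examples covered by a rule's antecedent."""
    total = sum(covered_class_counts.values())
    if total == 0:
        return 0.0, 0.0, 0.0
    fractions = [count / total for count in covered_class_counts.values()]
    accuracy = covered_class_counts.get(target_class, 0) / total  # fraction in the consequent class
    entropy = -sum(p * math.log2(p) for p in fractions if p > 0)  # lower is purer
    gini = 1.0 - sum(p * p for p in fractions)                    # lower is purer
    return accuracy, entropy, gini

# Hypothetical rule covering 40 donors and 10 non-donors
print(rule_quality({"Donor": 40, "NonDonor": 10}, "Donor"))  # (0.8, ~0.72, 0.32)
```

Note that none of these three measures, taken by itself, rewards higher coverage; this is precisely the limitation that the integrated measures below address.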
The simplest combination approach is to use Laplacian smoothing with a parameter β that regulates the level of smoothing in a training data set with k classes:
$$\mathrm{Laplace}(\beta) = \frac{n_+ + \beta}{n_+ + n_- + k\beta} \qquad (10.12)$$
The parameter β > 0 controls the level of smoothing, n+ represents the number of correctly classified (positive) examples covered by the rule, and n− represents the number of incorrectly classified (negative) examples covered by the rule. Therefore, the total number of covered examples is n+ + n−. When the absolute number of covered examples n+ + n− is very small, Laplacian smoothing penalizes the accuracy to account for the unreliability of low coverage. Therefore, the measure favors rules with greater coverage.
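A minimal sketch of the Laplacian-smoothed measure of Equation 10.12 follows (the function name and the example counts are illustrative assumptions):

```python
def laplace_measure(n_pos, n_neg, k, beta=1.0):
    """Laplacian-smoothed rule accuracy (Equation 10.12).

    n_pos -- covered examples that match the class in the rule's consequent
    n_neg -- covered examples that belong to other classes
    k     -- number of classes in the training data
    beta  -- smoothing parameter (> 0); larger values penalize low coverage more
    """
    return (n_pos + beta) / (n_pos + n_neg + k * beta)

# A rule covering a single correctly classified example no longer scores a perfect 1.0 ...
print(laplace_measure(n_pos=1, n_neg=0, k=2))    # (1 + 1) / (1 + 0 + 2) ≈ 0.67
# ... whereas a rule with the same raw accuracy but far greater coverage scores close to 1.0.
print(laplace_measure(n_pos=100, n_neg=0, k=2))  # (100 + 1) / (100 + 0 + 2) ≈ 0.99
```

The two calls illustrate the coverage argument above: both rules have 100% raw accuracy, but the smoothed measure favors the rule that covers more examples.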
A second possibility is the likelihood ratio statistic. Let n_j be the observed number of training data points covered by the rule that belong to class j, and let n_j^e be the expected number of covered examples that would belong to class j if the class distribution of the covered examples were the same as that of the full training data. In other words, if p_1 ... p_k denote the fractions of examples belonging to each class in the full training data, then we have:

$$n_j^e = p_j \sum_{i=1}^{k} n_i$$
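A hedged sketch of this computation follows. The expected counts implement the relation just stated; the final 2·Σ n_j ln(n_j / n_j^e) form of the likelihood ratio statistic is the standard G-statistic and is an assumption here, since the excerpt only defines the observed and expected counts:

```python
import math

def expected_counts(observed_counts, class_fractions):
    """Expected per-class counts n_j^e = p_j * (total covered), assuming the covered
    examples follow the class distribution of the full training data."""
    total = sum(observed_counts)
    return [p * total for p in class_fractions]

def likelihood_ratio(observed_counts, class_fractions):
    """Likelihood ratio (G) statistic comparing observed and expected class counts;
    this exact formula is an assumption, not given in the excerpt."""
    expected = expected_counts(observed_counts, class_fractions)
    return 2.0 * sum(n * math.log(n / e)
                     for n, e in zip(observed_counts, expected) if n > 0)

# Hypothetical rule: covers 45 donors and 5 non-donors, while donors
# make up 30% of the full training data.
print(likelihood_ratio([45, 5], [0.3, 0.7]))  # large value: the covered distribution
                                              # deviates strongly from the baseline
```

Larger values indicate that the class distribution of the covered examples deviates more strongly from the baseline distribution of the full training data, which is the desirable behavior for a rule quality measure.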