Data Mining: The Textbook




[Figure: a decision tree with univariate splits. The root tests Age (≤ 50 vs. > 50); the left branch then splits on Salary at 60,000 and the right branch splits on Salary at 50,000. Each leaf is labeled Non Donor or Donor and lists the covered training instances (Name, Donor?).]

(a) Univariate split



















[Figure: a decision tree with a single multivariate split — Age/50 + Salary/50,000 ≤ 2 (Non Donor) versus Age/50 + Salary/50,000 > 2 (Donor). Each leaf lists the covered training instances (Name, Donor?).]

(b) Multivariate split


Figure 10.3: Illustration of univariate and multivariate splits for decision tree construction


It is assumed that a quality criterion for evaluating a split is available. The design of the split criterion depends on the nature of the underlying attribute:





  1. Binary attribute: Only one type of split is possible, and the tree is always binary. Each branch corresponds to one of the binary values.




  2. Categorical attribute: If a categorical attribute has r different values, there are multiple ways to split it. One possibility is to use an r-way split, in which each branch of the split corresponds to a particular attribute value. The other possibility is to use a binary split by testing each of the 2^(r−1) − 1 combinations (or groupings) of categorical values, and selecting the best one. This is obviously not a feasible option when the value of r is large. A simple approach that is sometimes used is to convert categorical



Algorithm GenericDecisionTree(Data Set: D)
begin
  Create root node containing D;
  repeat
    Select an eligible node in the tree;
    Split the selected node into two or more nodes based on a pre-defined split criterion;
  until no more eligible nodes for split;
  Prune overfitting nodes from tree;
  Label each leaf node with its dominant class;
end


Figure 10.4: Generic decision tree training algorithm
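As a concrete illustration of the control flow in Fig. 10.4, the following is a minimal Python sketch of the same skeleton. The names (TreeNode, grow_tree, is_eligible) and the eligibility test are illustrative assumptions rather than part of the book's algorithm; the split criterion is passed in as a function because its design is discussed next.

from collections import Counter

class TreeNode:
    def __init__(self, data):
        self.data = data        # list of (features, label) pairs at this node
        self.children = []      # empty for a leaf
        self.split = None       # split condition chosen at this node
        self.label = None       # dominant class (set for leaves)

def is_eligible(node, min_size=5):
    # Illustrative eligibility test: the node is large enough and impure.
    labels = [y for _, y in node.data]
    return len(labels) >= min_size and len(set(labels)) > 1

def grow_tree(data, split_fn):
    # split_fn(data) -> (condition, [partition_1, ..., partition_r]) or None.
    root = TreeNode(data)
    frontier = [root]
    while frontier:                       # "repeat ... until no eligible nodes"
        node = frontier.pop()
        if not is_eligible(node):
            continue
        result = split_fn(node.data)      # pre-defined split criterion
        if result is None:
            continue
        node.split, partitions = result
        node.children = [TreeNode(part) for part in partitions]
        frontier.extend(node.children)
    # (pruning of overfitting nodes would go here; see Sect. 10.3.2)
    label_leaves(root)
    return root

def label_leaves(node):
    # Label each leaf node with its dominant class.
    if not node.children:
        node.label = Counter(y for _, y in node.data).most_common(1)[0][0]
    for child in node.children:
        label_leaves(child)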


data to binary data with the use of the binarization approach discussed in Chap. 2. In this case, the approach for binary attributes may be used.





  3. Numeric attribute: If the numeric attribute contains a small number r of ordered values (e.g., integers in a small range [1, r]), it is possible to create an r-way split for each distinct value. However, for continuous numeric attributes, the split is typically performed by using a binary condition, such as x ≤ a, for attribute value x and constant a.

Consider the case where a node contains m data points. Then there are m possible split points for the attribute, and the corresponding values of a may be determined by sorting the data in the node along this attribute. One possibility is to test all the possible values of a for a split and select the best one. A faster alternative is to test only a smaller set of possibilities for a, based on equi-depth division of the range.
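To make the split-point search concrete, the following hedged sketch sorts the node's values along one numeric attribute and evaluates the condition x ≤ a at the midpoint between each pair of consecutive distinct values (one common variant of testing all possible values of a), scoring each candidate with the weighted error rate defined in the list below. The function names are illustrative.

from collections import Counter

def weighted_error(partitions):
    # Weighted average of (1 - fraction of dominant class) over the partitions.
    n = sum(len(p) for p in partitions)
    total = 0.0
    for part in partitions:
        if part:
            p_dom = Counter(part).most_common(1)[0][1] / len(part)
            total += (len(part) / n) * (1.0 - p_dom)
    return total

def best_numeric_split(values, labels):
    # values: attribute values of the node's m points; labels: their classes.
    pairs = sorted(zip(values, labels))
    best_a, best_score = None, float('inf')
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no split between equal values
        a = (pairs[i - 1][0] + pairs[i][0]) / 2.0   # candidate threshold
        left = [y for x, y in pairs if x <= a]
        right = [y for x, y in pairs if x > a]
        score = weighted_error([left, right])
        if score < best_score:
            best_a, best_score = a, score
    return best_a, best_score

# Example: salaries with Donor (Y) / Non-Donor (N) labels.
a, err = best_numeric_split([30_000, 45_000, 52_000, 70_000], ['N', 'N', 'Y', 'Y'])
# a is 48,500.0 and err is 0.0 for this toy data.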


Many of the aforementioned methods require the determination of the “best” split from a set of choices. Specifically, one needs to choose from among multiple attributes and from the various alternatives available for splitting each attribute. Therefore, quantifications of split quality are required. These quantifications are based on the same principles as the feature selection criteria discussed in Sect. 10.2.





  1. Error rate: Let p be the fraction of the instances in a set of data points S belonging to the dominant class. Then, the error rate is simply 1 − p. For an r-way split of set S into sets S1 . . . Sr, the overall error rate of the split may be quantified as the weighted average of the error rates of the individual sets Si, where the weight of Si is |Si|. The split with the lowest error rate is selected from the alternatives. (A code sketch of all three criteria appears after this list.)




  2. Gini index: The Gini index G(S) for a set S of data points may be computed according to Eq. 10.1 on the class distribution p1 . . . pk of the training data points in S:

G(S) = 1 − Σ_{j=1}^{k} pj²    (10.8)




The overall Gini index for an r-way split of set S into sets S1 . . . Sr may be quantified as the weighted average of the Gini index values G(Si) of each Si, where the weight of Si is |Si|:

Gini-Split(S ⇒ S1 . . . Sr) = Σ_{i=1}^{r} (|Si|/|S|) G(Si)    (10.9)
The split with the lowest Gini index is selected from the alternatives. The CART algorithm uses the Gini index as the split criterion.





  3. Entropy: The entropy measure is used in one of the earliest classification algorithms, referred to as ID3. The entropy E(S) for a set S may be computed according to Eq. 10.3 on the class distribution p1 . . . pk of the training data points in the node:




E(S) = − Σ_{j=1}^{k} pj log2(pj)    (10.10)




As in the case of the Gini index, the overall entropy for an r-way split of set S into sets S1 . . . Sr may be computed as the weighted average of the entropy values E(Si) of each Si, where the weight of Si is |Si|:


Entropy-Split(S ⇒ S1 . . . Sr) = Σ_{i=1}^{r} (|Si|/|S|) E(Si)    (10.11)

Lower values of the entropy are more desirable. The entropy measure is used by the ID3 and C4.5 algorithms.


The information gain is closely related to entropy, and is equal to the reduction in entropy, E(S) − Entropy-Split(S ⇒ S1 . . . Sr), as a result of the split. Large values of the reduction are desirable. At a conceptual level, there is no difference between using either of the two for a split, although a normalization for the degree of the split is possible in the case of information gain. Note that the entropy and information gain measures should be used only to compare two splits of the same degree, because both measures are naturally biased in favor of splits with larger degree. For example, a split on a categorical attribute with many distinct values will tend to be preferred over one with few. It has been shown by the C4.5 algorithm that dividing the overall information gain by the normalization factor −Σ_{i=1}^{r} (|Si|/|S|) log2(|Si|/|S|) helps in adjusting for the varying number of categorical values.
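Since the three criteria differ only in the impurity measure averaged over the children, they can be captured in a few lines. The sketch below (function names illustrative) implements the error rate, the Gini index of Eq. 10.8, the entropy of Eq. 10.10, the weighted split quality of Eqs. 10.9 and 10.11, and the C4.5-style normalization of the information gain.

import math
from collections import Counter

def class_fractions(labels):
    n = len(labels)
    return [c / n for c in Counter(labels).values()]

def error_rate(labels):
    return 1.0 - max(class_fractions(labels))                  # 1 - p

def gini(labels):
    return 1.0 - sum(p * p for p in class_fractions(labels))   # Eq. 10.8

def entropy(labels):
    return -sum(p * math.log2(p) for p in class_fractions(labels))  # Eq. 10.10

def split_quality(partitions, impurity):
    # Weighted average of impurity over S1 ... Sr (Eqs. 10.9 and 10.11).
    n = sum(len(p) for p in partitions)
    return sum((len(p) / n) * impurity(p) for p in partitions if p)

def information_gain(parent, partitions):
    return entropy(parent) - split_quality(partitions, entropy)

def gain_ratio(parent, partitions):
    # C4.5-style normalization; assumes at least two non-empty partitions.
    n = len(parent)
    split_info = -sum((len(p) / n) * math.log2(len(p) / n)
                      for p in partitions if p)
    return information_gain(parent, partitions) / split_info

# Example: splitting 4 donors and 4 non-donors into two pure halves.
parent = ['Y'] * 4 + ['N'] * 4
parts = [['Y'] * 4, ['N'] * 4]
# gini(parent) = 0.5, split_quality(parts, gini) = 0.0,
# information_gain(parent, parts) = 1.0, gain_ratio(parent, parts) = 1.0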






















The aforementioned criteria are used to select both the split attribute and the precise condition on that attribute. For example, in the case of a numeric database, different split points are tested for each numeric attribute, and the best split is selected.


10.3.2 Stopping Criterion and Pruning


The stopping criterion for the growth of the decision tree is intimately related to the underlying pruning strategy. When the decision tree is grown to the very end, until every leaf node contains only instances belonging to a particular class, the resulting tree exhibits 100% accuracy on the training data. However, it often generalizes poorly to unseen test instances because it has overfit to the random characteristics of the training instances. Most of this noise is contributed by the lower-level nodes, which contain a smaller number of data points. In general, simpler models (shallow decision trees) are preferable to more complex models (deep decision trees) if they produce the same error on the training data.




To reduce the level of overfitting, one possibility is to stop the growth of the tree early. Unfortunately, there is no way of knowing the correct point at which to stop the growth of the tree. Therefore, a natural strategy is to prune overfitting portions of the decision tree and convert internal nodes to leaf nodes. Many different criteria are available to decide whether a node should be pruned. One strategy is to explicitly penalize model complexity with the use of the minimum description length (MDL) principle. In this approach, the cost of a tree is defined by a weighted sum of its (training data) error and its complexity (e.g., the number of nodes). Information-theoretic principles are used to measure tree complexity. Therefore, the tree is constructed to optimize the cost rather than only the error. The main problem with this approach is that the cost function is itself a heuristic that does not work consistently well across different data sets. A simpler and more intuitive strategy is to hold out a small fraction (say, 20%) of the training data and build the decision tree on the remaining data. The impact of pruning a node on the classification accuracy is tested on the holdout set. If the pruning improves the classification accuracy, then the node is pruned. Nodes are iteratively pruned until it is no longer possible to improve the accuracy with pruning. Although such an approach reduces the amount of training data for building the tree, the impact of pruning generally outweighs the impact of training-data loss in the tree-building phase.
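The holdout-based pruning loop just described can be sketched as follows, reusing the TreeNode structure from the earlier sketch. classify(tree, features) and make_leaf(node) are assumed callbacks (prediction and leaf labeling, respectively), and only internal nodes whose children are all leaves are considered for pruning; these are illustrative simplifications, not the book's prescription.

def holdout_accuracy(tree, holdout, classify):
    correct = sum(1 for x, y in holdout if classify(tree, x) == y)
    return correct / len(holdout)

def prune_candidates(node):
    # Internal nodes whose children are all leaves.
    if node.children and all(not c.children for c in node.children):
        yield node
    for child in node.children:
        yield from prune_candidates(child)

def reduced_error_prune(tree, holdout, classify, make_leaf):
    # Convert internal nodes to leaves while holdout accuracy improves.
    improved = True
    while improved:
        improved = False
        base = holdout_accuracy(tree, holdout, classify)
        for node in list(prune_candidates(tree)):
            saved = node.children
            node.children = []                # tentatively convert to a leaf
            make_leaf(node)                   # e.g., label with dominant class
            if holdout_accuracy(tree, holdout, classify) > base:
                improved = True               # keep the pruned version
                break
            node.children = saved             # revert the tentative pruning
            node.label = None
    return tree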


10.3.3 Practical Issues

Decision trees are simple to implement and highly interpretable. They can model arbitrarily complex decision boundaries, given sufficient training data. Even a univariate decision tree can model a complex decision boundary with piecewise approximations by building a sufficiently deep tree. The main problem is that the amount of training data required to properly approximate a complex boundary with a treelike model is very large, and it increases with data dimensionality. With limited training data, the resulting decision boundary is usually a rather coarse approximation of the true boundary. Overfitting is common in such cases. This problem is exacerbated by the sensitivity of the decision tree to the split criteria at the higher levels of the tree. A closely related family of classifiers, referred to as rule-based classifiers, is able to alleviate these effects by moving away from the strictly hierarchical structure of a decision tree.


10.4 Rule-Based Classifiers

Rule-based classifiers use a set of “if–then” rules R = {R1 . . . Rm} to match antecedents to consequents. A rule is typically expressed in the following form:


IF Condition THEN Conclusion.


The condition on the left-hand side of the rule, also referred to as the antecedent, may contain a variety of logical operators, such as <, ≤, >, =, ⊆, or ∈, which are applied to the feature variables. The right-hand side of the rule is referred to as the consequent, and it contains the class variable. Therefore, a rule Ri is of the form Qi ⇒ c, where Qi is the antecedent and c is the class variable. The “⇒” symbol denotes the “THEN” condition. The rules are generated from the training data during the training phase. The notation Qi represents a precondition on the feature set. In some classifiers, such as association pattern classifiers, the precondition may correspond to a pattern in the feature space, though this may not always be the case. In general, the precondition may be any arbitrary condition




on the feature variables. These rules are then used to classify a test instance. A rule is said to cover a training instance when the condition in its antecedent matches the training instance.


A decision tree may be viewed as a special case of a rule-based classifier, in which each path of the decision tree corresponds to a rule. For example, the decision tree in Fig. 10.3a corresponds to the following set of rules:




Age ≤ 50 AND Salary ≤ 60,000 ⇒ ¬Donor
Age ≤ 50 AND Salary > 60,000 ⇒ Donor
Age > 50 AND Salary ≤ 50,000 ⇒ ¬Donor
Age > 50 AND Salary > 50,000 ⇒ Donor

Note that each of the four aforementioned rules corresponds to a path in the decision tree of Fig. 10.3a. The logical expression on the left-hand side is expressed in conjunctive form, with a set of “AND” logical operators. Each of the primitive conditions in the antecedent (such as Age ≤ 50) is referred to as a conjunct. The rule set from a training data set is not unique and depends on the specific algorithm at hand. For example, only two rules are generated from the decision tree in Fig. 10.3b:




Age/50 + Salary/50,000 ≤ 2 ⇒ ¬Donor
Age/50 + Salary/50,000 > 2 ⇒ Donor

As in decision trees, succinct rules, both in terms of the cardinality of the rule set and the number of conjuncts in each rule, are generally more desirable. This is because such rules are less likely to overfit the data, and will generalize well to unseen test instances. Note that the antecedents on the left-hand side always correspond to a rule condition. In many rule-based classifiers, such as association-pattern classifiers, logical operators such as “∈” are implicit and are omitted from the rule antecedent description. For example, consider the case where the age and salary are discretized into categorical attribute values:




Age ∈ [50:60], Salary ∈ [50,000:60,000] ⇒ Donor

In such a case, the discretized attributes for age and salary will be represented as “items,” and an association pattern-mining algorithm can discover the itemset on the left-hand side. The operator “∈” is implicit in the rule antecedent. Associative classifiers are discussed in detail later in this section.
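A small hedged sketch makes the itemset form concrete (the discretization and all names are illustrative): the instance is mapped to a set of items, and a rule fires when its antecedent itemset is contained in that set, so the “∈” operator never appears explicitly.

# Rule: {Age=[50:60], Salary=[50,000:60,000]} => Donor
rule_antecedent = {"Age=[50:60]", "Salary=[50,000:60,000]"}
rule_consequent = "Donor"

def to_items(instance):
    # Hypothetical discretization of a raw instance into items.
    items = set()
    if 50 <= instance['Age'] <= 60:
        items.add("Age=[50:60]")
    if 50_000 <= instance['Salary'] <= 60_000:
        items.add("Salary=[50,000:60,000]")
    return items

def rule_fires(antecedent, instance):
    # The antecedent itemset must be a subset of the instance's items.
    return antecedent <= to_items(instance)

print(rule_fires(rule_antecedent, {'Age': 55, 'Salary': 55_000}))  # True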


The training phase of a rule-based algorithm creates a set of rules. The classification phase for a test instance discovers all rules that are triggered by the test instance. A rule is said to be triggered by the test instance when the logical condition in the antecedent is satisfied by the test instance. In some cases, rules with conflicting consequent values are triggered by the test instance. In such cases, methods are required to resolve the conflicts in class label prediction. Rule sets may satisfy one or more of the following properties:





  1. Mutually exclusive rules: Each rule covers a disjoint partition of the data. Therefore, at most one rule can be triggered by a test instance. The rules generated from a decision tree satisfy this property. However, if the extracted rules are subsequently modified to reduce overfitting (as in some classifiers such as C4.5rules), the resulting rules may no longer remain mutually exclusive.




  2. Exhaustive rules: The entire data space is covered by at least one rule. Therefore, every test instance triggers at least one rule. The rules generated from a decision tree



also satisfy this property. It is usually easy to construct an exhaustive rule set by creating a single catch-all rule whose consequent contains the dominant class in the portion of the training data not covered by other rules.


It is relatively easy to perform the classification when a rule set satisfies both the aforementioned properties. The reason is that each test instance maps to exactly one rule, and there are no conflicts in class predictions by multiple rules. In cases where rule sets are not mutually exclusive, conflicts among the rules triggered by a test instance can be resolved in one of two ways:





  1. Rule ordering: The rules are ordered by priority, which may be defined in a variety of ways. One possibility is to use a quality measure of the rule for ordering. Some popular classification algorithms, such as C4.5rules and RIPPER, use class-based ordering, where rules of a particular class are prioritized over the others. The resulting set of ordered rules is also referred to as a decision list. For an arbitrary test instance, the class label in the consequent of the top triggered rule is reported as the relevant one for the test instance. Any other triggered rule is ignored. If no rule is triggered, then a default catch-all class is reported as the relevant one.




  2. Unordered rules: No priority is imposed on the rule ordering. The dominant class label among all the triggered rules may be reported. Such an approach can be more robust because it is not sensitive to the choice of the single rule selected by a rule-ordering scheme. The training phase is generally more efficient because all rules can be extracted simultaneously with pattern-mining techniques without worrying about relative ordering. Ordered rule-mining algorithms generally have to integrate the rule ordering into the rule generation process with methods such as sequential covering, which are computationally expensive. On the other hand, the testing phase of an unordered approach can be more expensive because of the need to compare a test instance against all the rules.

How should the different rules be ordered for test-instance classification? The first possibility is to order the rules on the basis of a quality criterion, such as the confidence of the rule, or a weighted measure of the support and confidence. However, this approach is rarely used. In most cases, the rules are ordered by class. In some rare-class applications, it makes sense to order all rules belonging to the rare class first. Such an approach is used by RIPPER. In other classifiers, such as C4.5rules, various accuracy and information-theoretic measures are used to prioritize classes.
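The triggering logic of an ordered rule list takes only a few lines. In this hedged sketch, each rule is a (predicate, class) pair stored in priority order; the two example rules are the Donor paths of Fig. 10.3a, and the default class plays the role of the catch-all rule.

# Each rule: (antecedent predicate over a feature dict, consequent class).
rules = [
    (lambda x: x['Age'] > 50 and x['Salary'] > 50_000, 'Donor'),
    (lambda x: x['Age'] <= 50 and x['Salary'] > 60_000, 'Donor'),
]
DEFAULT_CLASS = 'Non-Donor'      # catch-all for uncovered instances

def classify_with_list(instance, rules, default=DEFAULT_CLASS):
    for antecedent, consequent in rules:   # rules are in priority order
        if antecedent(instance):           # the top triggered rule wins
            return consequent
    return default                         # no rule was triggered

print(classify_with_list({'Age': 60, 'Salary': 75_000}, rules))  # 'Donor'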


10.4.1 Rule Generation from Decision Trees

As discussed earlier in this section, rules can be extracted from the different paths in a decision tree. For example, C4.5rules extracts the rules from the C4.5 decision tree. The sequence of split criteria on each path of the decision tree corresponds to the antecedent of a corresponding rule. Therefore, it would seem at first sight that rule ordering is not needed because the generated rules are exhaustive and mutually exclusive. However, the rule-extraction process is followed by a rule-pruning phase in which many conjuncts are pruned from the rules to reduce overfitting. Rules are processed one by one, and conjuncts are pruned from them in greedy fashion to improve the accuracy as much as possible on the covered examples in a separate holdout validation set. This approach is similar to decision tree pruning except that one is no longer restricted to pruning the conjuncts at the lower levels of the decision tree. Therefore, the pruning process is more flexible than that of a




decision tree, because it is not restricted by an underlying tree structure. Duplicate rules may result from pruning of conjuncts. These rules are removed. The rule-pruning phase increases the coverage of the individual rules and, therefore, the mutually exclusive nature of the rules is lost. As a result, it again becomes necessary to order the rules.


In C4.5rules, all rules that belong to the class whose rule set has the smallest description length are prioritized over other rules. The total description length of a rule set is a weighted sum of the number of bits required to encode the size of the model (rule set) and the number of examples in the training data covered by the class-specific rule set that belong to a different class. Typically, classes with a smaller number of training examples are favored by this approach. A second approach is to order first the class whose rule set has the fewest false-positive errors on a separate holdout set. A rule-based version of a decision tree generally allows the construction of a more flexible decision boundary with limited training data than the base tree from which the rules are generated. This is primarily because of the greater flexibility in the model, which is no longer restrained by the straitjacket of an exhaustive and mutually exclusive rule set. As a result, the approach generalizes better to unseen test instances.


10.4.2 Sequential Covering Algorithms

Sequential covering methods are used frequently for creating ordered rule lists. Thus, in this case, the classification process uses the top triggered rule to classify unseen test instances. Examples of sequential covering algorithms include AQ, CN2, and RIPPER. The sequential covering approach iteratively applies the following two steps to grow the rules from the training data set D until a stopping criterion is met:





  1. (Learn-One-Rule) Select a particular class label and determine the “best” rule from the current training instances D with this class label as the consequent. Add this rule to the bottom of the ordered rule list.




  2. (Prune training data) Remove the training instances in D that are covered by the rule learned in the previous step. All training instances matching the antecedent of the rule must be removed, whether or not the class label of the training instance matches the consequent.

The aforementioned generic description applies to all sequential covering algorithms. The various sequential covering algorithms mainly differ in the details of how the rules are ordered with respect to each other.
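Putting the two steps together, the following is a minimal sketch of the covering loop under class-based ordering (rarest classes first) with a simple stopping criterion based on the number of uncovered instances; both choices are discussed below. learn_one_rule and covers are assumed helper functions, not library calls.

def sequential_covering(data, classes, learn_one_rule, covers,
                        min_remaining=5):
    # data: list of (features, label); classes ordered rarest first.
    decision_list = []
    remaining = list(data)
    for c in classes[:-1]:            # rules for the last class are not grown
        while sum(1 for _, y in remaining if y == c) >= min_remaining:
            rule = learn_one_rule(remaining, c)   # best rule with consequent c
            if rule is None:
                break
            decision_list.append((rule, c))
            # Remove every covered instance, regardless of its label.
            new_remaining = [(x, y) for x, y in remaining
                             if not covers(rule, x)]
            if len(new_remaining) == len(remaining):
                break                 # rule covers nothing; avoid looping
            remaining = new_remaining
    decision_list.append((None, classes[-1]))     # catch-all rule
    return decision_list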





  1. Class-based ordering: In most sequential covering algorithms such as RIPPER, all rules corresponding to a particular class are generated and placed contiguously on the ordered list. Typically, rare classes are ordered first. Therefore, classes that are placed earlier on the list may be favored more than others. This can sometimes cause artificially lower accuracy for test instances belonging to the less favored class.

When class-based ordering is used, the rules for a particular class are generated contiguously. The addition of rules for each class has a stopping criterion that is algorithm dependent. For example, RIPPER uses an MDL criterion that stops adding rules when further addition increases the description length of the model by at least a predefined number of units. Another, simpler stopping criterion is when the error rate of the next generated rule on a separate validation set exceeds a predefined threshold. Finally, one might simply use a threshold on the number of uncovered training instances remaining for a class as the class-specific stopping criterion. When the number of uncovered




training instances remaining for a class falls below a threshold, rules for that class consequent are no longer grown. At this point, rules corresponding to the next class are grown. For a k-class problem, this approach is repeated (k − 1) times. Rules for the kth class are not grown. The least prioritized rule is a single catch-all rule with its consequent as the kth class. When a test instance does not trigger any rule belonging to the other classes, this class is assumed to be the relevant label.





  2. Quality-based ordering: In some covering algorithms, class-based ordering is not used. A quality measure is used to select the next rule. For example, one might generate the rule with the highest confidence in the remaining training data. The catch-all rule corresponds to the dominant class among the remaining training instances. Quality-based ordering is rarely used in practice because of the difficulty in interpreting a quality criterion that is defined only over the remaining training instances.

Because class-based ordering is more common, the Learn-One-Rule procedure will be described under this assumption.


10.4.2.1 Learn-One-Rule


The Learn-One-Rule procedure grows rules from the general to the specific, in much the same way a decision tree grows a tree hierarchically from general nodes to specific nodes. Note that a path in a decision tree is a rule in which the antecedent corresponds to the conjunction of the split criteria at the different nodes, and the consequent corresponds to the label of the leaf nodes. While a decision tree grows many different disjoint paths at one time, the Learn-One-Rule procedure grows a single “best” path. This is yet another example of the close relationship between decision trees and rule-based methods.


The idea of Learn-One-Rule is to successively add conjuncts to the left-hand side of the rule to grow a single decision path (rather than a decision tree) based on a quality criterion. The root of the tree corresponds to the rule {} ⇒ c. The class c represents the consequent of the rule being grown. In the simplest version of the procedure, a single path is grown at one time by successively adding conjuncts to the antecedent. In other words, conjuncts are added to increase the quality as much as possible. The simplest quality criterion is the accuracy of the rule. The problem with this criterion is that rules with high accuracy but very low coverage are generally not desirable because of overfitting. The precise choice of the quality criterion that regulates the trade-off between accuracy and coverage will be discussed in detail later. As in the case of a decision tree, various logical conditions (or split choices) must be tested to determine the best conjunct to be added. The process of enumeration of the various split choices is similar to a decision tree. The rule is grown until a particular stopping criterion is met. A natural stopping criterion is one where the quality of the rule does not improve by further growth.


One challenge with the use of this procedure is that if a mistake is made early on during rule growth, it will lead to suboptimal rules. One way of reducing the likelihood of suboptimal rules is to always maintain the m best paths during rule growth rather than a single one. An example of rule growth with the use of a single decision path, for the donor example of Table 10.1, is illustrated in Fig. 10.5. In this case, the rule is grown for the donor class. The first conjunct added is Age > 50, and the second conjunct added is Salary > 50,000. Note the intuitive similarity between the decision tree of Fig. 10.3a and the decision path of Fig. 10.5.
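A hedged sketch of this general-to-specific growth follows: starting from the empty antecedent {} ⇒ c, conjuncts are greedily added as long as the quality improves. Plain rule accuracy is used purely for illustration (as noted above, it tends to overfit; refined criteria are discussed next), and candidate_conjuncts is an assumed enumeration of the possible split conditions.

def rule_accuracy(conjuncts, data, c):
    # Fraction of covered instances that actually belong to class c.
    covered = [y for x, y in data if all(cond(x) for cond in conjuncts)]
    return sum(1 for y in covered if y == c) / len(covered) if covered else 0.0

def learn_one_rule(data, c, candidate_conjuncts):
    # Grow a single decision path {} => c by greedily adding conjuncts.
    antecedent = []
    quality = rule_accuracy(antecedent, data, c)
    while True:
        best_conjunct, best_quality = None, quality
        for conjunct in candidate_conjuncts:
            q = rule_accuracy(antecedent + [conjunct], data, c)
            if q > best_quality:
                best_conjunct, best_quality = conjunct, q
        if best_conjunct is None:        # no improvement: stop growing
            return antecedent
        antecedent.append(best_conjunct)
        quality = best_quality

# Illustrative conjuncts for the donor example:
candidates = [lambda x: x['Age'] > 50, lambda x: x['Salary'] > 50_000]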


It remains to describe the quality criterion for the growth of the paths during the Learn-One-Rule procedure. On what basis is a particular path selected over the others? The



10.4. RULE-BASED CLASSIFIERS

303







[Figure 10.5: growth of a single decision path for the donor example, starting from the empty rule { } ⇒ Donor and successively adding the conjuncts Age > 50 and Salary > 50,000.]

