$$
F \;=\; \frac{\sum_{j=1}^{k} p_j\,(\mu_j - \mu)^2}{\sum_{j=1}^{k} p_j\,\sigma_j^2}. \tag{10.5}
$$
The numerator quantifies the average interclass separation, whereas the denominator quantifies the average intraclass separation. The attributes with the largest values of the Fisher score may be selected for use with the classification algorithm.
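As a concrete illustration of Eq. 10.5, the following minimal NumPy sketch scores every attribute of a small synthetic data set. The function name fisher_score and the toy data are illustrative choices, not anything prescribed by the text.

```python
import numpy as np

def fisher_score(X, y):
    """Compute the Fisher score of Eq. 10.5 for every column of X.

    X : (n, d) array of numeric features
    y : (n,) array of class labels
    Returns a (d,) array of scores (higher = more discriminative).
    """
    classes, counts = np.unique(y, return_counts=True)
    p = counts / len(y)                     # fractional presence p_j of each class
    mu = X.mean(axis=0)                     # global mean of each attribute

    numer = np.zeros(X.shape[1])            # sum_j p_j (mu_j - mu)^2
    denom = np.zeros(X.shape[1])            # sum_j p_j sigma_j^2
    for pj, c in zip(p, classes):
        Xc = X[y == c]
        numer += pj * (Xc.mean(axis=0) - mu) ** 2
        denom += pj * Xc.var(axis=0)
    return numer / denom

# Toy example: the first attribute separates the classes, the second is noise,
# so the first score should come out much larger than the second.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([3, 0], 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(fisher_score(X, y))
```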
10.2.1.4 Fisher’s Linear Discriminant
Fisher’s linear discriminant may be viewed as a generalization of the Fisher score in which newly created features correspond to linear combinations of the original features rather than a subset of the original features. Each such linear combination defines a direction that is designed to have a high level of discriminatory power with respect to the class labels. Fisher’s discriminant can be viewed as a supervised dimensionality reduction method, in contrast to PCA, which maximizes the preserved variance in the feature space but does not maximize the class-specific discrimination. For example, the most discriminating direction is aligned with the highest-variance direction in Fig. 10.2a, but it is aligned with the lowest-variance direction in Fig. 10.2b. In each case, projecting the data along the most discriminating direction W maximizes the ratio of interclass to intraclass separation. How can we determine such a d-dimensional vector W?
The selection of a direction with high discriminative power is based on the same quantification as the Fisher score. Fisher’s discriminant is naturally defined for the two-class scenario, although generalizations exist for the case of multiple classes. Let μ0 and μ1 be the d-dimensional row vectors representing the means of the data points in the two classes, and let Σ0 and Σ1 be the corresponding d × d covariance matrices in which the (i, j)th entry represents the covariance between dimensions i and j for that class. The fractional presences of the two classes are denoted by p0 and p1, respectively. Then, the equivalent Fisher score FS(W) for a d-dimensional row vector W may be written in terms of scatter matrices, which are weighted versions of covariance matrices:
$$
FS(W) \;=\; \frac{\text{Between-class scatter along } W}{\text{Within-class scatter along } W}
\;\propto\; \frac{(W \cdot \mu_1 \;-\; W \cdot \mu_0)^2}{p_0[\text{Variance(Class 0)}] \;+\; p_1[\text{Variance(Class 1)}]}
$$
$$
=\; \frac{[\,W \cdot (\mu_1 - \mu_0)\,]^2}{p_0\,[W \Sigma_0 W^T] \;+\; p_1\,[W \Sigma_1 W^T]}
\;=\; \frac{W\,[(\mu_1 - \mu_0)^T (\mu_1 - \mu_0)]\,W^T}{W\,(p_0 \Sigma_0 + p_1 \Sigma_1)\,W^T}.
$$
Figure 10.2: Impact of class distribution on Fisher’s discriminating direction. (a) The discriminating direction is aligned with the high-variance direction; (b) the discriminating direction is aligned with the low-variance direction.
Note that the quantity WΣiW^T in the aforementioned expressions represents the variance of the projection along W of a data set whose covariance matrix is Σi. This result is derived in Sect. 2.4.3.1 of Chap. 2. The rank-1 matrix Sb = [(μ1 − μ0)^T(μ1 − μ0)] is also referred to¹ as the (scaled) between-class scatter matrix, and the matrix Sw = (p0Σ0 + p1Σ1) is the (scaled) within-class scatter matrix. The quantification FS(W) is a direct generalization of the axis-parallel score in Eq. 10.5 to an arbitrary direction W. The goal is to determine a direction W that maximizes the Fisher score. It can be shown² that the optimal direction W*, expressed as a row vector, is given by the following:
$$
W^* \;\propto\; (\mu_1 - \mu_0)\,(p_0 \Sigma_0 + p_1 \Sigma_1)^{-1}. \tag{10.6}
$$
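A minimal NumPy sketch of Eq. 10.6 might look as follows. The function name fisher_direction is hypothetical, and the pseudo-inverse is used as a hedge against a singular within-class scatter matrix, a choice not discussed in the text.

```python
import numpy as np

def fisher_direction(X, y):
    """Return W* proportional to (mu1 - mu0)(p0*Sigma0 + p1*Sigma1)^{-1}, per Eq. 10.6.

    X : (n, d) feature matrix
    y : (n,) binary labels in {0, 1}
    """
    X0, X1 = X[y == 0], X[y == 1]
    p0, p1 = len(X0) / len(X), len(X1) / len(X)      # fractional class presences
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)      # class mean row vectors
    Sigma0 = np.cov(X0, rowvar=False, bias=True)     # class covariance matrices
    Sigma1 = np.cov(X1, rowvar=False, bias=True)
    Sw = p0 * Sigma0 + p1 * Sigma1                   # (scaled) within-class scatter
    W = (mu1 - mu0) @ np.linalg.pinv(Sw)             # Eq. 10.6, up to scaling
    return W / np.linalg.norm(W)                     # normalize to unit length
```

Because the Fisher score is invariant to a rescaling of W, normalizing the returned direction loses no information.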
If desired, successive orthogonal directions may be determined by iteratively projecting the data onto the subspace orthogonal to the optimal directions found so far, and determining Fisher’s discriminant in this reduced subspace. The final result is a new representation of lower dimensionality that is more discriminative than the original feature space. Interestingly, the matrix Sw + p0p1Sb can be shown to be invariant to the values of the class labels of the data points (see Exercise 21), and it is equal to the covariance matrix of the data. Therefore, the top-k eigenvectors of Sw + p0p1Sb yield the basis vectors of PCA.
This approach is often used as a stand-alone classifier, which is referred to as linear discriminant analysis. A hyperplane W* · X + b = 0 that is perpendicular to the most discriminating direction is used as a binary class separator. The optimal value of b is selected based on the accuracy with respect to the training data. This approach can also be viewed as projecting the training points along the most discriminating vector W*, and then selecting the value of b to decide the point on the line that best separates the two classes. Fisher’s discriminant for binary classes can be shown to be a special case of least-squares regression for numeric classes, in which the response variables are set to −1/p0 and +1/p1, respectively, for the two classes (cf. Sect. 11.5.1.1 of Chap. 11).
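The selection of b by training accuracy can be sketched as a simple scan over candidate thresholds along the projected line. The name fit_threshold is hypothetical, and scanning midpoints between consecutive projections is only one reasonable search strategy among many.

```python
import numpy as np

def fit_threshold(X, y, W):
    """Choose b so that sign(W·X + b) best separates training labels y in {0, 1}."""
    proj = X @ W                                  # project points onto the direction W*
    s = np.sort(proj)
    best_b, best_acc = 0.0, -1.0
    for t in (s[:-1] + s[1:]) / 2:                # midpoints between consecutive projections
        pred = (proj - t > 0).astype(int)         # corresponds to choosing b = -t
        acc = np.mean(pred == y)
        if acc > best_acc:
            best_b, best_acc = -t, acc
    return best_b

# Usage (assuming fisher_direction from the previous sketch):
#   W = fisher_direction(X, y)
#   b = fit_threshold(X, y, W)
#   predictions = (X @ W + b > 0).astype(int)
```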
¹The unscaled versions of the two scatter matrices are np0p1Sb and nSw, respectively. The sum of these two matrices is the total scatter matrix, which is n times the covariance matrix (see Exercise 21).
²Maximizing FS(W) = (W Sb W^T)/(W Sw W^T) is the same as maximizing W Sb W^T subject to W Sw W^T = 1. Setting the gradient of the Lagrangian relaxation W Sb W^T − λ(W Sw W^T − 1) to 0 yields the generalized eigenvector condition Sb W^T = λ Sw W^T. Because Sb W^T = (μ1^T − μ0^T)(μ1 − μ0)W^T always points in the direction of (μ1^T − μ0^T), it follows that Sw W^T ∝ μ1^T − μ0^T. Therefore, we have W ∝ (μ1 − μ0)Sw^{−1}.
10.2.2 Wrapper Models
Different classification models are more accurate with different sets of features. Filter models are agnostic to the particular classification algorithm being used. In some cases, it may be useful to leverage the characteristics of the specific classification algorithm to select features. As you will learn later in this chapter, a linear classifier may work more effectively with a set of features where the classes are best modeled with linear separators, whereas a distance-based classifier works well with features in which distances reflect class distributions.
Therefore, one of the inputs to wrapper-based feature selection is a specific classification induction algorithm, denoted by A. Wrapper models can optimize the feature selection process to the classification algorithm at hand. The basic strategy in wrapper models is to iteratively refine a current set of features F by successively adding features to it. The algorithm starts by initializing the current feature set F to {}. The strategy may be summarized by the following two steps that are executed iteratively:
1. Create an augmented set of features F by adding one or more features to the current feature set.
2. Use a classification algorithm A to evaluate the accuracy of the set of features F. Use the accuracy to either accept or reject the augmentation of F.
The augmentation of F can be performed in many different ways. For example, a greedy strategy may be used where the set of features in the previous iteration is augmented with an additional feature with the greatest discriminative power with respect to a filter criterion. Alternatively, features may be selected for addition via random sampling. The accuracy of the classification algorithm A in the second step may be used to determine whether the newly augmented set of features should be accepted, or one should revert to the set of features in the previous iteration. This approach is continued until there is no improvement in the current feature set for a minimum number of iterations. Because the classification algorithm A is used in the second step for evaluation, the final set of identified features will be sensitive to the choice of the algorithm A.
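One possible rendering of this wrapper loop, assuming random sampling for the augmentation step, is sketched below. Here evaluate_accuracy is a placeholder for training and validating the chosen algorithm A on a feature subset, and the patience-based stopping rule is an illustrative interpretation of "no improvement for a minimum number of iterations."

```python
import random

def wrapper_select(features, evaluate_accuracy, patience=5, seed=0):
    """Wrapper feature selection with randomly sampled augmentations.

    features          : list of candidate feature names (or column indices)
    evaluate_accuracy : callable mapping a feature subset to the validation
                        accuracy of classification algorithm A on that subset
    """
    rng = random.Random(seed)
    current = []            # current feature set F, initialized to {}
    best_acc = 0.0
    stall = 0
    while stall < patience:
        remaining = [f for f in features if f not in current]
        if not remaining:
            break
        candidate = rng.choice(remaining)          # sample a feature to add
        acc = evaluate_accuracy(current + [candidate])
        if acc > best_acc:                         # accept the augmented set
            current.append(candidate)
            best_acc, stall = acc, 0
        else:                                      # reject it and revert to F
            stall += 1
    return current
```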
10.2.3 Embedded Models
The core idea in embedded models is that the solutions to many classification formulations provide important hints about the most relevant features to be used. In other words, knowledge about the features is embedded within the solution to the classification problem. For example, consider a linear classifier that maps a training instance Xi to a class label yi in {−1, 1} using the following linear relationship:
$$
y_i = \text{sign}\{\,W \cdot X_i + b\,\}. \tag{10.7}
$$
Here, W = (w1, ..., wd) is a d-dimensional vector of coefficients, and b is a scalar that is learned from the training data. The function “sign” maps to either −1 or +1, depending on the sign of its argument. As we will see later, many linear models such as Fisher’s discriminant, support vector machine (SVM) classifiers, logistic regression methods, and neural networks use this model.
Assume that all features have been normalized to unit variance. If the value of |wi| is relatively³ small, the ith feature is used very weakly by the model and is more likely to be noninformative. Therefore, such dimensions may be removed.
³Certain variations of linear models, such as L1-regularized SVMs or Lasso (cf. Sect. 11.5.1 of Chap. 11), are particularly effective in this context. Such methods are also referred to as sparse learning methods.
Table 10.1: Training data snapshot relating the salary and age features to charitable donation propensity

Name     Age   Salary   Donor?
------   ---   ------   ------
Nancy    21    37,000   N
Jim      27    41,000   N
Allen    43    61,000   Y
Jane     38    55,000   N
Steve    44    30,000   N
Peter    51    56,000   Y
Sayani   53    70,000   Y
Lata     56    74,000   Y
Mary     59    25,000   N
Victor   61    68,000   Y
Dale     63    51,000   Y
It is then possible to train the same (or a different) classifier on the data with the pruned feature set. If desired, statistical tests may be used to decide when the value of |wi| should be considered sufficiently small. Many decision tree classifiers, such as ID3, also have feature selection methods embedded in them.
In recursive feature elimination, an iterative approach is used. A small number of features are removed in each iteration. Then, the classifier is retrained on the pruned set of features to re-estimate the weights. The re-estimated weights are used to again prune the features with the least absolute weight. This procedure is repeated until all remaining features are deemed to be sufficiently relevant. Embedded models are generally designed in an ad hoc way, depending on the classifier at hand.
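A minimal sketch of recursive feature elimination along these lines is shown below. Here train_linear_model is a placeholder for fitting whichever linear classifier is in use (assumed to return the coefficient vector W), and the number of features kept and dropped per iteration are arbitrary illustrative settings.

```python
import numpy as np

def recursive_feature_elimination(X, y, train_linear_model,
                                  n_keep=5, drop_per_iter=1):
    """Iteratively drop the features with the smallest |w_i|.

    X                  : (n, d) feature matrix with unit-variance columns
    y                  : (n,) class labels
    train_linear_model : callable (X, y) -> coefficient vector W of length d
    Returns the indices of the surviving features.
    """
    active = np.arange(X.shape[1])                 # indices of features still in play
    while len(active) > n_keep:
        W = train_linear_model(X[:, active], y)    # re-estimate weights on pruned data
        order = np.argsort(np.abs(W))              # weakest coefficients first
        drop = order[:min(drop_per_iter, len(active) - n_keep)]
        active = np.delete(active, drop)           # prune the least relevant features
    return active
```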
10.3 Decision Trees
Decision trees are a classification methodology, wherein the classification process is modeled with the use of a set of hierarchical decisions on the feature variables, arranged in a tree-like structure. The decision at a particular node of the tree, which is referred to as the split criterion, is typically a condition on one or more feature variables in the training data. The split criterion divides the training data into two or more parts. For example, consider the case where Age is an attribute, and the split criterion is Age ≤ 30. In this case, the left branch of the decision tree contains all training examples with age at most 30, whereas the right branch contains all examples with age greater than 30. The goal is to identify a split criterion so that the level of “mixing” of the class variables in each branch of the tree is reduced as much as possible. Each node in the decision tree logically represents a subset of the data space defined by the combination of split criteria in the nodes above it. The decision tree is typically constructed as a hierarchical partitioning of the training examples, just as a top-down clustering algorithm partitions the data hierarchically. The main difference from clustering is that the partitioning criterion in the decision tree is supervised with the class label in the training instances. Some classical decision tree algorithms include C4.5, ID3, and CART. To illustrate the basic idea of decision tree construction, an example will be used.
In Table 10.1, a snapshot of a hypothetical charitable donation data set has been illustrated. The two feature variables represent the age and salary attributes. Both attributes
are related to the donation propensity, which is also the class label. Specifically, the likelihood of an individual to donate is positively correlated with his or her age and salary. However, the best separation of the classes may be achieved only by combining the two attributes. The goal in the decision tree construction process is to perform a sequence of splits in top-down fashion to create nodes at the leaf level in which the donors and nondonors are separated well. One way of achieving this goal is depicted in Fig. 10.3a. The figure illustrates a hierarchical arrangement of the training examples in a treelike structure. The first-level split uses the age attribute, whereas the second-level split for both branches uses the salary attribute. Note that different splits at the same decision tree level need not be on the same attribute. Furthermore, the decision tree of Fig. 10.3a has two branches at each node, but this need not always be the case. In this case, the training examples in all leaf nodes belong to the same class, and, therefore, there is no point in growing the decision tree beyond the leaf nodes. The splits shown in Fig. 10.3a are referred to as univariate splits because they use a single attribute. To classify a test instance, a single relevant path in the tree is traversed in top-down fashion by using the split criteria to decide which branch to follow at each node of the tree. The dominant class label in the leaf node is reported as the relevant class. For example, a test instance with age less than 50 and salary less than 60,000 will traverse the leftmost path of the tree in Fig. 10.3a. Because the leaf node of this path contains only nondonor training examples, the test instance will also be classified as a nondonor.
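For illustration only, this traversal can be written as nested conditions. The thresholds below are hypothetical values that are consistent with Table 10.1 and the description of Fig. 10.3a, not values taken from the figure itself.

```python
def classify_donor(age, salary):
    """Traverse a univariate decision tree in the spirit of Fig. 10.3a (illustrative thresholds)."""
    if age <= 50:                       # first-level split on Age
        if salary <= 60000:             # second-level split on Salary (left branch)
            return "Non-donor"
        return "Donor"
    else:
        if salary <= 50000:             # assumed Salary threshold on the right branch
            return "Non-donor"
        return "Donor"

print(classify_donor(age=40, salary=55000))   # follows the leftmost path -> "Non-donor"
```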
Multivariate splits use more than one attribute in the split criteria. An example is illustrated in Fig. 10.3b. In this particular case, a single split leads to full separation of the classes. This suggests that multivariate criteria are more powerful because they lead to shallower trees. For the same level of class separation in the training data, shallower trees are generally more desirable because the leaf nodes contain more examples and, therefore, are statistically less likely to overfit the noise in the training data.
A decision tree induction algorithm has two types of nodes, referred to as the internal nodes and leaf nodes. Each leaf node is labeled with the dominant class at that node. A special internal node is the root node that corresponds to the entire feature space. The generic decision tree induction algorithm starts with the full training data set at the root node and recursively partitions the data into lower-level nodes based on the split criterion. Only nodes that contain a mixture of different classes need to be split further. Eventually, the decision tree algorithm stops the growth of the tree based on a stopping criterion. The simplest stopping criterion is one where all training examples in the leaf belong to the same class. One problem is that the construction of the decision tree to this level may lead to overfitting, in which the model fits the noisy nuances of the training data. Such a tree will not generalize to unseen test instances very well. To avoid the degradation in accuracy associated with overfitting, the classifier uses a postpruning mechanism for removing overfitting nodes. The generic decision tree training algorithm is illustrated in Fig. 10.4.
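The generic induction procedure can be sketched recursively as follows. This is a schematic outline rather than the algorithm of Fig. 10.4: find_best_split stands in for whichever split criterion of Sect. 10.3.1 is used, the dictionary-based tree representation is an arbitrary choice, and postpruning is omitted.

```python
from collections import Counter

def grow_tree(examples, find_best_split, min_size=2):
    """Recursively partition (feature_dict, label) examples into a nested dict tree."""
    labels = [label for _, label in examples]
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping criterion: the node is pure or too small to split further.
    if len(set(labels)) == 1 or len(examples) < min_size:
        return {"leaf": True, "label": majority}
    attribute, threshold = find_best_split(examples)   # e.g. ("Age", 50) for the test Age <= 50
    left  = [e for e in examples if e[0][attribute] <= threshold]
    right = [e for e in examples if e[0][attribute] >  threshold]
    if not left or not right:                          # the split failed to separate anything
        return {"leaf": True, "label": majority}
    return {"leaf": False, "split": (attribute, threshold),
            "left":  grow_tree(left,  find_best_split, min_size),
            "right": grow_tree(right, find_best_split, min_size)}
```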
After a decision tree has been constructed, it is used for classification of unseen test instances with the use of top-down traversal from the root to a unique leaf. The split condition at each internal node is used to select the correct branch of the decision tree for further traversal. The label of the leaf node that is reached is reported for the test instance.
10.3.1 Split Criteria
The goal of the split criterion is to maximize the separation of the different classes among the children nodes. In the following, only univariate criteria will be discussed. Assume that