
2. Wrapper models: It is assumed that a classification algorithm is available to evaluate how well it performs with a particular subset of features. A feature search algorithm is then wrapped around this classification algorithm to determine the relevant set of features; a rough sketch of such a search appears after this list.




  3. Embedded models: The solution to a classification model often contains useful hints about the most relevant features. Such features are isolated, and the classifier is retrained on the pruned set of features.

In the following discussion, each of these models will be explained in detail.
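
As an illustration of the wrapper idea, here is a minimal Python sketch, not taken from the book, of greedy forward selection wrapped around a user-supplied evaluation routine; the function name, the `evaluate` callback, and the stopping rule are illustrative assumptions.

```python
def forward_selection(features, evaluate, max_features=None):
    """Greedy wrapper search: repeatedly add the feature whose inclusion
    most improves the score returned by `evaluate`, a user-supplied function
    that trains and validates a classifier on a given feature subset."""
    selected = []
    remaining = list(features)
    best_score = float("-inf")
    while remaining and (max_features is None or len(selected) < max_features):
        # Score every candidate extension of the current subset.
        candidate_scores = {f: evaluate(selected + [f]) for f in remaining}
        best_feature = max(candidate_scores, key=candidate_scores.get)
        if candidate_scores[best_feature] <= best_score:
            break  # no remaining feature improves the wrapped classifier
        best_score = candidate_scores[best_feature]
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected, best_score
```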


10.2.1 Filter Models


In filter models, a feature or a subset of features is evaluated with the use of a class-sensitive discriminative criterion. The advantage of evaluating a group of features at one time is that redundancies are well accounted for. Consider the case where two feature variables are perfectly correlated with one another, so that each can be predicted using the other. In such a case, it makes sense to use only one of these features because the other adds no incremental knowledge with respect to the first. However, such methods are often expensive because there are 2^d possible subsets of features over which a search may need to be performed. Therefore, in practice, most feature selection methods evaluate the features independently of one another and select the most discriminative ones.


Some feature selection methods, such as linear discriminant analysis, create a linear combination of the original features as a new set of features. Such analytical methods can be viewed either as stand-alone classifiers or as dimensionality reduction methods that are used before classification, depending on how they are used. These methods will also be discussed in this section.


10.2.1.1 Gini Index


The Gini index is commonly used to measure the discriminative power of a particular feature. Typically, it is used for categorical variables, but it can be generalized to numeric attributes by the process of discretization. Let v_1 . . . v_r be the r possible values of a particular categorical attribute, and let p_j be the fraction of data points containing attribute value v_i that belong to class j ∈ {1 . . . k}. Then, the Gini index G(v_i) for the value v_i of a categorical attribute is defined as follows:





G(v_i) = 1 − \sum_{j=1}^{k} p_j^2. \qquad (10.1)







When the different classes are distributed evenly for a particular attribute value, the value of the Gini index is 1 − 1/k. On the other hand, if all data points for an attribute value v_i belong to the same class, then the Gini index is 0. Therefore, lower values of the Gini index imply greater discrimination.








[Figure 10.1: Variation of two feature selection criteria with class distribution skew. The plot compares the Gini index and the entropy for a two-class problem, with the fraction of the first class on the x-axis and the criterion value on the y-axis.]




An example of the Gini index for a two-class problem with varying values of p_1 is illustrated in Fig. 10.1. Note that the index takes on its maximum value at p_1 = 0.5.

The value-specific Gini index is converted into an attribute-wise Gini index. Let n_i be the number of data points that take on the value v_i for the attribute. Then, for a data set containing \sum_{i=1}^{r} n_i = n data points, the overall Gini index G for the attribute is defined as the weighted average over the different attribute values as follows:

G = \sum_{i=1}^{r} n_i G(v_i)/n. \qquad (10.2)

Lower values of the Gini index imply greater discriminative power. The Gini index is typically defined for a particular feature rather than a subset of features.
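
As an illustration of Equations 10.1 and 10.2, the following is a minimal Python sketch, not taken from the book, of the attribute-wise Gini computation for one categorical attribute; the function name and the plain-list data layout are illustrative assumptions.

```python
from collections import Counter, defaultdict

def gini_index(attribute_values, class_labels):
    """Attribute-wise Gini index of Eq. 10.2, built from the value-specific
    index of Eq. 10.1. Lower values indicate greater discriminative power."""
    n = len(attribute_values)
    # Group the class labels by the attribute value v_i they co-occur with.
    groups = defaultdict(list)
    for v, c in zip(attribute_values, class_labels):
        groups[v].append(c)

    weighted_gini = 0.0
    for labels in groups.values():
        n_i = len(labels)
        counts = Counter(labels)
        # G(v_i) = 1 - sum_j p_j^2   (Eq. 10.1)
        g_vi = 1.0 - sum((cnt / n_i) ** 2 for cnt in counts.values())
        # Weight each attribute value by its frequency n_i / n   (Eq. 10.2)
        weighted_gini += n_i * g_vi / n
    return weighted_gini

# A perfectly discriminative attribute has an overall Gini index of 0.
print(gini_index(["a", "a", "b", "b"], [0, 0, 1, 1]))  # 0.0
```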


10.2.1.2 Entropy


The class-based entropy measure is related to notions of information gain resulting from fixing a specific attribute value. The entropy measure achieves a goal similar to that of the Gini index at an intuitive level, but it is based on sound information-theoretic principles. As before, let p_j be the fraction of data points belonging to class j for attribute value v_i. Then, the class-based entropy E(v_i) for the attribute value v_i is defined as follows:





E(v_i) = −\sum_{j=1}^{k} p_j \log_2(p_j). \qquad (10.3)




The class-based entropy value lies in the interval [0, \log_2(k)]. Higher values of the entropy imply greater “mixing” of different classes. A value of 0 implies perfect separation and, therefore, the largest possible discriminative power. An example of the entropy for a two-class problem with varying values of the probability p_1 is illustrated in Fig. 10.1. As in the case of the Gini index, the overall entropy E of an attribute is defined as the weighted




average over the r different attribute values:





E = \sum_{i=1}^{r} n_i E(v_i)/n. \qquad (10.4)




Here, n_i is the frequency of attribute value v_i.
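
A corresponding sketch for Equations 10.3 and 10.4 follows; as with the Gini example above, the function name and data layout are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def class_entropy(attribute_values, class_labels):
    """Weighted class-based entropy of Eq. 10.4 for one categorical
    attribute. Lower values indicate better class separation."""
    n = len(attribute_values)
    groups = defaultdict(list)
    for v, c in zip(attribute_values, class_labels):
        groups[v].append(c)

    weighted_entropy = 0.0
    for labels in groups.values():
        n_i = len(labels)
        counts = Counter(labels)
        # E(v_i) = -sum_j p_j log2(p_j)   (Eq. 10.3); 0 log 0 is taken as 0.
        e_vi = -sum((cnt / n_i) * math.log2(cnt / n_i)
                    for cnt in counts.values() if cnt > 0)
        weighted_entropy += n_i * e_vi / n
    return weighted_entropy

# Evenly mixed classes within every value give the maximum entropy (1 for k = 2).
print(class_entropy(["a", "a", "b", "b"], [0, 1, 0, 1]))  # 1.0
```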


10.2.1.3 Fisher Score


The Fisher score is naturally designed for numeric attributes to measure the ratio of the average interclass separation to the average intraclass separation. The larger the Fisher score, the greater the discriminatory power of the attribute. Let μ_j and σ_j, respectively, be the mean and standard deviation of data points belonging to class j for a particular feature, and let p_j be the fraction of data points belonging to class j. Let μ be the global mean of the data on the feature being evaluated. Then, the Fisher score F for that feature may be defined as the ratio of the interclass separation to the intraclass separation:








F = \frac{\sum_{j=1}^{k} p_j (\mu_j − \mu)^2}{\sum_{j=1}^{k} p_j \sigma_j^2} \qquad (10.5)
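
The following is a minimal Python sketch of this ratio for a single numeric feature, based on the definitions above; the function name and the use of the population (rather than sample) variance within each class are assumptions.

```python
from collections import defaultdict

def fisher_score(feature_values, class_labels):
    """Ratio of interclass separation to intraclass separation for one
    numeric feature. Larger scores indicate greater discriminatory power."""
    n = len(feature_values)
    mu = sum(feature_values) / n  # global mean of the feature

    groups = defaultdict(list)
    for x, c in zip(feature_values, class_labels):
        groups[c].append(x)

    between = 0.0  # sum_j p_j (mu_j - mu)^2
    within = 0.0   # sum_j p_j sigma_j^2
    for xs in groups.values():
        p_j = len(xs) / n
        mu_j = sum(xs) / len(xs)
        var_j = sum((x - mu_j) ** 2 for x in xs) / len(xs)
        between += p_j * (mu_j - mu) ** 2
        within += p_j * var_j
    return between / within if within > 0 else float("inf")

# Well-separated class means with small within-class spread give a large score.
print(fisher_score([1.0, 1.1, 5.0, 5.1], [0, 0, 1, 1]))  # ≈ 1600.0
```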