a systematic way to search for the appropriate combination of features, in addition to quantifying the distance-based entropy. So how can the distance-based entropy be quantified on a particular subset of attributes?
A natural way of quantifying the entropy is to directly use the probability distribution on the data points and quantify the entropy using these values. Consider a k-dimensional subset of features. The first step is to discretize the data into a set of multidimensional grid regions using φ grid ranges for each dimension. This results in m = φ^k grid regions that are indexed from 1 through m. The value of m is kept approximately the same across all the evaluated feature subsets by selecting φ = m^{1/k}. If p_i is the fraction of data points in grid region i, then the probability-based entropy E is defined as follows:
E = -\sum_{i=1}^{m} \left[ p_i \log(p_i) + (1 - p_i) \log(1 - p_i) \right]        (6.2)
A uniform distribution with poor clustering behavior has high entropy, whereas clustered data has lower entropy. Therefore, the entropy measure provides feedback about the clustering quality of a subset of features.
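As a concrete illustration, the following Python sketch discretizes a candidate feature subset into φ equal-width grid ranges per dimension and evaluates Eq. 6.2. The target cell count m_target, the equal-width binning, and the function name are illustrative choices rather than prescribed by the method, and the approach is practical only when φ^k remains modest.

import numpy as np

def grid_entropy(X, feature_subset, m_target=2500):
    # Probability-based entropy of Eq. 6.2 on a k-dimensional feature subset.
    # m_target is an illustrative choice for the total number of grid regions m.
    S = np.asarray(X, dtype=float)[:, feature_subset]
    n, k = S.shape
    # phi = m^(1/k), rounded to an integer number of grid ranges per dimension.
    phi = max(2, int(round(m_target ** (1.0 / k))))

    # Discretize each dimension into phi equal-width ranges and flatten the
    # k-dimensional grid index into a single cell identifier per point.
    cell_ids = np.zeros(n, dtype=np.int64)
    for j in range(k):
        col = S[:, j]
        edges = np.linspace(col.min(), col.max(), phi + 1)
        idx = np.digitize(col, edges[1:-1])       # bin index in 0 .. phi-1
        cell_ids = cell_ids * phi + idx

    # p_i = fraction of data points in grid region i (empty regions contribute 0).
    counts = np.bincount(cell_ids, minlength=phi ** k)
    p = counts / float(n)

    # E = -sum_i [ p_i log p_i + (1 - p_i) log(1 - p_i) ], using 0 log 0 = 0.
    terms = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0) \
          + np.where(p < 1, (1 - p) * np.log(np.where(p < 1, 1 - p, 1.0)), 0.0)
    return -terms.sum()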
Although the aforementioned quantification can be used directly, the probability density p_i of grid region i is sometimes hard to estimate accurately from high-dimensional data. This is because the grid regions are multidimensional, and they become increasingly sparse with increasing dimensionality. It is also hard to fix the number of grid regions m over feature subsets of varying dimensionality k because the value of φ = m^{1/k} must be rounded up to an integer value. Therefore, an alternative is to compute the entropy on the 1-dimensional point-to-point distance distribution on a sample of the data. This is the same as the distributions shown in Fig. 6.1. The value of p_i then represents the fraction of distances in the i-th 1-dimensional discretized range. Although this approach does not fully address the challenges of high dimensionality, it is usually a better option for data of modest dimensionality. For example, if the entropy is computed on the histograms in Figs. 6.1c and d, then it will distinguish well between the two distributions. A heuristic approximation on the basis of the raw distances is also often used; refer to the bibliographic notes.
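The distance-based alternative can be sketched in a similar way: sample point pairs, build a 1-dimensional histogram of their distances on the candidate feature subset, and apply the same entropy formula to the histogram. In the sketch below, the number of sampled pairs, the number of bins, and the random seed are illustrative parameters.

import numpy as np

def distance_entropy(X, feature_subset, n_pairs=5000, n_bins=50, seed=0):
    # Entropy of the 1-dimensional point-to-point distance distribution on a
    # sample of point pairs; n_pairs, n_bins, and seed are illustrative choices.
    rng = np.random.default_rng(seed)
    S = np.asarray(X, dtype=float)[:, feature_subset]
    n = S.shape[0]
    i = rng.integers(0, n, n_pairs)
    j = rng.integers(0, n, n_pairs)
    mask = i != j                                 # discard self-pairs
    dists = np.linalg.norm(S[i[mask]] - S[j[mask]], axis=1)

    # p_i = fraction of sampled distances falling in the i-th discretized range.
    counts, _ = np.histogram(dists, bins=n_bins)
    p = counts / counts.sum()

    # Same entropy formula as Eq. 6.2, applied to the distance histogram.
    terms = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0) \
          + np.where(p < 1, (1 - p) * np.log(np.where(p < 1, 1 - p, 1.0)), 0.0)
    return -terms.sum()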
To determine the subset of features for which the entropy E is minimized, a variety of search strategies can be used. For example, starting from the full set of features, a simple greedy approach repeatedly drops the feature whose removal leads to the greatest reduction in the entropy. Features are dropped greedily until the incremental reduction is no longer significant, or the entropy increases. Some enhancements of this basic approach, both in terms of the quantification measure and the search strategy, are discussed in the bibliographic notes.
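A minimal sketch of this greedy backward search is shown below. It accepts either of the entropy quantifications sketched above, and the stopping threshold min_gain is an illustrative way of deciding when the incremental reduction is no longer significant.

def greedy_feature_elimination(X, entropy_fn, min_gain=1e-3):
    # Greedy backward search: repeatedly drop the feature whose removal gives
    # the largest entropy reduction; stop when the reduction is insignificant
    # (min_gain is an illustrative threshold) or the entropy increases.
    subset = list(range(X.shape[1]))
    current = entropy_fn(X, subset)
    while len(subset) > 1:
        candidates = [(entropy_fn(X, [f for f in subset if f != drop]), drop)
                      for drop in subset]
        best_entropy, best_drop = min(candidates)
        if current - best_entropy < min_gain:     # no significant reduction
            break
        subset.remove(best_drop)
        current = best_entropy
    return subset, current

# Example usage with the distance-based entropy sketched above:
# selected, E = greedy_feature_elimination(X, distance_entropy)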
6.2.1.4 Hopkins Statistic
The Hopkins statistic is often used to measure the clustering tendency of a data set, although it can also be applied to a particular subset of attributes. The resulting measure can then be used in conjunction with a feature search algorithm, such as the greedy method discussed in the previous subsection.
Let D be the data set whose clustering tendency needs to be evaluated. A sample S of r synthetic data points is randomly generated in the domain of the data space. At the same time, a sample R of r data points is selected from D. Let α_1 ... α_r be the distances of the data points in the sample R ⊆ D to their nearest neighbors within the original database D. Similarly, let β_1 ... β_r be the distances of the data points in the synthetic sample S to their
nearest neighbors within D. Then, the Hopkins statistic H is defined as follows:
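H = \frac{\sum_{i=1}^{r} \beta_i}{\sum_{i=1}^{r} (\alpha_i + \beta_i)}

For clustered data, the values of α_i are typically much smaller than the values of β_i, and therefore H is close to 1; for uniformly distributed data, H is close to 0.5. The following sketch computes the statistic under illustrative assumptions: the sample size r, the random seed, and the use of the bounding box of D as the domain for synthetic point generation.

import numpy as np

def hopkins_statistic(D, r=50, seed=0):
    # Hopkins statistic H = sum(beta_i) / (sum(alpha_i) + sum(beta_i));
    # r and the bounding-box sampling domain are illustrative choices.
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    n, d = D.shape

    # Sample R of r data points from D, and a synthetic sample S of r points
    # generated uniformly at random in the domain of the data space.
    idx = rng.choice(n, size=r, replace=False)
    R = D[idx]
    lo, hi = D.min(axis=0), D.max(axis=0)
    S = rng.uniform(lo, hi, size=(r, d))

    # alpha_i: distance of each sampled data point to its nearest neighbor
    # within D, excluding the point itself.
    d_R = np.linalg.norm(R[:, None, :] - D[None, :, :], axis=2)
    d_R[np.arange(r), idx] = np.inf
    alpha = d_R.min(axis=1)

    # beta_i: distance of each synthetic point to its nearest neighbor in D.
    d_S = np.linalg.norm(S[:, None, :] - D[None, :, :], axis=2)
    beta = d_S.min(axis=1)

    return beta.sum() / (alpha.sum() + beta.sum())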