$$f(X) = \frac{1}{n} \cdot \sum_{i=1}^{n} K_h(X - X_i). \qquad (8.14)$$
Thus, each discrete point X_i in the data set is replaced by a continuous function K_h(·) that peaks at X_i and has a variance determined by the smoothing parameter h. An example of such a distribution is the Gaussian kernel with width h:
$$K_h(X - X_i) = \left(\frac{1}{\sqrt{2\pi} \cdot h}\right)^d \cdot e^{-\|X - X_i\|^2/(2h^2)}. \qquad (8.15)$$
The estimation error is determined by the kernel width h, which is chosen in a data-driven manner. It has been shown that for most smooth functions K_h(·), when the number of data points goes to infinity, the estimate asymptotically converges to the true density value, provided that the width h is chosen appropriately. The density at each data point is computed without including the point itself in the density computation, and the resulting value is reported as the outlier score. Low values of the density indicate a greater tendency to be an outlier.
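To make this concrete, the following is a minimal sketch of leave-one-out kernel density scoring using the Gaussian kernel of Eqs. 8.14 and 8.15. The function name and the brute-force pairwise distance computation are illustrative choices, not part of the text:

```python
import numpy as np

def kde_outlier_scores(X, h):
    """Leave-one-out Gaussian kernel density at each point of X (n x d).
    Lower values indicate a greater tendency to be an outlier."""
    n, d = X.shape
    # Pairwise squared Euclidean distances ||X_i - X_j||^2.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Gaussian kernel values per Eq. 8.15.
    K = (1.0 / (np.sqrt(2.0 * np.pi) * h)) ** d \
        * np.exp(-sq_dists / (2.0 * h ** 2))
    np.fill_diagonal(K, 0.0)        # exclude the point itself
    return K.sum(axis=1) / (n - 1)  # average over the remaining n-1 points
```

For example, `kde_outlier_scores(np.random.randn(100, 2), h=0.5)` returns one density value per point; the points with the smallest values are the strongest outlier candidates.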
Density-based methods face challenges similar to those of histogram- and grid-based techniques. In particular, the use of a global kernel width h to estimate density may not work very well in cases where there are wide variations in local density, such as those in Figs. 8.7 and 8.8. This is because of the myopic nature of density-based methods, in which the variations in the density distribution are not well accounted for. Nevertheless, kernel-density-based methods generalize better to data with local variations, especially if the bandwidth is chosen locally. As in the case of grid-based methods, these techniques are not very effective in higher dimensionality, because the accuracy of the density estimation approach degrades with increasing dimensionality.
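As one illustration of a locally chosen bandwidth, a common heuristic (an assumption here, not prescribed by the text) sets each point's bandwidth proportional to the distance to its k-th nearest neighbor, so that sparse regions receive wider kernels:

```python
import numpy as np

def local_bandwidths(X, k=10, scale=1.0):
    """Per-point bandwidth h_i proportional to the distance to the
    k-th nearest neighbor of point i (excluding the point itself)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq, np.inf)  # a point is not its own neighbor
    knn_dist = np.sqrt(np.partition(sq, k - 1, axis=1)[:, k - 1])
    return scale * knn_dist
```

Each h_i can then replace the global h when evaluating the kernel contribution of point i in Eq. 8.15.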
8.7 Information-Theoretic Models
Outliers are data points that do not naturally fit the remaining data distribution. Therefore, if a data set were to be somehow compressed with the use of the “normal” patterns in the
data distribution, the outliers would increase the minimum code length required to describe it. For example, consider the following two strings:
ABABABABABABABABABABABABABABABABAB
ABABACABABABABABABABABABABABABABAB
The second string is of the same length as the first and differs at only a single position, which contains the unique symbol C. The first string can be described concisely as "AB 17 times." Because of the single anomalous symbol, the second string can no longer be described as concisely. In other words, the presence of the symbol C somewhere in the string increases its minimum description length. It is also easy to see that this symbol corresponds to an outlier. Information-theoretic models are based on this general principle: they measure the increase in model size required to describe the data as concisely as possible.
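This principle can be illustrated programmatically with an off-the-shelf compressor as a crude stand-in for the minimum description length. This is only a heuristic proxy; on strings this short, the compressor's fixed overhead dominates the absolute numbers, so only the relative comparison is meaningful:

```python
import zlib

def description_length(s: str) -> int:
    """Compressed size in bytes -- a rough proxy for description length."""
    return len(zlib.compress(s.encode("ascii"), 9))

s1 = "AB" * 17              # describable as "AB 17 times"
s2 = s1[:5] + "C" + s1[6:]  # the same string with one anomalous symbol
# The anomalous symbol should not decrease the compressed size.
print(description_length(s2) >= description_length(s1))  # expected: True
```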
Information-theoretic models can be viewed as almost equivalent to conventional deviation-based models, except that the outlier score is defined by the model size for a fixed deviation, rather than the deviation for a fixed model. In conventional models, outliers are always defined on the basis of a "summary" model of normal patterns. When a data point deviates significantly from the estimations of the summary model, then this deviation value is reported as the outlier score. Clearly, a trade-off exists between the size of the summary model and the level of deviation. For example, if a clustering model is used, then a larger number of cluster centroids (model size) will result in lowering the maximum deviation of any data point (including the outlier) from its nearest centroid. Therefore, in conventional models, the same clustering is used to compute deviation values (scores) for the different data points. A slightly different way of computing the outlier score is to fix the maximum allowed deviation (instead of the number of cluster centroids) and compute the number of cluster centroids required to achieve the same level of deviation, with and without a particular data point. It is this increase that is reported as the outlier score in the information-theoretic version of the same model. The idea here is that each point can be estimated by its closest cluster centroid, and the cluster centroids serve as a "code-book" in terms of which the data is compressed in a lossy way.
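A minimal sketch of this fixed-deviation variant follows, assuming a k-means model from scikit-learn; the helper names, the linear search over k, and the cap `k_max` are illustrative assumptions, not from the text:

```python
import numpy as np
from sklearn.cluster import KMeans

def clusters_needed(X, max_dev, k_max=25):
    """Smallest k for which every point lies within max_dev of its
    nearest k-means centroid (the model size for a fixed deviation)."""
    for k in range(1, min(k_max, len(X)) + 1):
        centers = KMeans(n_clusters=k, n_init=10,
                         random_state=0).fit(X).cluster_centers_
        dev = np.min(np.linalg.norm(X[:, None] - centers[None], axis=-1), axis=1)
        if dev.max() <= max_dev:
            return k
    return k_max

def info_theoretic_score(X, i, max_dev):
    """Increase in required model size caused by including point i."""
    return clusters_needed(X, max_dev) \
        - clusters_needed(np.delete(X, i, axis=0), max_dev)
```

An isolated outlier typically forces an extra centroid of its own, yielding a score of 1 or more, whereas a point inside a dense cluster yields a score of 0.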
Information-theoretic models can be viewed as a complementary version of conventional models in which a different aspect of the space-deviation trade-off curve is examined. Virtually every conventional model can be converted into an information-theoretic version by examining the bi-criteria space-deviation trade-off in terms of space rather than deviation. The bibliographic notes also provide specific examples of each of the cases below:
• The probabilistic model of Sect. 8.3 models the normal patterns in terms of generative model parameters, such as the mixture means and covariance matrices. The space required by the model is defined by its complexity (e.g., the number of mixture components), and the deviation corresponds to the probabilistic fit. In an information-theoretic version of the model, the complementary approach is to examine the size of the model required to achieve a fixed level of fit.
• A clustering or density-based summarization model describes a data set in terms of cluster descriptions, histograms, or other summarized representations. The granularity of these representations (the number of cluster centroids or histogram bins) controls the space, whereas the error in approximating a data point with a central element of its cluster (bin) defines the deviation. In conventional models, the size of the model (number of bins or clusters) is fixed, whereas in the information-theoretic version, the maximum allowed deviation is fixed, and the required model size is reported as the outlier score.
• A frequent pattern mining model describes the data in terms of an underlying code-book of frequent patterns. The larger the size of the code-book (obtained by using frequent patterns of lower support), the more accurately the data can be described. These models are particularly popular, and some pointers are provided in the bibliographic notes.
All these models represent the data approximately in terms of individual condensed components representing aggregate trends. In general, outliers increase the length of the description in terms of these condensed components needed to achieve the same level of approximation. For example, a data set with outliers will require a larger number of mixture parameters, clusters, or frequent patterns to achieve the same level of approximation. Therefore, in information-theoretic methods, the components of these summary models are loosely referred to as "code books." Outliers are defined as data points whose removal results in the largest decrease in description length for the same error. The actual construction of the coding is often heuristic, and is not very different from the summary models used in conventional outlier analysis. In some cases, the description length for a data set can be estimated without explicitly constructing a code book or building a summary model. Examples include the entropy of a data set and the Kolmogorov complexity of a string. Readers are referred to the bibliographic notes for examples of such methods.
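As a minimal sketch of the entropy-based idea, a categorical record can be scored by how much its removal decreases the total empirical entropy of the data, which serves as an estimate of description length. The function names and the restriction to a single categorical attribute are simplifying assumptions for illustration:

```python
from collections import Counter
from math import log2

def total_entropy_bits(values):
    """Empirical entropy in bits per symbol times the number of records,
    an estimate of the description length of a categorical column."""
    counts = Counter(values)
    n = len(values)
    return -n * sum((c / n) * log2(c / n) for c in counts.values())

def entropy_score(values, i):
    """Increase in estimated description length due to record i."""
    rest = values[:i] + values[i + 1:]
    return total_entropy_bits(values) - total_entropy_bits(rest)
```

A rare symbol contributes disproportionately to the total entropy, so its record receives a high score, mirroring the role of the symbol C in the string example above.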
While information-theoretic models are approximately equivalent to conventional models in that they explore the same trade-off in a slightly different way, they do have an advantage in some cases. These are cases where an accurate summary model of the data is hard to construct explicitly, and measures such as the entropy or the Kolmogorov complexity can be used to estimate the compressed space requirements of the data set indirectly. In such cases, information-theoretic methods can be useful. In cases where the summary models can be explicitly constructed, it is better to use conventional models, because the outlier scores are directly optimized for point-specific deviations rather than the blunter measure of differential space impact. The bibliographic notes provide specific examples of some of the aforementioned methods.
8.8 Outlier Validity
As in the case of clustering models, it is desirable to determine the validity of the outliers reported by a particular algorithm. Although the relationship between clustering and outlier analysis is complementary, the measures for outlier validity cannot easily be designed in a similar complementary way. In fact, validity analysis is much harder in outlier detection than in data clustering. The reasons for this are discussed in the next section.
8.8.1 Methodological Challenges
As in the case of data clustering, outlier analysis is an unsupervised problem. Unsupervised problems are hard to validate because of the lack of external criteria, unless such criteria are synthetically generated or some rare aspects of real data sets are used as proxies. Therefore, a natural question arises as to whether internal criteria can be defined for outlier validation, as is the case for data clustering.
However, internal criteria are rarely used in outlier analysis. While such criteria are well-known to be flawed even in the context of data clustering, these flaws become significant