Data Mining: The Textbook




f(X) = \frac{1}{\sqrt{|\Sigma|} \cdot (2 \cdot \pi)^{d/2}} \cdot e^{-\frac{1}{2} (X - \mu)\, \Sigma^{-1}\, (X - \mu)^{T}} \qquad (8.4)










The value of |Σ| denotes the determinant of the covariance matrix. The term in the exponent is half the squared Mahalanobis distance between the data point X and the mean μ of the data. In other words, if Maha(X, μ, Σ) represents the Mahalanobis distance between X and μ with respect to the covariance matrix Σ, then the probability density function of the normal distribution can be written as follows:













f(X) = \frac{1}{\sqrt{|\Sigma|} \cdot (2 \cdot \pi)^{d/2}} \cdot e^{-\frac{1}{2} \cdot \mathrm{Maha}(X, \mu, \Sigma)^{2}} \qquad (8.5)











For the probability density of a point to fall below a particular threshold, its Mahalanobis distance to the mean must exceed a corresponding threshold. Thus, the Mahalanobis distance to the mean of the data can be used as an extreme-value score. The relevant extreme values are defined by the multidimensional region of the data for which the Mahalanobis distance to the mean is larger than a particular threshold. This region is illustrated in Fig. 8.3b. Therefore, the extreme-value score for a data point can be reported as the Mahalanobis distance between that data point and the mean, with larger values implying more extreme behavior. In some cases, one might want a more intuitive probability measure. Correspondingly, the extreme-value probability of a data point X is defined by the cumulative probability of the multidimensional region for which the Mahalanobis distance to the mean μ of the data is greater than that between X and μ. How can one estimate this cumulative probability?
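As a concrete illustration, the following sketch (a minimal example, not taken from the book; the function name and the use of NumPy are assumptions) computes this Mahalanobis-based extreme-value score for every point of a data matrix:

```python
# Minimal sketch: Mahalanobis distance of every point to the data mean,
# used directly as an extreme-value score (larger = more extreme).
import numpy as np

def mahalanobis_scores(X):
    """X is an (n, d) array with one data point per row."""
    mu = X.mean(axis=0)                      # mean vector of the data
    sigma = np.cov(X, rowvar=False)          # d x d sample covariance matrix
    sigma_inv = np.linalg.inv(sigma)         # assumes a non-singular covariance
    diffs = X - mu                           # deviations from the mean
    # Quadratic form (x - mu) Sigma^{-1} (x - mu)^T for every row, then its square root
    sq_dist = np.einsum('ij,jk,ik->i', diffs, sigma_inv, diffs)
    return np.sqrt(sq_dist)
```

Points whose score exceeds a chosen threshold would then be reported as extreme values.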


As discussed in Chap. 3, the Mahalanobis distance is similar to the Euclidean distance except that it standardizes the data along uncorrelated directions. For example, if the axis system of the data were to be rotated to the principal directions (shown in Fig. 8.3), then the transformed coordinates in this new axis system would have no interattribute correlations (i.e., a diagonal covariance matrix). The Mahalanobis distance is simply equal to the Euclidean distance in such a transformed (axes-rotated) data set after dividing each of the transformed coordinate values by the standard deviation along its direction. This approach provides a neat way to model the probability distribution of the Mahalanobis distance, and it also provides a concrete estimate of the cumulative probability in the multivariate tail.
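A small numerical check of this equivalence (a hypothetical example, assuming NumPy and a synthetic two-dimensional data set) rotates the data to its principal directions with an eigendecomposition of the covariance matrix, standardizes each rotated coordinate, and compares the resulting Euclidean distances with directly computed Mahalanobis distances:

```python
# Sketch: Mahalanobis distance equals Euclidean distance in the rotated,
# standardized (principal-axis) coordinate system.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[4.0, 1.5], [1.5, 1.0]], size=500)

mu = X.mean(axis=0)
sigma = np.cov(X, rowvar=False)

# Principal (uncorrelated) directions and the variances along them.
variances, directions = np.linalg.eigh(sigma)
Z = (X - mu) @ directions          # rotate to the principal axes
Z = Z / np.sqrt(variances)         # divide each coordinate by its standard deviation
euclidean_after_rotation = np.linalg.norm(Z, axis=1)

# Direct Mahalanobis computation for comparison.
diffs = X - mu
maha = np.sqrt(np.einsum('ij,jk,ik->i', diffs, np.linalg.inv(sigma), diffs))

print(np.allclose(euclidean_after_rotation, maha))   # True (up to round-off)
```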


Because of the scaling by the standard deviation, each of the independent components of the Mahalanobis distance along the principal correlation directions can be modeled as a 1-dimensional standard normal distribution with mean 0 and variance 1. The sum of the squares of d variables, drawn independently from standard normal distributions, is a variable drawn from a χ² distribution with d degrees of freedom. Therefore, the cumulative probability of the region of the χ² distribution with d degrees of freedom, for which the value is greater than Maha(X, μ, Σ)², can be reported as the extreme-value probability of X. Smaller values of this probability imply a greater likelihood of being an extreme value.
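For instance, a short sketch of this conversion (an illustrative example, not from the book; it assumes SciPy's chi2 distribution and that maha holds the Mahalanobis distance of a point to the mean of d-dimensional data) could be:

```python
# Sketch: convert a Mahalanobis distance into an extreme-value probability
# using the upper tail of the chi-squared distribution.
from scipy.stats import chi2

def extreme_value_probability(maha, d):
    # The squared Mahalanobis distance of a multivariate normal point follows
    # a chi-squared distribution with d degrees of freedom.
    return chi2.sf(maha ** 2, df=d)   # tail probability; smaller = more extreme

# Example: in d = 3 dimensions, a Mahalanobis distance of 3.5 yields a tiny probability.
print(extreme_value_probability(3.5, d=3))
```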


Intuitively, this approach models the data distribution along the various uncorrelated directions as statistically independent normal distributions and standardizes them so as to give each direction equal importance in the outlier score. In Fig. 8.3a, data point B can more reasonably be considered a multivariate extreme value than data point A, on the basis of the natural correlations in the data. On the other hand, data point B is closer to the centroid of the data than data point A on the basis of the Euclidean distance, but not on the basis of the Mahalanobis distance. This shows the utility of the Mahalanobis distance, which uses the underlying statistical distribution of the data to infer the outlier behavior of data points more effectively.


8.2.3 Depth-Based Methods

Depth-based methods are based on the general principle that the convex hull of a set of data points represents the Pareto-optimal extremes of this set. A depth-based algorithm proceeds in an iterative fashion: during the k-th iteration, all points at the corners of the convex hull of the current data set are removed and assigned a depth of k. The iteration index k also serves as an outlier score, where smaller values indicate a greater tendency for a data point to be an outlier. These steps are repeated until the data set is empty. The outlier score may be converted to a binary label by reporting all data points with depth at most r as outliers, as in the sketch below.
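A rough sketch of such a depth-based procedure (in the spirit of the FindDepthOutliers pseudocode below, but not the book's exact routine; the use of SciPy's ConvexHull and the handling of the final interior points are assumptions) is:

```python
# Sketch: depth-based outlier detection by repeatedly peeling the convex hull.
import numpy as np
from scipy.spatial import ConvexHull

def depth_based_outliers(points, r):
    """Assign a depth to every point; report indices of points with depth <= r."""
    points = np.asarray(points, dtype=float)
    n, d = points.shape
    depth = np.zeros(n, dtype=int)
    remaining = np.arange(n)                 # indices of points not yet peeled
    k = 1
    # ConvexHull needs more than d (non-degenerate) points to be well defined.
    while len(remaining) > d:
        hull = ConvexHull(points[remaining])
        corners = remaining[hull.vertices]   # corners of the current hull
        depth[corners] = k                   # these points have depth k
        remaining = np.setdiff1d(remaining, corners)
        k += 1
    depth[remaining] = k                     # any leftover interior points
    return np.where(depth <= r)[0]
```

Points peeled off in the earliest iterations (smallest depth) lie on the outermost hulls and are therefore the most likely extreme values.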



Algorithm FindDepthOutliers(Data Set: D, Score Threshold: r)


begin

