$$f_X(z_i) = \frac{1}{\sigma \cdot \sqrt{2 \cdot \pi}} \cdot e^{-\frac{z_i^2}{2}} \qquad (8.3)$$
This implies that the cumulative normal distribution may be used to determine the area of the tail that is larger than zi. As a rule of thumb, if the absolute value of the Z-number is greater than 3, the corresponding data point is considered an extreme value. At this threshold, the cumulative area inside each tail is approximately 0.135% for the normal distribution (about 0.27% for both tails combined).
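The Z-number test can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a reference implementation: the function names `z_values`, `normal_upper_tail`, and `extreme_values` are hypothetical, the population standard deviation is assumed as the estimate of σ, and the tail area is taken from the standard normal distribution through the complementary error function.

```python
import math
import statistics


def z_values(data):
    """Return the Z-number (x - mean) / std for every point in data."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)  # assumption: population standard deviation
    return [(x - mu) / sigma for x in data]


def normal_upper_tail(z):
    """Area under the standard normal density to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def extreme_values(data, threshold=3.0):
    """Return (value, Z-number, one-sided tail area) for points with |Z| >= threshold."""
    flagged = []
    for x, z in zip(data, z_values(data)):
        if abs(z) >= threshold:
            flagged.append((x, z, normal_upper_tail(abs(z))))
    return flagged


if __name__ == "__main__":
    # Twenty values close to 10 plus one value far in the upper tail.
    data = [10.0 + 0.1 * (i % 5 - 2) for i in range(20)] + [14.0]
    for value, z, tail in extreme_values(data):
        print(f"value={value} z={z:.2f} tail_area={tail:.2e}")
```

Note that a single extreme point inflates the estimated standard deviation, so in very small samples no point can reach a Z-number of 3 under this estimate.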
When only a small number n of data samples is available for estimating the mean μ and standard deviation σ, the aforementioned methodology can be used with a minor modification. The value of zi is computed as before, and the Student's t-distribution with n degrees of freedom is used instead of the normal distribution to quantify the cumulative distribution in the tail. Note that, when n is large, the t-distribution converges to the normal distribution.
Figure 8.3: Multivariate extreme values
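The small-sample variant can be illustrated by computing the tail area of the t-distribution directly. The sketch below is an assumption-laden illustration rather than a library routine: the names `t_density` and `t_upper_tail` are hypothetical, and the tail area is approximated as 0.5 minus a Simpson's-rule integral of the density over [0, z]. The degrees of freedom are passed in as a parameter so that either convention can be tried.

```python
import math


def t_density(t, dof):
    """Density of Student's t-distribution with dof degrees of freedom."""
    log_norm = (math.lgamma((dof + 1) / 2.0) - math.lgamma(dof / 2.0)
                - 0.5 * math.log(dof * math.pi))
    return math.exp(log_norm - (dof + 1) / 2.0 * math.log1p(t * t / dof))


def t_upper_tail(z, dof, steps=2000):
    """Approximate P(T > z) for z >= 0 as 0.5 minus the area on [0, z]."""
    if z < 0:
        raise ValueError("z must be non-negative; pass abs(z) for a two-sided check")
    if z == 0:
        return 0.5
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals
    h = z / steps
    total = t_density(0.0, dof) + t_density(z, dof)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(k * h, dof)
    return 0.5 - total * h / 3.0


def normal_upper_tail(z):
    """Area under the standard normal density to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    # The t tail is heavier for few samples and approaches the normal tail as dof grows.
    for dof in (5, 30, 1000):
        print(dof, t_upper_tail(3.0, dof), normal_upper_tail(3.0))
```

For a small number of degrees of freedom the tail beyond a Z-number of 3 is noticeably heavier than the normal tail, so the same threshold flags points with less statistical confidence.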
8.2.2 Multivariate Extreme Values
Strictly speaking, tails are defined for univariate distributions. However, just as the univariate tails are defined as extreme regions with probability density less than a particular threshold, an analogous concept can also be defined for multivariate distributions. The concept is more complex than in the univariate case and is defined for unimodal probability distributions with a single peak. As in the previous case, a multivariate Gaussian model is used, and the corresponding parameters are estimated in a data-driven manner. The implicit modeling assumption of multivariate extreme value analysis is that all data points are drawn from a probability distribution with a single peak (i.e., a single Gaussian cluster), and that data points lying far from the center of the cluster, in any direction, should be considered extreme values.
Let μ be the d-dimensional mean vector of a d-dimensional data set, and Σ be its d × d covariance matrix. Thus, the (i, j)th entry of the covariance matrix is equal to the covariance between the dimensions i and j. These represent the estimated parameters of the multivariate Gaussian distribution. Then, the probability distribution f(X) for a d-dimensional data point X can be defined as follows:
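The equation itself is not reproduced in this excerpt. As a hedged illustration, the sketch below assumes the standard multivariate Gaussian density, $f(X) = \frac{1}{\sqrt{|\Sigma|} \cdot (2\pi)^{d/2}} \cdot e^{-\frac{1}{2}(X-\mu)\Sigma^{-1}(X-\mu)^T}$, estimates μ and Σ from the data, and evaluates the density at each point. All function names are hypothetical, the covariance is normalized by the number of points (an assumption), and the linear system is solved by plain Gaussian elimination to keep the example free of external libraries.

```python
import math


def mean_vector(points):
    """Component-wise mean of a list of d-dimensional points."""
    n, d = len(points), len(points[0])
    return [sum(p[j] for p in points) / n for j in range(d)]


def covariance_matrix(points, mu):
    """d x d covariance matrix; entry (i, j) is the covariance of dimensions i and j."""
    n, d = len(points), len(mu)
    return [[sum((p[i] - mu[i]) * (p[j] - mu[j]) for p in points) / n
             for j in range(d)] for i in range(d)]


def solve_and_det(matrix, rhs):
    """Solve matrix * y = rhs with partial pivoting; return (y, determinant)."""
    d = len(rhs)
    a = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    det = 1.0
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= a[col][col]
        for r in range(col + 1, d):
            factor = a[r][col] / a[col][col]
            for c in range(col, d + 1):
                a[r][c] -= factor * a[col][c]
    y = [0.0] * d
    for i in range(d - 1, -1, -1):  # back substitution
        y[i] = (a[i][d] - sum(a[i][c] * y[c] for c in range(i + 1, d))) / a[i][i]
    return y, det


def gaussian_density(x, mu, sigma):
    """Multivariate Gaussian density at x for mean mu and covariance sigma."""
    d = len(mu)
    diff = [x[i] - mu[i] for i in range(d)]
    y, det = solve_and_det(sigma, diff)              # y = sigma^{-1} (x - mu)
    maha_sq = sum(diff[i] * y[i] for i in range(d))  # squared Mahalanobis distance
    return math.exp(-0.5 * maha_sq) / (math.sqrt(det) * (2.0 * math.pi) ** (d / 2.0))


if __name__ == "__main__":
    points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (5.0, 5.0)]
    mu = mean_vector(points)
    sigma = covariance_matrix(points, mu)
    # Points with the lowest density are the strongest extreme-value candidates.
    for p in sorted(points, key=lambda q: gaussian_density(q, mu, sigma)):
        print(p, gaussian_density(p, mu, sigma))
```

Ranking by density is equivalent to ranking by the quadratic form in the exponent, so the points with the lowest density are those farthest from the center of the cluster after accounting for the covariance structure.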