Data Mining: The Textbook




$$f_X(z_i) = \frac{1}{\sigma \cdot \sqrt{2 \cdot \pi}} \cdot e^{-\frac{z_i^2}{2}} \tag{8.3}$$










This implies that the cumulative normal distribution may be used to determine the area of the tail beyond z_i. As a rule of thumb, data points whose Z-numbers have an absolute value greater than 3 are considered extreme values. At this threshold, the cumulative area inside each tail is about 0.135 % for the normal distribution (roughly 0.27 % for both tails combined).
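The rule of thumb above can be sketched in a few lines of Python using only the standard library. The data values and the function names `z_numbers` and `tail_area` are hypothetical, chosen for illustration, and the population form of the standard deviation is an assumption rather than something the text prescribes.

```python
from statistics import NormalDist, mean, pstdev

def z_numbers(values):
    """Return the Z-number (x - mu) / sigma for every value."""
    mu = mean(values)
    sigma = pstdev(values, mu)  # assumption: population form (divide by n)
    return [(x - mu) / sigma for x in values]

def tail_area(z):
    """Area of one standard normal tail beyond |z|."""
    return 1.0 - NormalDist().cdf(abs(z))

# Hypothetical measurements: twelve values near 10 and one far away.
data = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 9.9, 10.0, 10.2, 9.8, 14.5]
zs = z_numbers(data)

# Flag points whose Z-number exceeds 3 in absolute value.
extreme = [x for x, z in zip(data, zs) if abs(z) > 3]
print(extreme)         # only 14.5 is flagged
print(tail_area(3.0))  # roughly 0.00135, i.e. about 0.135 % per tail
```

Note that a single very large value also inflates the estimated σ, so in small samples the largest attainable Z-number is bounded; this is one reason the threshold of 3 is only a rule of thumb.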


CHAPTER 8. OUTLIER ANALYSIS


Figure 8.3: Multivariate extreme values


When only a small number n of data samples is available for estimating the mean μ and standard deviation σ, the aforementioned methodology can be used with a minor modification. The value of z_i is computed as before, and the Student's t-distribution with n degrees of freedom is used to quantify the cumulative distribution in the tail instead of the normal distribution. Note that, when n is large, the t-distribution converges to the normal distribution.
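To illustrate the small-sample variant, the following sketch compares the tail area of the t-distribution with that of the normal distribution. Python's standard library has no t-distribution, so the tail is obtained here by numerically integrating the standard t density with Simpson's rule; the helper names `t_pdf` and `t_upper_tail` are hypothetical, and the degrees of freedom are passed in as the parameter `dof` (the text uses n).

```python
import math
from statistics import NormalDist

def t_pdf(x, dof):
    """Density of Student's t-distribution with `dof` degrees of freedom."""
    log_norm = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
                - 0.5 * math.log(dof * math.pi))
    return math.exp(log_norm - (dof + 1) / 2 * math.log1p(x * x / dof))

def t_upper_tail(z, dof, steps=2000):
    """P(T > z) for z >= 0: Simpson's rule on [0, z], then symmetry.

    `steps` must be even for Simpson's rule.
    """
    h = z / steps
    total = t_pdf(0.0, dof) + t_pdf(z, dof)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, dof)
    return 0.5 - total * h / 3

normal_tail = 1.0 - NormalDist().cdf(3.0)
for dof in (5, 10, 30, 1000):
    # The t tail is heavier for small dof and approaches the normal tail.
    print(dof, t_upper_tail(3.0, dof), normal_tail)
```

With few samples the same Z-number of 3 corresponds to a noticeably larger tail area than under the normal model, so a point is less surprising than the normal distribution would suggest.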


8.2.2 Multivariate Extreme Values

Strictly speaking, tails are defined for univariate distributions. However, just as the univariate tails are defined as extreme regions with probability density less than a particular threshold, an analogous concept can also be defined for multivariate distributions. The concept is more complex than in the univariate case and is defined for unimodal probability distributions with a single peak. As in the previous case, a multivariate Gaussian model is used, and the corresponding parameters are estimated in a data-driven manner. The implicit modeling assumption of multivariate extreme value analysis is that all data points are drawn from a probability distribution with a single peak (i.e., a single Gaussian cluster), and that data points lying as far away as possible from the center of the cluster, in any direction, should be considered extreme values.


Let μ be the d-dimensional mean vector of a d-dimensional data set, and let Σ be its d × d covariance matrix. Thus, the (i, j)th entry of the covariance matrix is equal to the covariance between dimensions i and j. These represent the estimated parameters of the multivariate Gaussian distribution. Then, the probability distribution f(X) for a d-dimensional data point X can be defined as follows:














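$$f(X) = \frac{1}{\sqrt{|\Sigma|} \cdot (2 \cdot \pi)^{d/2}} \cdot e^{-\frac{1}{2} \cdot (X - \mu) \, \Sigma^{-1} \, (X - \mu)^T}$$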
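As a minimal sketch of how the mean vector μ, the covariance matrix Σ, and the multivariate Gaussian density f(X) fit together, the following Python code handles the two-dimensional case (d = 2) with the standard library only. The sample points and the function names are hypothetical, and dividing by n in the covariance estimate is an assumption; the text does not say which estimator is intended.

```python
import math

def mean_vector(points):
    """Component-wise mean of a list of d-dimensional points."""
    n, d = len(points), len(points[0])
    return [sum(p[j] for p in points) / n for j in range(d)]

def covariance_matrix(points, mu):
    """d x d matrix whose (i, j)th entry is the covariance of dimensions i and j."""
    n, d = len(points), len(mu)
    return [[sum((p[i] - mu[i]) * (p[j] - mu[j]) for p in points) / n
             for j in range(d)] for i in range(d)]

def gaussian_density_2d(x, mu, cov):
    """Multivariate Gaussian density f(X), restricted to d = 2."""
    (a, b), (c, e) = cov
    det = a * e - b * c                      # |Sigma|
    inv = [[e / det, -b / det],              # Sigma^{-1} for a 2 x 2 matrix
           [-c / det, a / det]]
    diff = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (X - mu) Sigma^{-1} (X - mu)^T in the exponent.
    q = sum(diff[i] * inv[i][j] * diff[j] for i in range(2) for j in range(2))
    return math.exp(-0.5 * q) / (math.sqrt(det) * (2 * math.pi) ** (2 / 2))

# Hypothetical single-cluster data set.
points = [(1.0, 1.2), (1.5, 1.4), (2.0, 2.1), (2.5, 2.4),
          (3.0, 3.1), (2.0, 1.9), (1.8, 2.2), (2.2, 1.8)]
mu = mean_vector(points)
cov = covariance_matrix(points, mu)

print(gaussian_density_2d(mu, mu, cov))          # density at the center (the peak)
print(gaussian_density_2d((3.0, 1.0), mu, cov))  # lower density away from the center
```

Points with a low density under the fitted model lie far from the center of the cluster, which is the sense in which they are candidates for multivariate extreme values.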























































