Data Mining: The Textbook


page	148/423
date	07.01.2024
size	17.13 MB
	#211690


1-Data Mining tarjima

$$f_X(z_i) = \frac{1}{\sigma \cdot \sqrt{2 \cdot \pi}} \cdot e^{-\frac{z_i^2}{2}} \tag{8.3}$$

This implies that the cumulative normal distribution may be used to determine the area of the tail that is larger than z_i. As a rule of thumb, if the absolute value of the Z-number is greater than 3, the corresponding data point is considered an extreme value. At this threshold, the cumulative area inside each tail is approximately 0.135% for the normal distribution (about 0.27% for both tails combined).
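The Z-number test described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the book's implementation: the function names (`z_numbers`, `upper_tail_area`, `extreme_values`) and the sample readings are hypothetical, and the population standard deviation is assumed as the estimate of σ.

```python
import math
from statistics import mean, pstdev


def z_numbers(values):
    """Return the Z-number (x - mu) / sigma of every value."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("all values are identical; Z-numbers are undefined")
    return [(x - mu) / sigma for x in values]


def upper_tail_area(z):
    """Area of the standard normal distribution lying above z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def extreme_values(values, threshold=3.0):
    """Return (value, Z-number) pairs whose |Z-number| exceeds the threshold."""
    return [(x, z) for x, z in zip(values, z_numbers(values)) if abs(z) > threshold]


# Hypothetical readings: nineteen ordinary values and one far from the rest.
readings = [10.0] * 10 + [11.0] * 9 + [50.0]
flagged = extreme_values(readings)
for value, z in flagged:
    print(f"value={value}, z={z:.2f}, upper tail area={upper_tail_area(abs(z)):.2e}")
```

The tail area is computed with the complementary error function because the standard normal upper tail equals erfc(z / √2) / 2, which avoids any dependency outside the standard library.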

When only a small number n of data samples is available for estimating the mean μ and standard deviation σ, the aforementioned methodology can be used with a minor modification. The value of z_i is computed as before, and the Student's t-distribution with n degrees of freedom is used, instead of the normal distribution, to quantify the cumulative distribution in the tail. Note that, when n is large, the t-distribution converges to the normal distribution.

Figure 8.3: Multivariate extreme values

8.2.2 Multivariate Extreme Values

Strictly speaking, tails are defined for univariate distributions. However, just as the univariate tails are defined as extreme regions with probability density less than a particular threshold, an analogous concept can also be defined for multivariate distributions. The concept is more complex than the univariate case and is defined for unimodal probability distributions with a single peak. As in the previous case, a multivariate Gaussian model is used, and the corresponding parameters are estimated in a data-driven manner. The implicit modeling assumption of multivariate extreme value analysis is that all data points are located in a probability distribution with a single peak (i.e., single Gaussian cluster), and data points in all directions that are as far away as possible from the center of the cluster should be considered extreme values.

Let μ be the d-dimensional mean vector of a d-dimensional data set, and Σ be its d × d covariance matrix. Thus, the (i, j)th entry of the covariance matrix is equal to the covariance between the dimensions i and j. These represent the estimated parameters of the multivariate Gaussian distribution. Then, the probability distribution f(X) for a d-dimensional data point X can be defined as follows:

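$$f(X) = \frac{1}{\sqrt{|\Sigma|} \cdot (2 \cdot \pi)^{d/2}} \cdot e^{-\frac{1}{2} \cdot (X - \mu) \Sigma^{-1} (X - \mu)^T}$$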
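As a rough illustration of the data-driven estimation described above, the following sketch estimates μ and Σ from a small data set and evaluates the multivariate Gaussian density at a point. It is a sketch under stated assumptions rather than the book's code: the function names are hypothetical, the covariance is estimated by dividing by n (the maximum-likelihood estimate), and the density routine is restricted to d = 2 so that the inverse and determinant of Σ can be written out by hand.

```python
import math


def mean_vector(points):
    """Estimate the d-dimensional mean vector mu."""
    n, d = len(points), len(points[0])
    return [sum(p[j] for p in points) / n for j in range(d)]


def covariance_matrix(points, mu):
    """Estimate the d x d covariance matrix Sigma (assumption: divide by n)."""
    n, d = len(points), len(mu)
    return [[sum((p[i] - mu[i]) * (p[j] - mu[j]) for p in points) / n
             for j in range(d)] for i in range(d)]


def gaussian_density_2d(x, mu, cov):
    """Evaluate the Gaussian density f(X) for d = 2 only."""
    (a, b), (c, e) = cov
    det = a * e - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    inv = [[e / det, -b / det], [-c / det, a / det]]  # closed-form 2 x 2 inverse
    diff = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (X - mu) Sigma^{-1} (X - mu)^T
    quad = sum(diff[i] * inv[i][j] * diff[j] for i in range(2) for j in range(2))
    norm = 1.0 / (math.sqrt(det) * (2.0 * math.pi) ** (2 / 2))  # d / 2 with d = 2
    return norm * math.exp(-0.5 * quad)


# Hypothetical data set: four corners of a square centered at (1, 1).
pts = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
mu = mean_vector(pts)
cov = covariance_matrix(pts, mu)
center_density = gaussian_density_2d(mu, mu, cov)
far_density = gaussian_density_2d((6.0, 6.0), mu, cov)
print(mu, cov, center_density, far_density)
```

Points far from the center of the cluster receive a much lower density than points near μ, which is the property that multivariate extreme value analysis relies on.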
