Data Mining: The Textbook




begin
    k = 1;
    repeat
        Find set S of corners of convex hull of D;
        Assign depth k to points in S;
        D = D − S; k = k + 1;
    until (D is empty);
    Report points with depth at most r as outliers;
end

Figure 8.4: Depth-based methods
The value of r may itself need to be determined by univariate extreme value analysis. The steps of the depth-based approach are illustrated in Fig. 8.4.
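The procedure of Fig. 8.4 can be rendered directly in code. The following is a minimal sketch (not the book's implementation), assuming low-dimensional numeric data and using scipy.spatial.ConvexHull; the function name and the handling of the last few points (too few to form a hull) are illustrative choices, and degenerate point configurations may need extra care:

    # Depth-based outlier detection by repeated convex-hull peeling.
    import numpy as np
    from scipy.spatial import ConvexHull

    def depth_based_outliers(D, r):
        D = np.asarray(D, dtype=float)
        depth = np.zeros(len(D), dtype=int)
        remaining = np.arange(len(D))       # indices of points not yet peeled
        k = 1
        while len(remaining) > 0:
            pts = D[remaining]
            if len(pts) <= D.shape[1]:      # too few points to form a hull
                hull = np.arange(len(pts))  # treat the leftovers as one layer
            else:
                hull = ConvexHull(pts).vertices
            depth[remaining[hull]] = k      # corners of current hull get depth k
            remaining = np.delete(remaining, hull)
            k += 1
        return np.where(depth <= r)[0]      # report points with depth at most r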
The depth-based method is illustrated pictorially in Fig. 8.5. The process can be viewed as analogous to peeling the different layers of an onion (as shown in Fig. 8.5b), where the outermost layers define the outliers. Depth-based methods try to achieve the same goal as the multivariate method of the previous section, but they generally tend to be less effective, both in terms of quality and computational efficiency. From a qualitative perspective, depth-based methods do not normalize for the characteristics of the statistical data distribution, as multivariate methods based on the Mahalanobis distance do. All data points at the corners of a convex hull are treated equally. This is clearly not desirable, and the scores of many data points are indistinguishable because of ties. Furthermore, the fraction of data points at the corners of the convex hull generally increases with dimensionality. For very high dimensionality, it is not uncommon for the majority of the data points to be located at the corners of the outermost convex hull. As a result, it is no longer possible to distinguish the outlier scores of different data points. The computational complexity of convex-hull methods also increases significantly with dimensionality. This combination of qualitative and computational issues makes the method a poor alternative to the multivariate methods based on the Mahalanobis distance.
8.3 Probabilistic Models

Probabilistic models are based on a generalization of the multivariate extreme value analysis methods discussed in Sect. 8.2.2. The Mahalanobis distance-based multivariate extreme value analysis method can be viewed as a Gaussian mixture model with a single component in the mixture. By generalizing this model to multiple mixture components, it is possible to determine general outliers, rather than multivariate extreme values. This idea is intimately related to the EM clustering algorithm discussed in Sect. 6.5 of Chap. 6. At an intuitive level, data points that do not naturally fit any cluster in the probabilistic sense may be reported as outliers. The reader is referred to Sect. 6.5 of Chap. 6 for a more detailed discussion of the EM algorithm, though a brief outline is provided here for convenience.
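At an implementation level, this intuition can be tested with any off-the-shelf EM-based mixture fit. The following is a minimal sketch, assuming scikit-learn's GaussianMixture (which is fitted with EM); the two-component choice and the 1% density cutoff are illustrative assumptions, not part of the method itself:

    # Flag points with low likelihood under an EM-fitted Gaussian mixture.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    D = np.vstack([rng.normal(0, 1, size=(100, 2)),   # cluster 1
                   rng.normal(6, 1, size=(100, 2)),   # cluster 2
                   [[20.0, 20.0]]])                   # an obvious outlier

    model = GaussianMixture(n_components=2, random_state=0).fit(D)
    log_density = model.score_samples(D)              # per-point log-likelihood
    outliers = D[log_density < np.quantile(log_density, 0.01)]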


The broad principle of a mixture-based generative model is to assume that the data were generated from a mixture of k distributions with probability distributions G1 . . . Gk, based on the following process:




Figure 8.5: Depth-based outlier detection





  1. Select a mixture component with prior probability αi, where i ∈ {1 . . . k}. Assume that the rth one is selected.




  2. Generate a data point from Gr.

This generative model will be denoted by M, and it generates each point in the data set D. The data set D is used to estimate the parameters of the model. Although it is natural to use Gaussians to represent each component of the mixture, other models may be used if desired. This flexibility is very useful for applying the approach to different data types. For example, in a categorical data set, a categorical probability distribution may be used for each mixture component instead of the Gaussian distribution. After the parameters of the model have been estimated, outliers are defined as those data points in D that are highly unlikely to be generated by this model. Note that such an assumption exactly reflects Hawkins's definition of outliers, as stated at the beginning of this chapter.
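To make the two-step generative process concrete, the following is a minimal sketch of sampling from M with Gaussian components; the prior probabilities, means, and covariances are purely illustrative assumptions:

    # Sampling from the generative model M with two Gaussian components.
    import numpy as np

    rng = np.random.default_rng(42)
    alpha = np.array([0.6, 0.4])                 # prior probabilities alpha_i
    means = [np.zeros(2), np.array([5.0, 5.0])]  # assumed component means
    covs = [np.eye(2), 0.5 * np.eye(2)]          # assumed component covariances

    def generate_point():
        r = rng.choice(len(alpha), p=alpha)      # step 1: select component r
        return rng.multivariate_normal(means[r], covs[r])  # step 2: draw from Gr

    D = np.array([generate_point() for _ in range(1000)])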


Next, we discuss the estimation of the parameters of the model, such as the values of αi and the parameters of the different distributions Gi. The objective of this estimation process is to ensure that the full data set D has the maximum likelihood fit to the generative model. Assume that the density function of Gi is given by f^i(·). The probability (density function) of the data point Xj being generated by the model is given by the following:

    f^point(Xj | M) = Σ_{i=1}^{k} αi · f^i(Xj)
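Continuing the illustrative Gaussian sketch above, this density can be evaluated directly as the α-weighted sum of the component densities:

    # Evaluate f^point(Xj | M) as a weighted sum of component densities.
    from scipy.stats import multivariate_normal

    def point_density(x, alpha, means, covs):
        return sum(a * multivariate_normal.pdf(x, mean=m, cov=c)
                   for a, m, c in zip(alpha, means, covs))

Points whose density under M is unusually low are then reported as outliers.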