Algorithm FindDepthOutliers(Data Set: D, Score Threshold: r)
begin
  k = 1;
  repeat
    Find set S of corners of convex hull of D;
    Assign depth k to points in S;
    Remove points in S from D;
    k = k + 1;
  until (D is empty);
  Report points with depth at most r as outliers;
end
Figure 8.4: Depth-based methods
The value of r may itself need to be determined by univariate extreme value analysis. The steps of the depth-based approach are illustrated in Fig. 8.4.
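For concreteness, the following Python sketch implements the peeling procedure of Fig. 8.4 using SciPy's ConvexHull. The function name depth_outliers and the fallback for degenerate residual point sets are illustrative choices, and the sketch assumes a recent SciPy that exposes scipy.spatial.QhullError:

import numpy as np
from scipy.spatial import ConvexHull, QhullError

def depth_outliers(D, r):
    # Returns indices of points whose peeling depth is at most r.
    remaining = np.arange(len(D))     # indices of points not yet peeled
    depth = np.zeros(len(D), dtype=int)
    k = 1
    while len(remaining) > 0:
        try:
            hull = ConvexHull(D[remaining])
            corners = remaining[hull.vertices]   # set S: corners of current hull
        except QhullError:
            # Too few (or degenerate) points left to form a hull;
            # assign the current depth to all of them and stop.
            corners = remaining
        depth[corners] = k                       # assign depth k to points in S
        remaining = np.setdiff1d(remaining, corners)   # remove S from D
        k += 1
    return np.where(depth <= r)[0]

# Example: depth-1 points (corners of the outermost hull) of a 2-D Gaussian cloud
rng = np.random.default_rng(0)
D = rng.standard_normal((200, 2))
print(depth_outliers(D, r=1))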
A pictorial illustration of the depth-based method is provided in Fig. 8.5. The process can be viewed as analogous to peeling the different layers of an onion (as shown in Fig. 8.5b), where the outermost layers define the outliers. Depth-based methods try to achieve the same goal as the multivariate method of the previous section, but they generally tend to be less effective, both in terms of quality and computational efficiency. From a qualitative perspective, depth-based methods do not normalize for the characteristics of the statistical data distribution, as multivariate methods based on the Mahalanobis distance do. All data points at the corners of a convex hull are treated equally. This is clearly not desirable, and the scores of many data points are indistinguishable because of ties. Furthermore, the fraction of data points at the corners of the convex hull generally increases with dimensionality. For very high dimensionality, it is not uncommon for the majority of the data points to be located at the corners of the outermost convex hull. As a result, it is no longer possible to distinguish the outlier scores of different data points. The computational complexity of convex-hull methods also increases significantly with dimensionality. The combination of qualitative and computational issues makes this method a poor alternative to the multivariate methods based on the Mahalanobis distance.
8.3 Probabilistic Models
Probabilistic models are based on a generalization of the multivariate extreme value analysis methods discussed in Sect. 8.2.2. The Mahalanobis distance-based multivariate extreme value analysis method can be viewed as a Gaussian mixture model with a single component in the mixture. By generalizing this model to multiple mixture components, it is possible to determine general outliers, rather than multivariate extreme values. This idea is intimately related to the EM clustering algorithm discussed in Sect. 6.5 of Chap. 6. At an intuitive level, data points that do not naturally fit any cluster in the probabilistic sense may be reported as outliers. The reader is referred to Sect. 6.5 of Chap. 6 for a more detailed discussion of the EM algorithm, though a brief outline is provided here for convenience.
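The single-component view can be checked numerically: up to an additive constant, the squared Mahalanobis distance to the data mean equals minus twice the log-density of a single Gaussian fitted to the data, so both produce the same ranking of extreme values. A minimal sketch, assuming NumPy and SciPy (the sampled data set is illustrative):

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
D = rng.multivariate_normal([0.0, 0.0], [[2.0, 0.6], [0.6, 1.0]], size=500)

mu = D.mean(axis=0)
Sigma = np.cov(D, rowvar=False)
diff = D - mu
maha_sq = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)

log_dens = multivariate_normal(mu, Sigma).logpdf(D)
# maha_sq + 2 * log_dens is the same constant (the log-normalizer) for every
# point, so ranking by Mahalanobis distance is equivalent to ranking by
# negative Gaussian log-density.
print(np.ptp(maha_sq + 2.0 * log_dens))   # ~0, up to floating-point error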
Figure 8.5: Depth-based outlier detection

The broad principle of a mixture-based generative model is to assume that the data were generated from a mixture of k distributions with probability distributions G1 . . . Gk, based on the following process:

1. Select a mixture component with prior probability αi, where i ∈ {1 . . . k}. Assume that the rth component is selected.

2. Generate a data point from Gr.
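A minimal Python sketch of this two-step process with k = 2 Gaussian components (the priors, means, and covariances below are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
alpha = [0.6, 0.4]                            # prior probabilities alpha_i
means = [np.zeros(2), np.array([5.0, 5.0])]   # mean of each G_i
covs = [np.eye(2), 0.5 * np.eye(2)]           # covariance of each G_i

def generate_point():
    r = rng.choice(len(alpha), p=alpha)                # step 1: pick component r
    return rng.multivariate_normal(means[r], covs[r])  # step 2: draw from G_r

D = np.array([generate_point() for _ in range(1000)])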
This generative model will be denoted by M, and it generates each point in the data set D. The data set D is used to estimate the parameters of the model. Although it is natural to use Gaussians to represent each component of the mixture, other models may be used if desired. This flexibility is very useful for applying the approach to different data types. For example, in a categorical data set, a categorical probability distribution may be used for each mixture component instead of the Gaussian distribution. After the parameters of the model have been estimated, outliers are defined as those data points in D that are highly unlikely to be generated by this model. Note that such an assumption exactly reflects Hawkins’s definition of outliers, as stated at the beginning of this chapter.
Next, we discuss the estimation of the model parameters, such as the values of αi and the parameters of the distributions G1 . . . Gk. The objective of this estimation process is to maximize the likelihood fit of the full data set D to the generative model. Assume that the density function of Gi is given by f^i(·). The probability (density function) of the data point Xj being generated by the model is given by the following:

f(Xj | M) = Σ_{i=1}^{k} αi · f^i(Xj)
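Once the parameters have been estimated by EM, each point's fit f(Xj | M) (or its logarithm) serves directly as an outlier score, with low values flagged as outliers. A minimal sketch, assuming scikit-learn's GaussianMixture as the EM implementation; the data set, component count, and the bottom-1-percent cutoff are illustrative:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Two well-separated Gaussian clusters plus two off-cluster points
D = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(8.0, 1.0, size=(100, 2)),
               [[4.0, 12.0], [-6.0, 7.0]]])

model = GaussianMixture(n_components=2).fit(D)   # EM estimation of alpha_i, G_i
log_fit = model.score_samples(D)                 # log f(Xj | M) for each point

# Report the points with the lowest fit to the model as outliers
cutoff = np.percentile(log_fit, 1)               # bottom 1 percent (illustrative)
print(D[log_fit <= cutoff])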