8.4. CLUSTERING FOR OUTLIER DETECTION
|
247
|
Figure 8.7: Small isolated groups of anomalies
points outside the clusters. The detection of outliers as a side-product of clustering methods is, however, not an appropriate approach because clustering algorithms are not optimized for outlier detection. Data points on the boundary regions of a cluster may also be considered weak outliers but are rarely useful in most application-specific scenarios.
Clustering models do have some advantages as well. Outliers often tend to occur in small clusters of their own. This is because the anomaly in the generating process may be repeated a few times. As a result, a small group of related outliers may be created. An example of a small set of isolated outliers is illustrated in Fig. 8.7. As will be discussed later, clustering methods are generally robust to such scenarios because such groups often do not have the critical mass required to form clusters of their own.
A simple way of defining the outlier score of a data point is to first cluster the data set and then use the raw distance of the data point to its closest cluster centroid. One can, however, do better when the clusters are elongated or have varying density over the data set. As discussed in Chap. 3, the local data distribution often distorts the distances, and, therefore, it is not optimal to use the raw distance. This broader principle is used in multivariate extreme value analysis where the global Mahalanobis distance defines outlier scores. In this case, the local Mahalanobis distance can be used with respect to the centroid of the closest cluster.
Consider a data set in which k clusters have been discovered with the use of a clus-tering algorithm. Assume that the rth cluster in d-dimensional space has a corresponding d-dimensional mean vector μr, and a d × d covariance matrix Σr . The (i, j)th entry of this matrix is the covariance between the dimensions i and j in that cluster. Then, the Maha-lanobis distance M aha(X, μr, Σr) between a data point X and cluster centroid μr is defined as follows:
M aha(
|
|
|
|
, Σr) = (
|
|
−
|
|
)Σr−1(
|
|
−
|
|
)T .
|
(8.9)
|
|
X,
|
X
|
X
|
|
μr
|
μr
|
μr
|
|
This distance is reported as the outlier score. Larger values of the outlier score indicate a greater outlier tendency. After the outlier score has been determined, univariate extreme value analysis may be used to convert the scores to binary labels.
The justification for using the Mahalanobis distance is exactly analogous to the case of extreme value analysis of multivariate distances, as discussed in Sect. 8.2. The only difference is that the local cluster-specific Mahalanobis distances are more relevant to determination of
248 CHAPTER 8. OUTLIER ANALYSIS
general outliers, whereas global Mahalanobis distances are more relevant to determination of specific types of outliers, such as extreme values. The use of the local Mahalanobis distance also has an interesting connection to the likelihood fit criterion of EM algorithm where the (squared) Mahalanobis distance occurs in the exponent of each Gaussian mixture. Thus, the sum of the inverse exponentiated Mahalanobis distances of a data point to different mixture component means (cluster means) are used to determine the outlier score in the EM algorithm. Such a score can be viewed as a soft version of the score determined by hard clustering algorithms.
Clustering methods are based on global analysis. Therefore, small, closely related groups of data points will not form their own clusters in most cases. For example, the four isolated points in Fig. 8.7 will not typically be considered a cluster. Most clustering algorithms require a minimum critical mass for a set of data points to be considered a cluster. As a result, these points will have a high outlier score. This means that clustering methods are able to detect these small and closely related groups of data points meaningfully and report them as outliers. This is not the case for some of the density-based methods that are based purely on local analysis.
The major problem with clustering algorithms is that they are sometimes not able to properly distinguish between a data point that is ambient noise and a data point that is a truly isolated anomaly. Clearly, the latter is a much stronger anomaly than the former. Both these types of points will not reside in a cluster. Therefore, the distance to the closest cluster centroid will often not be very representative of their local distribution (or instance-specific distribution). In these cases, distance-based methods are more effective.
8.5 Distance-Based Outlier Detection
Because outliers are defined as data points that are far away from the “crowded regions” (or clusters) in the data, a natural and instance-specific way of defining an outlier is as follows:
Dostları ilə paylaş: |