Data Mining: The Textbook




f^{point}(X_j | M) = \sum_{i=1}^{k} \alpha_i \cdot f^{i}(X_j).    (8.6)

Note that the density value f^{point}(X_j | M) provides an estimate of the outlier score of the data point: data points that are outliers will naturally have low fit values. The relationship between fit values and outlier scores is illustrated in Fig. 8.6. Data points A and B will typically have very low fit to the mixture model and will be considered outliers, because they do not naturally belong to any of the mixture components. Data point C will have a high fit to the mixture model and will therefore not be considered an outlier. The parameters of the model M are estimated with a maximum-likelihood criterion, which is discussed below.
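As a concrete illustration, the mixture density of Eq. 8.6 can be evaluated directly once the parameters are known. The sketch below uses a hypothetical univariate mixture with k = 2 components; the parameter values and test points are illustrative assumptions, not taken from the text. It shows that a point far from all components receives a much lower density (fit value) than a point near a cluster center, which is exactly why low values of f^{point} indicate outliers.

```python
import math

def gaussian_pdf(x, mu, sigma):
    # Density of a univariate Gaussian N(mu, sigma^2) at x.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def f_point(x, alphas, mus, sigmas):
    # Eq. 8.6: mixture density sum_i alpha_i * f_i(x).
    return sum(a * gaussian_pdf(x, m, s) for a, m, s in zip(alphas, mus, sigmas))

# Hypothetical k = 2 mixture; parameter values are illustrative assumptions.
alphas, mus, sigmas = [0.6, 0.4], [0.0, 10.0], [1.0, 1.0]

inlier_score  = f_point(0.5,  alphas, mus, sigmas)   # near a component center
outlier_score = f_point(30.0, alphas, mus, sigmas)   # far from both components

print(inlier_score > outlier_score)  # prints True
```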


246 CHAPTER 8. OUTLIER ANALYSIS


Figure 8.6: Likelihood fit values versus outlier scores


For data set D containing n data points, denoted by X_1 . . . X_n, the probability density of the data set being generated by model M is the product of the various point-specific probability densities:



f^{data}(D | M) = \prod_{j=1}^{n} f^{point}(X_j | M).    (8.7)

The log-likelihood fit L(D|M) of the data set D with respect to M is the logarithm of the aforementioned expression, and can be (more conveniently) represented as a sum of values over the different data points:





L(D | M) = \sum_{j=1}^{n} \log\left( f^{point}(X_j | M) \right) = \sum_{j=1}^{n} \log\left( \sum_{i=1}^{k} \alpha_i \cdot f^{i}(X_j) \right).    (8.8)

This log-likelihood fit needs to be optimized to determine the model parameters; maximizing it maximizes the fit of the data points to the generative model. For this purpose, the EM algorithm discussed in Sect. 6.5 of Chap. 6 is used.
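A minimal sketch of such an EM procedure for a univariate mixture of two Gaussians is shown below. The synthetic data, the deterministic initialization at the data extremes, and the fixed iteration count are all illustrative assumptions; the book's treatment in Sect. 6.5 is the authoritative description. Each iteration computes posterior responsibilities (E-step) and then re-estimates the priors, means, and standard deviations (M-step), which is guaranteed not to decrease the log-likelihood of Eq. 8.8.

```python
import math
import random

def pdf(x, mu, sigma):
    # Univariate Gaussian density.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def log_likelihood(data, alphas, mus, sigmas):
    # Eq. 8.8: sum_j log( sum_i alpha_i * f_i(X_j) ).
    return sum(math.log(sum(a * pdf(x, m, s) for a, m, s in zip(alphas, mus, sigmas)))
               for x in data)

def em_fit(data, iters=50):
    # k = 2 components; deterministic initialization: means at the data extremes.
    alphas, mus, sigmas = [0.5, 0.5], [min(data), max(data)], [1.0, 1.0]
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in data:
            w = [a * pdf(x, m, s) for a, m, s in zip(alphas, mus, sigmas)]
            tot = sum(w)
            resp.append([wi / tot for wi in w])
        # M-step: re-estimate priors, means, and standard deviations.
        for i in range(2):
            ni = sum(r[i] for r in resp)
            alphas[i] = ni / len(data)
            mus[i] = sum(r[i] * x for r, x in zip(resp, data)) / ni
            var = sum(r[i] * (x - mus[i]) ** 2 for r, x in zip(resp, data)) / ni
            sigmas[i] = max(math.sqrt(var), 1e-3)  # floor to avoid degenerate components
    return alphas, mus, sigmas

# Synthetic two-cluster data (an illustrative assumption, not from the text).
random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(100)] +
        [random.gauss(8.0, 1.0) for _ in range(100)])
alphas, mus, sigmas = em_fit(data)
```

On this well-separated data the fitted means land near the true cluster centers 0 and 8, and the final log-likelihood exceeds that of the initial parameter guess.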


After the parameters of the model have been determined, the value of f point(Xj |M) (or its logarithm) may be reported as the outlier score. The major advantage of such mixture models is that the mixture components can also incorporate domain knowledge about the shape of each individual mixture component. For example, if it is known that the data points in a particular cluster are correlated in a certain way, then this fact can be incorporated in the mixture model by fixing the appropriate parameters of the covariance matrix, and learning the remaining parameters. On the other hand, when the available data is limited, mixture models may overfit the data. This will cause data points that are truly outliers to be missed.
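Once the parameters have been determined, reporting log f^{point}(X_j | M) as the outlier score reduces to a density lookup per point. The sketch below uses assumed parameter values standing in for a previously fitted model, and a tiny hypothetical data set with one planted outlier; the point with the lowest log-density is flagged.

```python
import math

def log_density(x, alphas, mus, sigmas):
    # Outlier score: log f_point(x | M) under the (assumed already fitted) mixture.
    return math.log(sum(
        a * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for a, m, s in zip(alphas, mus, sigmas)))

# Parameters standing in for a previously fitted model (illustrative values).
alphas, mus, sigmas = [0.5, 0.5], [0.0, 8.0], [1.0, 1.0]

data = [-0.3, 0.1, 0.4, 7.8, 8.2, 8.5, 20.0]      # 20.0 is a planted outlier
scores = {x: log_density(x, alphas, mus, sigmas) for x in data}
flagged = min(scores, key=scores.get)             # point with the lowest log-density
print(flagged)  # prints 20.0
```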


8.4 Clustering for Outlier Detection


The probabilistic algorithm of the previous section provides a preview of the relationship between clustering and outlier detection. Clustering is all about finding “crowds” of data points, whereas outlier analysis is all about finding data points that are far away from these crowds. Clustering and outlier detection, therefore, share a well-known complementary relationship. A simplistic view is that every data point is either a member of a cluster or an outlier. Clustering algorithms often have an “outlier handling” option that removes data



