9.2. OUTLIER DETECTION WITH CATEGORICAL DATA







The fit probability of the data point $\overline{X}$ to the $r$th component is given by $\alpha_r \cdot g_{r,\Theta}(\overline{X})$. Therefore, the sum of the fits over all components is given by $\sum_{r=1}^{k} \alpha_r \cdot g_{r,\Theta}(\overline{X})$. This fit represents the likelihood of the data point being generated by the model. Therefore, it is used as the outlier score. However, in order to compute this fit value, one needs to estimate the parameters $\Theta$. This is achieved with the EM algorithm.
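A minimal sketch of this scoring step, assuming that each mixture component generates the attributes independently, so that $g_{r,\Theta}(\overline{X})$ is a product of per-attribute categorical probabilities $p_{ijr}$ (as in the M-step described below); the names fit_score, alpha, and p are illustrative, not from the text:

```python
import numpy as np

def fit_score(x, alpha, p):
    """Mixture fit of a categorical record x; low values indicate outliers.

    x     : list of value indices, one per attribute
    alpha : array of shape (k,) with the component prior probabilities alpha_r
    p     : p[r][i][j] = probability that attribute i takes value j in component r
    """
    k = len(alpha)
    fit = 0.0
    for r in range(k):
        # g_{r,Theta}(x): product of the per-attribute probabilities in component r
        g = np.prod([p[r][i][x[i]] for i in range(len(x))])
        fit += alpha[r] * g
    return fit
```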
The assignment (or posterior) probability $P(G_m|\overline{X}, \Theta)$ for the $m$th cluster may be estimated as follows:

$$P(G_m|\overline{X}, \Theta) = \frac{\alpha_m \cdot g_{m,\Theta}(\overline{X})}{\sum_{r=1}^{k} \alpha_r \cdot g_{r,\Theta}(\overline{X})} \quad (9.2)$$

This step provides a soft assignment probability of the data point to a cluster, and it corresponds to the E-step.
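Under the same independence assumption as the earlier sketch, Eq. 9.2 amounts to normalizing the per-component fits; a sketch (e_step is an illustrative name):

```python
import numpy as np

def e_step(x, alpha, p):
    """E-step (Eq. 9.2): posterior P(G_m | x, Theta) for each component m."""
    k = len(alpha)
    joint = np.array([alpha[m] * np.prod([p[m][i][x[i]] for i in range(len(x))])
                      for m in range(k)])
    return joint / joint.sum()  # normalize over all k components
```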


The soft-assignment probability is used to estimate the probability $p_{ijm}$. While estimating the parameters for cluster $m$, the weight of a data point is assumed to be equal to its assignment probability $P(G_m|\overline{X}, \Theta)$ to cluster $m$. The value $\alpha_m$ is estimated to be the average assignment probability to cluster $m$ over all data points. For each cluster $m$, the weighted number $w_{ijm}$ of records for which the $i$th attribute takes on the $j$th possible discrete value in cluster $m$ is estimated. The value of $w_{ijm}$ is estimated as the sum of the posterior probabilities $P(G_m|\overline{X}, \Theta)$ over all records $\overline{X}$ in which the $i$th attribute takes on the $j$th value. Then, the value of the probability $p_{ijm}$ may be estimated as follows:




$$p_{ijm} = \frac{w_{ijm}}{\sum_{\overline{X} \in \mathcal{D}} P(G_m|\overline{X}, \Theta)} \quad (9.3)$$

When the number of data points is small, the estimation of Eq. 9.3 can be difficult to perform in practice. In such cases, some of the attribute values may not appear in a cluster at all (that is, $w_{ijm} = 0$). This situation can lead to poor parameter estimation, or overfitting. The Laplacian smoothing method is commonly used to address such ill-conditioned probabilities. Let $m_i$ be the number of distinct values of categorical attribute $i$. In Laplacian smoothing, a small value $\beta$ is added to the numerator, and $m_i \cdot \beta$ is added to the denominator of Eq. 9.3. Here, $\beta$ is a parameter that controls the level of smoothing. This form of smoothing is also sometimes applied in the estimation of the prior probabilities $\alpha_m$ when the data sets are very small. This completes the description of the M-step. As in the case of numerical data, the E- and M-steps are iterated to convergence. The maximum likelihood fit value is reported as the outlier score.
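A sketch of this M-step with Laplacian smoothing, under the same assumptions as the fragments above; m_step, post, and n_values are illustrative names:

```python
import numpy as np

def m_step(data, post, n_values, beta=1.0):
    """M-step with Laplacian smoothing (Eq. 9.3 plus the beta/m_i correction).

    data     : integer array of shape (n, d); data[t, i] is the value index
               taken by attribute i in record t
    post     : array of shape (n, k); post[t, m] = P(G_m | X_t, Theta), from the E-step
    n_values : n_values[i] = m_i, the number of distinct values of attribute i
    beta     : smoothing parameter
    """
    n, d = data.shape
    k = post.shape[1]
    alpha = post.mean(axis=0)  # average assignment probability per cluster
    p = []
    for m in range(k):
        denom = post[:, m].sum()
        rows = []
        for i in range(d):
            # w_ijm: posterior-weighted count of records with value j on attribute i
            w = np.bincount(data[:, i], weights=post[:, m], minlength=n_values[i])
            rows.append((w + beta) / (denom + n_values[i] * beta))
        p.append(rows)
    return alpha, p
```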


9.2.2 Clustering and Distance-Based Methods


Most of the clustering- and distance-based methods can be generalized easily from numerical to categorical data. The two main required modifications are as follows:





  1. Categorical data requires specialized clustering methods that are typically different from those of numerical data. This is discussed in detail in Chap. 7. Any of these models can be used to create the initial set of clusters. If a distance- or similarity-based clustering algorithm is used, then the same distance or similarity function should be used to compute the distance (similarity) of the candidate point to the cluster centroids.




  2. The choice of the similarity function is important in the context of categorical data, whether a centroid-based algorithm or a raw, distance-based algorithm is used. The distance functions in Sect. 3.2.2 of Chap. 3 can be very useful for this task; a small sketch using the simple match-based (overlap) measure appears after this list. The pruning tricks for distance-based algorithms are agnostic to the choice of distance function and can therefore be generalized to this case. Many of the local methods, such as LOF, can be generalized to this case as well with the use of this modified definition of distances.
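As one concrete possibility, the following sketch scores points with an exact k-nearest-neighbor criterion built on the simple match-based (overlap) measure; any of the other categorical similarity functions of Sect. 3.2.2 could be substituted, and the function names are illustrative:

```python
def overlap_similarity(x, y):
    """Match-based (overlap) similarity: the fraction of attributes on which
    the two categorical records x and y agree (cf. Sect. 3.2.2)."""
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)

def knn_outlier_score(point, data, k=5):
    """Exact k-nearest-neighbor outlier score: one minus the average similarity
    to the k most similar records in data (which should exclude point itself).
    Higher scores indicate stronger outliers."""
    sims = sorted((overlap_similarity(point, other) for other in data), reverse=True)
    return 1.0 - sum(sims[:k]) / k
```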


Therefore, clustering and distance-based methods can be generalized to the scenario of categorical data with relatively modest modifications.


9.2.3 Binary and Set-Valued Data


Binary data are a special kind of categorical data, which occur quite frequently in many real scenarios. The chapters on frequent pattern mining were based on this kind of data. Furthermore, both categorical and numerical data can always be converted to binary data. One common characteristic of this domain is that, while the number of attributes is large, the number of nonzero attribute values in a typical transaction is small.


Frequent pattern mining is used as a subroutine for the problem of outlier detection in these cases. The basic idea is that frequent patterns are much less likely to occur in outlier transactions. Therefore, one possible measure is the sum of the supports of all the frequent patterns occurring in a particular transaction. The total sum is normalized by dividing with the number of frequent patterns. This provides an outlier score for the transaction. Strictly speaking, the normalization can be omitted from the final score, because it is the same across all transactions.


Let $\mathcal{D}$ be a transaction database containing the transactions denoted by $T_1 \ldots T_N$. Let $s(X, \mathcal{D})$ represent the support of itemset $X$ in $\mathcal{D}$. Therefore, if $FPS(\mathcal{D}, s_m)$ represents the set of frequent patterns in the database $\mathcal{D}$ at minimum support level $s_m$, then the frequent pattern outlier factor $FPOF(T_i)$ of a transaction $T_i \in \mathcal{D}$ at minimum support $s_m$ is defined as follows:

$$FPOF(T_i) = \frac{\sum_{X \in FPS(\mathcal{D}, s_m), \; X \subseteq T_i} s(X, \mathcal{D})}{|FPS(\mathcal{D}, s_m)|}$$
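A direct transcription of this definition as a sketch; representing patterns as frozensets and supplying supports through a dictionary are illustrative choices, not part of the original description:

```python
def fpof(transactions, frequent_patterns, support):
    """Frequent pattern outlier factor of each transaction: the normalized sum of
    supports of the frequent patterns it contains (low FPOF suggests an outlier).

    transactions      : list of item collections T_1 ... T_N
    frequent_patterns : the set FPS(D, s_m) of frequent itemsets, as iterables
    support           : dict mapping frozenset(pattern) -> support s(X, D)
    """
    scores = []
    for t in transactions:
        items = set(t)
        # sum the supports of the frequent patterns contained in this transaction
        total = sum(support[frozenset(x)] for x in frequent_patterns
                    if frozenset(x) <= items)
        scores.append(total / len(frequent_patterns))
    return scores
```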

