9.2. OUTLIER DETECTION WITH CATEGORICAL DATA







The fit probability of the data point $\overline{X}$ to the $r$th component is given by $\alpha_r \cdot g_{r,\Theta}(\overline{X})$. Therefore, the sum of the fits over all components is given by $\sum_{r=1}^{k} \alpha_r \cdot g_{r,\Theta}(\overline{X})$. This fit represents the likelihood of the data point being generated by the model. Therefore, it is used as the outlier score. However, in order to compute this fit value, one needs to estimate the parameters $\Theta$. This is achieved with the EM algorithm.
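A minimal sketch of this scoring step, assuming that each mixture component generates the attributes independently, so that $g_{r,\Theta}(\overline{X})$ is a product of per-attribute categorical probabilities $p_{ijr}$ (as in the M-step described below); the names fit_score, alpha, and p are illustrative, not from the text:

```python
import numpy as np

def fit_score(x, alpha, p):
    """Mixture fit of a categorical record x; low values indicate outliers.

    x     : list of value indices, one per attribute
    alpha : array of shape (k,) with the component prior probabilities alpha_r
    p     : p[r][i][j] = probability that attribute i takes value j in component r
    """
    k = len(alpha)
    fit = 0.0
    for r in range(k):
        # g_{r,Theta}(x): product of the per-attribute probabilities in component r
        g = np.prod([p[r][i][x[i]] for i in range(len(x))])
        fit += alpha[r] * g
    return fit
```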
The assignment (or posterior) probability $P(G_m|\overline{X}, \Theta)$ for the $m$th cluster may be estimated as follows:

$$P(G_m|\overline{X}, \Theta) = \frac{\alpha_m \cdot g_{m,\Theta}(\overline{X})}{\sum_{r=1}^{k} \alpha_r \cdot g_{r,\Theta}(\overline{X})} \quad (9.2)$$

This step provides a soft assignment probability of the data point to a cluster, and it corresponds to the E-step.
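Under the same independence assumption as the earlier sketch, Eq. 9.2 amounts to normalizing the per-component fits; a sketch (e_step is an illustrative name):

```python
import numpy as np

def e_step(x, alpha, p):
    """E-step (Eq. 9.2): posterior P(G_m | x, Theta) for each component m."""
    k = len(alpha)
    joint = np.array([alpha[m] * np.prod([p[m][i][x[i]] for i in range(len(x))])
                      for m in range(k)])
    return joint / joint.sum()  # normalize over all k components
```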


The soft-assignment probability is used to estimate the probability $p_{ijm}$. While estimating the parameters for cluster $m$, the weight of a data point is assumed to be equal to its assignment probability $P(G_m|\overline{X}, \Theta)$ to cluster $m$. The value $\alpha_m$ is estimated to be the average assignment probability to cluster $m$ over all data points. For each cluster $m$, the weighted number $w_{ijm}$ of records for which the $i$th attribute takes on the $j$th possible discrete value in cluster $m$ is estimated. The value of $w_{ijm}$ is estimated as the sum of the posterior probabilities $P(G_m|\overline{X}, \Theta)$ over all records $\overline{X}$ in which the $i$th attribute takes on the $j$th value. Then, the value of the probability $p_{ijm}$ may be estimated as follows:




$$p_{ijm} = \frac{w_{ijm}}{\sum_{\overline{X} \in \mathcal{D}} P(G_m|\overline{X}, \Theta)} \quad (9.3)$$

When the number of data points is small, the estimation of Eq. 9.3 can be difficult to perform in practice. In such cases, some of the attribute values may not appear in a cluster at all (that is, $w_{ijm} = 0$). This situation can lead to poor parameter estimation, or overfitting. The Laplacian smoothing method is commonly used to address such ill-conditioned probabilities. Let $m_i$ be the number of distinct values of categorical attribute $i$. In Laplacian smoothing, a small value $\beta$ is added to the numerator, and $m_i \cdot \beta$ is added to the denominator of Eq. 9.3. Here, $\beta$ is a parameter that controls the level of smoothing. This form of smoothing is also sometimes applied in the estimation of the prior probabilities $\alpha_m$ when the data sets are very small. This completes the description of the M-step. As in the case of numerical data, the E- and M-steps are iterated to convergence. The maximum likelihood fit value is reported as the outlier score.
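A sketch of this M-step with Laplacian smoothing, under the same assumptions as the fragments above; m_step, post, and n_values are illustrative names:

```python
import numpy as np

def m_step(data, post, n_values, beta=1.0):
    """M-step with Laplacian smoothing (Eq. 9.3 plus the beta/m_i correction).

    data     : integer array of shape (n, d); data[t, i] is the value index
               taken by attribute i in record t
    post     : array of shape (n, k); post[t, m] = P(G_m | X_t, Theta), from the E-step
    n_values : n_values[i] = m_i, the number of distinct values of attribute i
    beta     : smoothing parameter
    """
    n, d = data.shape
    k = post.shape[1]
    alpha = post.mean(axis=0)  # average assignment probability per cluster
    p = []
    for m in range(k):
        denom = post[:, m].sum()
        rows = []
        for i in range(d):
            # w_ijm: posterior-weighted count of records with value j on attribute i
            w = np.bincount(data[:, i], weights=post[:, m], minlength=n_values[i])
            rows.append((w + beta) / (denom + n_values[i] * beta))
        p.append(rows)
    return alpha, p
```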


9.2.2 Clustering and Distance-Based Methods


Most of the clustering- and distance-based methods can be generalized easily from numerical to categorical data. The two main required modifications are as follows:





  1. Categorical data requires specialized clustering methods that are typically different from those of numerical data. This is discussed in detail in Chap. 7. Any of these models can be used to create the initial set of clusters. If a distance- or similarity-based clustering algorithm is used, then the same distance or similarity function should be used to compute the distance (similarity) of the candidate point to the cluster centroids.




  2. The choice of the similarity function is important in the context of categorical data, whether a centroid-based algorithm or a raw, distance-based algorithm is used. The distance functions in Sect. 3.2.2 of Chap. 3 can be very useful for this task; a small sketch using the simple match-based (overlap) measure appears after this list. The pruning tricks for distance-based algorithms are agnostic to the choice of distance function and can therefore be generalized to this case. Many of the local methods, such as LOF, can be generalized to this case as well with the use of this modified definition of distances.
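As one concrete possibility, the following sketch scores points with an exact k-nearest-neighbor criterion built on the simple match-based (overlap) measure; any of the other categorical similarity functions of Sect. 3.2.2 could be substituted, and the function names are illustrative:

```python
def overlap_similarity(x, y):
    """Match-based (overlap) similarity: the fraction of attributes on which
    the two categorical records x and y agree (cf. Sect. 3.2.2)."""
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)

def knn_outlier_score(point, data, k=5):
    """Exact k-nearest-neighbor outlier score: one minus the average similarity
    to the k most similar records in data (which should exclude point itself).
    Higher scores indicate stronger outliers."""
    sims = sorted((overlap_similarity(point, other) for other in data), reverse=True)
    return 1.0 - sum(sims[:k]) / k
```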


Therefore, clustering and distance-based methods can be generalized to the scenario of categorical data with relatively modest modifications.


9.2.3 Binary and Set-Valued Data


Binary data are a special kind of categorical data, which occur quite frequently in many real scenarios. The chapters on frequent pattern mining were based on this kind of data. Furthermore, both categorical and numerical data can always be converted to binary data. One common characteristic of this domain is that, while the number of attributes is large, the number of nonzero attribute values in a typical transaction is small.


Frequent pattern mining is used as a subroutine for the problem of outlier detection in these cases. The basic idea is that frequent patterns are much less likely to occur in outlier transactions. Therefore, one possible measure is the sum of the supports of all the frequent patterns occurring in a particular transaction. The total sum is normalized by dividing with the number of frequent patterns. This provides an outlier score for the transaction. Strictly speaking, the normalization can be omitted from the final score, because it is the same across all transactions.


Let $\mathcal{D}$ be a transaction database containing the transactions denoted by $T_1 \ldots T_N$. Let $s(X, \mathcal{D})$ represent the support of itemset $X$ in $\mathcal{D}$. Therefore, if $FPS(\mathcal{D}, s_m)$ represents the set of frequent patterns in the database $\mathcal{D}$ at minimum support level $s_m$, then the frequent pattern outlier factor $FPOF(T_i)$ of a transaction $T_i \in \mathcal{D}$ at minimum support $s_m$ is defined as follows:

$$FPOF(T_i) = \frac{\sum_{X \in FPS(\mathcal{D}, s_m), \; X \subseteq T_i} s(X, \mathcal{D})}{|FPS(\mathcal{D}, s_m)|}$$
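A direct transcription of this definition as a sketch; representing patterns as frozensets and supplying supports through a dictionary are illustrative choices, not part of the original description:

```python
def fpof(transactions, frequent_patterns, support):
    """Frequent pattern outlier factor of each transaction: the normalized sum of
    supports of the frequent patterns it contains (low FPOF suggests an outlier).

    transactions      : list of item collections T_1 ... T_N
    frequent_patterns : the set FPS(D, s_m) of frequent itemsets, as iterables
    support           : dict mapping frozenset(pattern) -> support s(X, D)
    """
    scores = []
    for t in transactions:
        items = set(t)
        # sum the supports of the frequent patterns contained in this transaction
        total = sum(support[frozenset(x)] for x in frequent_patterns
                    if frozenset(x) <= items)
        scores.append(total / len(frequent_patterns))
    return scores
```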

