Data Mining: The Textbook




The link-based merging criterion of ROCK normalizes the observed number of cross-links between a pair of clusters Ci and Cj by its expected value:

GroupLink(Ci, Cj) / E[CrossLinks(Ci, Cj)].   (7.3)
The expected number of cross-links between Ci and Cj can be computed as a function of the expected number of intracluster links Intra(·) in the individual clusters as follows:





E[CrossLinks(Ci, Cj )] = E[Intra(Ci ∪ Cj )] − E[Intra(Ci)] − E[Intra(Cj )].

(7.4)

The expected number of intracluster links is specific to a single cluster and is more easily estimated as a function of the cluster size qi and θ. The number of intracluster links in a cluster containing qi data points is heuristically estimated by the ROCK algorithm as qi^{1+2·f(θ)}. Here, the function f(θ) is a property of both the data set and the kind of clusters that one is interested in. The value of f(θ) is heuristically defined as follows:




f(θ) = (1 − θ)/(1 + θ).   (7.5)

Therefore, by substituting the expected number of cross-links in Eq. 7.3, one obtains the following merging criterion V (Ci, Cj ):





V(Ci, Cj) = GroupLink(Ci, Cj) / [(qi + qj)^{1+2·f(θ)} − qi^{1+2·f(θ)} − qj^{1+2·f(θ)}].   (7.6)
The denominator explicitly normalizes for the sizes of the clusters being merged by penalizing larger clusters. The goal of this kind of normalization is to prevent a biased preference toward successively merging only large clusters.


The merges are successively performed until a total of k clusters remain in the data. Because the agglomerative approach is applied only to a sample of the data, the remaining data points still need to be assigned to one of the clusters. This can be achieved by assigning each disk-resident data point to the cluster with which it has the greatest similarity. This similarity is computed using the same quality criterion in Eq. 7.6 as was used for cluster–cluster merges. In this case, similarity is computed between clusters and individual data points by treating each data point as a singleton cluster.
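The following is a minimal sketch of this merging and assignment logic, assuming that the cross-link count GroupLink(Ci, Cj) has already been computed from the shared-neighbor graph of the sampled points; the function and variable names are illustrative rather than ROCK's published pseudocode.

```python
def f(theta):
    # Heuristic exponent function of Eq. 7.5.
    return (1.0 - theta) / (1.0 + theta)

def merge_criterion(q_i, q_j, cross_links, theta):
    """Link-based quality V(Ci, Cj) of Eq. 7.6 for clusters of sizes q_i and q_j."""
    e = 1.0 + 2.0 * f(theta)
    expected = (q_i + q_j) ** e - q_i ** e - q_j ** e   # E[CrossLinks(Ci, Cj)], Eq. 7.4
    return cross_links / expected

def assign_point(point_links, cluster_sizes, theta):
    """Assign a disk-resident point, treated as a singleton cluster, to its best cluster.

    point_links[c]   -- number of links between the point and cluster c
    cluster_sizes[c] -- number of sampled points in cluster c
    """
    best, best_score = None, float("-inf")
    for c, links in point_links.items():
        score = merge_criterion(cluster_sizes[c], 1, links, theta)
        if score > best_score:
            best, best_score = c, score
    return best
```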




7.2.3 Probabilistic Algorithms


The probabilistic approach to data clustering is introduced in Sect. 6.5 of Chap. 6. Generative models can be generalized to virtually any data type as long as an appropriate generating probability distribution can be defined for each mixture component. This provides unprecedented flexibility in adapting probabilistic clustering algorithms to various data types. After the mixture distribution model has been defined, the E- and M-steps need to be defined for the corresponding expectation–maximization (EM) approach. The main difference from numeric clustering is that the soft assignment process in the E-step and the parameter estimation process in the M-step depend on the relevant probability distribution model for the corresponding data type.
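As a concrete illustration of this template, the sketch below shows the overall EM loop for a categorical mixture; the e_step and m_step callables correspond to the steps described in the remainder of this section, and the names fit_mixture, init_params, and tol are assumptions for the sketch rather than notation from the text.

```python
import numpy as np

def fit_mixture(data, k, init_params, e_step, m_step, max_iters=100, tol=1e-6):
    """data: (n, d) integer array of categorical value indices."""
    alpha = np.full(k, 1.0 / k)              # prior probabilities alpha_m, start uniform
    params = init_params(data, k)            # per-cluster value probabilities p_ijm
    prev_ll = -np.inf
    for _ in range(max_iters):
        posteriors, ll = e_step(data, alpha, params)   # soft assignment (Eq. 7.8)
        alpha, params = m_step(data, posteriors)       # parameter re-estimation (Eq. 7.9)
        if ll - prev_ll < tol:                         # stop when the log-likelihood stabilizes
            break
        prev_ll = ll
    return alpha, params, posteriors
```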


Let the k components of the mixture be denoted by G1 . . . Gk. Then, the generative process for each point in the data set D uses the following two steps:





  1. Select a mixture component with prior probability αi, where i ∈ {1 . . . k}.




  2. If the mth component of the mixture was selected in the first step, then generate a data point from Gm (a brief sketch of this process follows).
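A brief sketch of this two-step process is shown below, assuming that alpha[m] holds the prior probability of component m and that p[m][i] holds the categorical distribution of attribute i in component m (the probabilities pijm introduced next); the names are illustrative.

```python
import numpy as np

def generate_point(alpha, p, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    m = rng.choice(len(alpha), p=alpha)                     # step 1: select mixture component G_m
    x = [rng.choice(len(p_mi), p=p_mi) for p_mi in p[m]]    # step 2: draw each attribute value from G_m
    return np.array(x), m
```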

The values of αi denote the prior probabilities P(Gi), which need to be estimated along with the other model parameters in a data-driven manner. The main difference from the numerical case is in the mathematical form of the generative model for the mth cluster (or mixture component) Gm, which is now a discrete probability distribution rather than the probability density function used in the numeric case. This difference reflects the corresponding difference in data type. One reasonable choice for the discrete probability distribution of Gm is to assume that the jth categorical value of the ith attribute is independently generated by mixture component (cluster) m with probability pijm. Consider a data point X containing the attribute value indices j1 . . . jd for its d dimensions. In other words, the rth attribute takes on the jrth possible categorical value. For convenience, the entire set of model parameters is denoted by the generic notation Θ. Then, the discrete probability distribution gm,Θ(X) from cluster m is given by the following expression:











gm,Θ(X) = ∏_{r=1}^{d} p_{r jr m}.   (7.7)

The discrete probability distribution is gm,Θ(·), which is analogous to the continuous density function fm,Θ(·) of the EM model in the previous chapter. Correspondingly, the posterior probability P(Gm|X, Θ) of the component Gm having generated observed data point X may be estimated as follows:

















P(Gm|X, Θ) = αm · gm,Θ(X) / ∑_{r=1}^{k} αr · gr,Θ(X).   (7.8)

This defines the E-step for categorical data, and it provides a soft assignment probability of the data point to a cluster.
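A sketch of this E-step is shown below: it evaluates gm,Θ(X) as the product of per-attribute probabilities (Eq. 7.7) and normalizes αm · gm,Θ(X) across the k clusters (Eq. 7.8). The data layout p[m][i][j] = pijm matches the earlier sketches and is an assumption rather than the book's pseudocode.

```python
import numpy as np

def e_step(data, alpha, p):
    """data: (n, d) integer array; alpha: (k,) priors; p[m][i] is a NumPy array with p[m][i][j] = p_ijm."""
    n, d = data.shape
    k = len(alpha)
    post = np.zeros((n, k))
    for m in range(k):
        g = np.ones(n)
        for i in range(d):
            g *= p[m][i][data[:, i]]               # Eq. 7.7: product over the d attributes
        post[:, m] = alpha[m] * g                  # numerator of Eq. 7.8
    ll = np.log(post.sum(axis=1)).sum()            # log-likelihood, useful for convergence checks
    post /= post.sum(axis=1, keepdims=True)        # normalize to obtain P(G_m | X, Theta)
    return post, ll
```

For high-dimensional data the product in Eq. 7.7 can underflow, so a practical implementation would accumulate log-probabilities instead; the sketch keeps the direct form for readability.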


After the soft assignment probability has been determined, the M-step applies maximum likelihood estimation to the individual components of the mixture to estimate the probability pijm. While estimating the parameters for cluster m, the weight of a record is assumed to be equal to its assignment probability P(Gm|X, Θ) to cluster m. For each cluster m, the weighted number wijm of data points for which attribute i takes on its jth possible categorical value is estimated. This is equal to the sum of the assignment probabilities (to cluster m) of data points that do take on the jth value. By dividing this value by the aggregate assignment probability of all data points to cluster m, the probability pijm may be estimated as follows:




pijm = wijm / ∑_{X∈D} P(Gm|X, Θ).   (7.9)

The parameter αm is estimated as the average assignment probability of data points to cluster m. The aforementioned formulas may be derived using maximum likelihood estimation methods. Refer to the bibliographic notes for detailed derivations.


Sometimes, the estimation of Eq. 7.9 can be inaccurate because the available data may be limited, or particular values of categorical attributes may be rare. In such cases, some of the attribute values may not appear in a cluster (or wijm ≈ 0). This can lead to poor parameter estimation, or overfitting. The Laplacian smoothing method is commonly used to address such ill-conditioned probabilities. This is achieved by adding a small positive value β to the estimated values of wijm, where β is the smoothing parameter. This will generally lead to more robust estimation. This type of smoothing is also applied in the estimation of the prior probabilities αm when the data sets are very small. This completes the description of the M-step. As in the case of numerical data, the E-step and M-step are iterated to convergence.
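A sketch of this M-step, including Laplacian smoothing, is shown below. Adding β times the number of attribute values to the denominator (so that the smoothed probabilities still sum to one) is one common convention and an assumption of this sketch, as are the function and parameter names.

```python
import numpy as np

def m_step(data, post, beta=1e-2):
    """data: (n, d) integer array; post: (n, k) soft assignments from the E-step."""
    n, d = data.shape
    k = post.shape[1]
    cluster_weight = post.sum(axis=0)              # aggregate assignment probability per cluster
    alpha = cluster_weight / n                     # alpha_m: average assignment probability
    p = []
    for m in range(k):
        rows = []
        for i in range(d):
            n_vals = int(data[:, i].max()) + 1
            w = np.zeros(n_vals)
            np.add.at(w, data[:, i], post[:, m])   # w_ijm: sum of posteriors over matching records
            rows.append((w + beta) / (cluster_weight[m] + beta * n_vals))  # smoothed Eq. 7.9
        p.append(rows)
    return alpha, p
```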

7.2.4 Graph-Based Algorithms

Because graph-based methods are meta-algorithms, the broad description of these algorithms remains virtually the same for categorical data as for numeric data. Therefore, the approach described in Sect. 6.7 of the previous chapter applies to this case as well. The only difference is in terms of how the edges and values on the similarity graph are constructed. The first step is the determination of the k-nearest neighbors of each data record, and the subsequent assignment of similarity values to edges. Any of the similarity functions described in Sect. 3.2.2 of Chap. 3 can be used to compute similarity values along the edges of the graph. These similarity measures could include the inverse occurrence frequency measure discussed in Chap. 3, which corrects for the natural skew in the different attribute values. As discussed in the previous chapter, one of the major advantages of graph-based algorithms is that they can be leveraged for virtually any kind of data type as long as a similarity function can be defined on that data type.
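A sketch of this graph construction is shown below. It scores a pair of records by summing, over matching attributes, a weight inversely proportional to the frequency of the shared value, which is one simple variant of the inverse-occurrence-frequency idea; the function name and edge-list output format are illustrative assumptions.

```python
import numpy as np

def knn_similarity_graph(data, k):
    """data: (n, d) integer array of categorical value indices; returns weighted k-NN edges."""
    n, d = data.shape
    sim = np.zeros((n, n))
    for i in range(d):
        col = data[:, i]
        counts = np.bincount(col).astype(float)
        inv = np.zeros_like(counts)
        np.divide(1.0, counts, out=inv, where=counts > 0)   # rarer matching values count more
        match = col[:, None] == col[None, :]                # records agreeing on attribute i
        sim += match * inv[col][:, None]
    np.fill_diagonal(sim, -np.inf)                          # exclude self-edges
    edges = []
    for u in range(n):
        for v in np.argsort(sim[u])[-k:]:                   # k most similar neighbors of record u
            edges.append((u, int(v), float(sim[u, v])))
    return edges
```

The dense O(n²·d) similarity computation is only for illustration; once the weighted k-nearest-neighbor graph is available, any of the graph clustering methods of Sect. 6.7 can be applied unchanged.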


7.3 Scalable Data Clustering

In many applications, the size of the data is very large. Typically, the data cannot be stored in main memory and needs to reside on disk. This is a significant challenge, because it imposes a constraint on the algorithmic design of clustering algorithms. This section will discuss the CLARANS, BIRCH, and CURE algorithms. These algorithms are all scalable implementations of one of the basic types of clustering algorithms discussed in the previous chapter. For example, the CLARANS approach is a scalable implementation of the k-medoids algorithm for clustering. The BIRCH algorithm is a top-down hierarchical generalization of the k-means algorithm. The CURE algorithm is a bottom-up agglomerative approach to clustering. These different algorithms inherit the advantages and disadvantages of the base classes of algorithms from which they are generalized. For example, while the CLARANS algorithm has the advantage of being more easily generalizable to different data types (beyond numeric data), it inherits the relatively high computational complexity of the underlying k-medoids approach.



