$$\frac{\text{GroupLink}(C_i, C_j)}{E[\text{CrossLinks}(C_i, C_j)]}. \qquad (7.3)$$
The expected number of cross-links between Ci and Cj can be computed as a function of the expected number of intracluster links Intra(·) in the individual clusters as follows:
$$E[\text{CrossLinks}(C_i, C_j)] = E[\text{Intra}(C_i \cup C_j)] - E[\text{Intra}(C_i)] - E[\text{Intra}(C_j)]. \qquad (7.4)$$
The expected number of intracluster links is specific to a single cluster and is more easily estimated as a function of the cluster size $q_i$ and $\theta$. The number of intracluster links in a cluster containing $q_i$ data points is heuristically estimated by the ROCK algorithm as $q_i^{1+2\cdot f(\theta)}$. Here, the function $f(\theta)$ is a property of both the data set and the kind of clusters that one is interested in. The value of $f(\theta)$ is heuristically defined as follows:
$$f(\theta) = \frac{1 - \theta}{1 + \theta}. \qquad (7.5)$$
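For example, when $\theta = 0.5$, this gives $f(\theta) = 0.5/1.5 = 1/3$, so a cluster containing $q_i = 9$ points is heuristically expected to contain about $9^{1 + 2/3} = 9^{5/3} \approx 39$ intracluster links.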
Therefore, by substituting the expected number of cross-links in Eq. 7.3, one obtains the following merging criterion V(Ci, Cj):
$$V(C_i, C_j) = \frac{\text{GroupLink}(C_i, C_j)}{(q_i + q_j)^{1+2\cdot f(\theta)} - q_i^{1+2\cdot f(\theta)} - q_j^{1+2\cdot f(\theta)}}. \qquad (7.6)$$
The denominator explicitly normalizes for the sizes of the clusters being merged by penalizing larger clusters. The goal of this kind of normalization is to prevent the imbalanced preference toward successively merging only large clusters.
The merges are successively performed until a total of k clusters remain in the data. Because the agglomerative approach is applied only to a sample of the data, it remains to assign the remaining data points to one of the clusters. This can be achieved by assigning each disk-resident data point to the cluster with which it has the greatest similarity. This similarity is computed using the same quality criterion in Eq. 7.6 as was used for cluster–cluster merges. In this case, similarity is computed between clusters and individual data points by treating each data point as a singleton cluster.
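As an illustration, the following Python sketch evaluates the merging criterion of Eq. 7.6 from the cluster sizes, the group link count, and θ. The function and parameter names are hypothetical and are not taken from any reference ROCK implementation; the sketch also shows how a single disk-resident point can be scored as a singleton cluster.

```python
# Illustrative sketch of the ROCK merging criterion (Eqs. 7.4-7.6).
# All names (rock_merge_criterion, group_link, ...) are hypothetical.

def f(theta):
    # Heuristic exponent function of Eq. 7.5: f(theta) = (1 - theta) / (1 + theta)
    return (1.0 - theta) / (1.0 + theta)

def expected_cross_links(q_i, q_j, theta):
    # Eq. 7.4, using the heuristic estimate q^(1 + 2 f(theta)) for the
    # expected number of intracluster links in a cluster of size q:
    # E[CrossLinks] = E[Intra(Ci u Cj)] - E[Intra(Ci)] - E[Intra(Cj)]
    e = 1.0 + 2.0 * f(theta)
    return (q_i + q_j) ** e - q_i ** e - q_j ** e

def rock_merge_criterion(group_link, q_i, q_j, theta):
    # Eq. 7.6: observed cross-links normalized by their expected number,
    # which penalizes merges between large clusters.
    return group_link / expected_cross_links(q_i, q_j, theta)

if __name__ == "__main__":
    theta = 0.5
    # Candidate merge of two sampled clusters of sizes 40 and 25.
    print(rock_merge_criterion(group_link=120, q_i=40, q_j=25, theta=theta))
    # Assignment of one disk-resident point, treated as a singleton cluster (q_j = 1).
    print(rock_merge_criterion(group_link=3, q_i=40, q_j=1, theta=theta))
```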
7.2.3 Probabilistic Algorithms
The probabilistic approach to data clustering is introduced in Sect. 6.5 of Chap. 6. Generative models can be generalized to virtually any data type as long as an appropriate generating probability distribution can be defined for each mixture component. This provides unprecedented flexibility in adapting probabilistic clustering algorithms to various data types. After the mixture distribution model has been defined, the E- and M-steps need to be defined for the corresponding expectation–maximization (EM) approach. The main difference from numeric clustering is that the soft assignment process in the E-step and the parameter estimation process in the M-step will depend on the relevant probability distribution model for the corresponding data type.
Let the k components of the mixture be denoted by G1 . . . Gk. Then, the generative process for each point in the data set D uses the following two steps:
1. Select a mixture component with prior probability αi, where i ∈ {1 . . . k}.
2. If the mth component of the mixture was selected in the first step, then generate a data point from Gm.
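For concreteness, this two-step generative process can be sketched in Python as follows. The parameter names and the encoding of categorical values as integer indices are assumptions made for illustration; p[i][m, j] plays the role of the probability pijm introduced below.

```python
import numpy as np

def generate_point(alpha, p, rng=None):
    """Sample one categorical data point from the mixture.

    alpha : length-k array of prior probabilities of the k components
    p     : list of d arrays, where p[i][m, j] is the probability that
            component m generates the j-th value of attribute i
    Returns the index of the chosen component and the generated point
    (one value index per attribute).
    """
    rng = rng or np.random.default_rng()
    m = rng.choice(len(alpha), p=alpha)            # step 1: pick component G_m
    point = [rng.choice(p_i.shape[1], p=p_i[m])    # step 2: draw each attribute value
             for p_i in p]
    return m, point
```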
The values of αi denote the prior probabilities P(Gi), which need to be estimated along with the other model parameters in a data-driven manner. The main difference from the numerical case is in the mathematical form of the generative model for the mth cluster (or mixture component) Gm, which is now a discrete probability distribution rather than the probability density function used in the numeric case. This difference reflects the corresponding difference in data type. One reasonable choice for the discrete probability distribution of Gm is to assume that the jth categorical value of the ith attribute is independently generated by mixture component (cluster) m with probability pijm. Consider a data point X containing the attribute value indices j1 . . . jd for its d dimensions. In other words, the rth attribute takes on the jrth possible categorical value. For convenience, the entire set of model parameters is denoted by the generic notation Θ. Then, the discrete probability distribution gm,Θ(X) from cluster m is given by the following expression:
$$g_{m,\Theta}(X) = \prod_{r=1}^{d} p_{r\,j_r\,m}. \qquad (7.7)$$
The discrete probability distribution is gm,Θ(·), which is analogous to the continuous density function fm,Θ(·) of the EM model in the previous chapter. Correspondingly, the posterior probability P(Gm|X, Θ) of the component Gm having generated observed data point X may be estimated as follows:
$$P(G_m \mid X, \Theta) = \frac{\alpha_m \cdot g_{m,\Theta}(X)}{\sum_{r=1}^{k} \alpha_r \cdot g_{r,\Theta}(X)}. \qquad (7.8)$$
This defines the E-step for categorical data, and it provides a soft assignment probability of the data point to a cluster.
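A minimal sketch of this E-step, combining Eqs. 7.7 and 7.8 under the same assumed parameter layout as in the sampling sketch above (p[i][m, j] standing for pijm), might look as follows:

```python
import numpy as np

def e_step_posteriors(point, alpha, p):
    """Soft assignment of one data point to the k clusters (Eqs. 7.7 and 7.8).

    point : list of d value indices (point[i] = j means attribute i takes value j)
    alpha : length-k array of prior probabilities alpha_m
    p     : list of d arrays with p[i][m, j] standing for p_ijm
    """
    k = len(alpha)
    # Eq. 7.7: g_{m,Theta}(X) as a product of per-attribute probabilities
    g = np.array([np.prod([p[i][m, j] for i, j in enumerate(point)])
                  for m in range(k)])
    # Eq. 7.8: normalize alpha_m * g_m over all k mixture components
    weights = np.asarray(alpha) * g
    return weights / weights.sum()
```

In practice, the product of Eq. 7.7 is often evaluated in log-space to avoid numerical underflow when the dimensionality d is large.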
After the soft assignment probability has been determined, the M-step applies maximum likelihood estimation to the individual components of the mixture to estimate the probability pijm. While estimating the parameters for cluster m, the weight of a record is assumed to be equal to its assignment probability P(Gm|X, Θ) to cluster m. For each cluster m, the weighted number wijm of data points for which attribute i takes on its jth possible categorical value is estimated. This is equal to the sum of the assignment probabilities (to
cluster m) of data points that do take on the jth value. By dividing this value with the aggregate assignment probability of all data points to cluster m, the probability pijm may
be estimated as follows:
$$p_{ijm} = \frac{w_{ijm}}{\sum_{X \in \mathcal{D}} P(G_m \mid X, \Theta)}. \qquad (7.9)$$
The parameter αm is estimated as the average assignment probability of data points to cluster m. The aforementioned formulas for estimation may be derived from maximum likelihood estimation methods. Refer to the bibliographic notes for detailed derivations.
Sometimes, the estimation of Eq. 7.9 can be inaccurate because the available data may be limited, or particular values of categorical attributes may be rare. In such cases, some of the attribute values may not appear in a cluster (or wijm ≈ 0). This can lead to poor parameter estimation, or overfitting. The Laplacian smoothing method is commonly used to address such ill-conditioned probabilities. This is achieved by adding a small positive value
β to the estimated values of wijm, where β is the smoothing parameter. This will generally lead to more robust estimation. This type of smoothing is also applied in the estimation of the prior probabilities αm when the data sets are very small. This completes the description of the M-step. As in the case of numerical data, the E-step and M-step are iterated to convergence.
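The M-step can be sketched as follows. The data layout and names are illustrative assumptions; in particular, the denominator is adjusted by β times the number of values of the attribute so that the smoothed probabilities still sum to one, which is a normalization choice of this sketch since the text only specifies the adjustment of wijm.

```python
import numpy as np

def m_step(data, posteriors, n_values, beta=0.01):
    """Re-estimate the model parameters from the soft assignments.

    data       : integer array of shape (n, d); data[x, i] is the value index
                 of attribute i for point x
    posteriors : array of shape (n, k); posteriors[x, m] = P(G_m | X, Theta)
    n_values   : list of length d with the number of distinct values per attribute
    beta       : Laplacian smoothing parameter
    """
    n, d = data.shape
    k = posteriors.shape[1]
    alpha = posteriors.mean(axis=0)    # average assignment probability per cluster
    total = posteriors.sum(axis=0)     # aggregate assignment weight per cluster
    p = []
    for i in range(d):
        # w[m, j]: weighted number of points in cluster m taking value j on attribute i
        w = np.zeros((k, n_values[i]))
        for x in range(n):
            w[:, data[x, i]] += posteriors[x]
        # Eq. 7.9 with Laplacian smoothing applied to w_ijm
        p.append((w + beta) / (total[:, None] + beta * n_values[i]))
    return alpha, p
```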
7.2.4 Graph-Based Algorithms
Because graph-based methods are meta-algorithms, the broad description of these algorithms remains virtually the same for categorical data as for numeric data. Therefore, the approach described in Sect. 6.7 of the previous chapter applies to this case as well. The only difference is in terms of how the edges and values on the similarity graph are constructed. The first step is the determination of the k-nearest neighbors of each data record, and subsequent assignment of similarity values to edges. Any of the similarity functions described in Sect. 3.2.2 of Chap. 3 can be used to compute similarity values along the edges of the graph. These similarity measures could include the inverse occurrence frequency measure discussed in Chap. 3, which corrects for the natural skew in the different attribute values. As discussed in the previous chapter, one of the major advantages of graph-based algorithms is that they can be leveraged for virtually any kind of data type as long as a similarity function can be defined on that data type.
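A sketch of this graph-construction step is shown below. It uses a simple match-based similarity between categorical records; the inverse occurrence frequency measure of Chap. 3 could be substituted for the per-attribute match score. The function names are illustrative assumptions.

```python
import numpy as np

def knn_similarity_graph(data, k_neighbors):
    """Build a k-nearest-neighbor similarity graph over categorical records.

    data : integer array of shape (n, d) of categorical value indices.
    Returns a list of weighted edges (x, y, similarity), where similarity is
    the fraction of matching attributes between records x and y.
    """
    n, d = data.shape
    edges = []
    for x in range(n):
        sims = (data == data[x]).sum(axis=1) / d   # match-based similarity to all records
        sims[x] = -1.0                              # exclude the self-edge
        for y in np.argsort(-sims)[:k_neighbors]:   # k most similar records
            edges.append((x, int(y), float(sims[y])))
    return edges
```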
7.3 Scalable Data Clustering
In many applications, the size of the data is very large. Typically, the data cannot be stored in main memory, and it needs to reside on disk. This is a significant challenge, because it imposes a constraint on the algorithmic design of clustering algorithms. This section will discuss the CLARANS, BIRCH, and CURE algorithms. These algorithms are all scalable implementations of one of the basic types of clustering algorithms discussed in the previous chapter. For example, the CLARANS approach is a scalable implementation of the k-medoids algorithm for clustering. The BIRCH algorithm is a top-down hierarchical generalization of the k-means algorithm. The CURE algorithm is a bottom-up agglomerative approach to clustering. These different algorithms inherit the advantages and disadvantages of the base classes of algorithms that they are generalized from. For example, while the CLARANS algorithm has the advantage of being more easily generalizable to different data types (beyond numeric data), it inherits the relatively high computational complexity