7.2. CLUSTERING CATEGORICAL DATA
|
209
|
each data point with the inverse occurrence frequency of the corresponding attribute value. With normalized modes and weights associated with each attribute of each data point, the straightforward match-based similarity computation will provide effective results.
7.2.1.2 k-Medoids Clustering
The medoid-based clustering algorithms are easier to generalize to categorical data sets because the representative data point is chosen from the input database. The broad descrip-tion of the medoids approach remains the same as that described in Sect. 6.3.4 of the pre-vious chapter. The only difference is in terms of how the similarity is computed between a pair of categorical data points, as compared to numeric data. Any of the similarity functions discussed in Sect. 3.2.2 of Chap. 3 can be used for this purpose. As in the case of k-modes clustering, because the representative is also a categorical data point (as opposed to a his-togram), it is easier to directly use the categorical similarity functions of Chap. 3. These include the use of inverse occurrence frequency-based similarity functions that normalize for the skew across different attribute values.
7.2.2 Hierarchical Algorithms
Hierarchical algorithms are discussed in Sect. 6.4 of Chap. 6. Agglomerative bottom-up algorithms have been used successfully for categorical data. The approach in Sect. 6.4 has been described in a general way with a distance matrix of values. As long as a distance (or similarity) matrix can be defined for the case of categorical attributes, most of the algorithms discussed in the previous chapter can be easily applied to this case. An interesting hierarchical algorithm that works well for categorical data is ROCK.
7.2.2.1 ROCK
The ROCK (RObust Clustering using lin Ks) algorithm is based on an agglomerative bottom-up approach in which the clusters are merged on the basis of a similarity crite-rion. The ROCK algorithm uses a criterion that is based on the shared nearest-neighbor metric. Because agglomerative methods are somewhat expensive, the ROCK method applies the approach to only a sample of data points to discover prototype clusters. The remaining data points are assigned to one of these prototype clusters in a final pass.
The first step of the ROCK algorithm is to convert the categorical data to a binary representation using the binarization approach introduced in Chap. 2. For each value vj of categorical attribute i, a new pseudo-item is created, which has a value of 1, only if attribute i takes on the value vj . Therefore, if the ith attribute in a d-dimensional categorical data set
has ni different values, such an approach will create a binary data set with
|
d
|
|
i=1 ni binary
|
|
attributes. When the value of each ni is high, this binary data set will be sparse, and it will resemble a market basket data set. Thus, each data record can be treated as a binary transaction, or a set of items. The similarity between the two transactions is computed with the use of the Jaccard coefficient between the corresponding sets:
Dostları ilə paylaş: |