Low values of the Gini index are desirable. The notion of the Gini index is closely related to the notion of the entropy $E_j$ (of algorithm-determined cluster $j$), which measures the same intuitive characteristics of the data:

$$E_j = -\sum_{i=1}^{k_t} \frac{m_{ij}}{M_j} \cdot \log\left(\frac{m_{ij}}{M_j}\right). \qquad (6.52)$$
Lower values of the entropy are indicative of a higher-quality clustering. The overall entropy is computed in a similar way to the Gini index, with the use of cluster-specific entropies:

$$E_{average} = \frac{\sum_{j=1}^{k_d} E_j \cdot M_j}{\sum_{j=1}^{k_d} M_j}. \qquad (6.53)$$
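As a concrete illustration, the following sketch computes the cluster-specific entropies of Equation 6.52 and the weighted average of Equation 6.53 from a contingency matrix of counts $m_{ij}$. The function names and the small example matrix are hypothetical and not part of the text; this is only a minimal sketch of the computation.

```python
import numpy as np

def cluster_entropy(counts):
    """Entropy E_j of one algorithm-determined cluster (Eq. 6.52).

    counts: the m_ij values for this cluster j, i.e., the number of points
    from each ground-truth cluster i that fall in cluster j.
    """
    M_j = counts.sum()
    p = counts[counts > 0] / M_j          # skip zero counts, since 0 * log 0 = 0
    return -np.sum(p * np.log(p))

def overall_entropy(contingency):
    """Weighted average entropy over all clusters (Eq. 6.53).

    contingency: k_t x k_d matrix whose (i, j) entry is m_ij.
    """
    M = contingency.sum(axis=0)           # M_j: number of points in cluster j
    E = np.array([cluster_entropy(contingency[:, j])
                  for j in range(contingency.shape[1])])
    return np.sum(E * M) / np.sum(M)

# Hypothetical contingency matrix: rows are ground-truth clusters,
# columns are algorithm-determined clusters.
m = np.array([[50, 5],
              [3, 40]])
print(overall_entropy(m))   # close to 0 for a nearly pure clustering
```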
Finally, a pairwise precision and pairwise recall measure can be used to evaluate the quality of a clustering. To compute these measures, all pairs of data points within the same algorithm-determined cluster are generated. The fraction of such pairs that belong to the same ground-truth cluster is the precision. To determine the recall, pairs of points within the same ground-truth cluster are sampled, and the fraction that appear in the same algorithm-determined cluster is computed. A unified measure is the Fowlkes-Mallows measure, which reports the geometric mean of the precision and recall.
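A minimal sketch of this pairwise computation, under the assumption of small data sets where all pairs can be enumerated, is shown below. The helper function and the toy label vectors are hypothetical; the code simply counts pair agreements and reports the geometric mean of precision and recall as the Fowlkes-Mallows measure.

```python
import numpy as np
from itertools import combinations

def pairwise_precision_recall_fm(labels_true, labels_pred):
    """Pairwise precision, pairwise recall, and the Fowlkes-Mallows measure.

    labels_true: ground-truth cluster label of each point.
    labels_pred: algorithm-determined cluster label of each point.
    """
    same_true = same_pred = both = 0
    for a, b in combinations(range(len(labels_true)), 2):
        t = labels_true[a] == labels_true[b]      # pair shares a ground-truth cluster
        p = labels_pred[a] == labels_pred[b]      # pair shares an algorithm cluster
        same_true += t
        same_pred += p
        both += t and p
    precision = both / same_pred if same_pred else 0.0
    recall = both / same_true if same_true else 0.0
    return precision, recall, np.sqrt(precision * recall)

# Hypothetical labels for six points.
truth = [0, 0, 0, 1, 1, 1]
found = [0, 0, 1, 1, 1, 1]
print(pairwise_precision_recall_fm(truth, found))
```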
6.9.3 General Comments
Although cluster validation is a widely studied problem in the clustering literature, most methods for cluster validation are rather imperfect. Internal measures are imperfect because they are typically biased toward one algorithm or the other. External measures are imperfect because they work with class labels that may not reflect the true clusters in the data. Even when synthetic data is generated, the method of generation will implicitly favor one algorithm or the other. These challenges arise because clustering is an unsupervised problem, and it is notoriously difficult to validate the quality of such algorithms. Often, the only true measure of clustering quality is its ability to meet the goals of a specific application.
6.10 Summary
A wide variety of algorithms have been designed for the problem of data clustering, such as representative-based methods, hierarchical methods, probabilistic methods, density-based methods, graph-based methods, and matrix factorization-based methods. All methods typically require the algorithm to specify some parameters, such as the number of clusters, the density, or the rank of the matrix factorization. Representative-based methods and probabilistic methods restrict the shape of the clusters but adjust better to varying cluster density. On the other hand, agglomerative and density-based methods adjust better to the shape of the clusters but do not adjust to varying density of the clusters. Graph-based methods provide the best adjustment to varying shape and density but are typically more expensive to implement. The problem of cluster validation is a notoriously difficult one for unsupervised problems, such as clustering. Although external and internal validation criteria are available for clustering, they are often biased toward different algorithms, or may not accurately reflect the internal clusters in the underlying data. Such measures should be used with caution.
6.11 Bibliographic Notes
The problem of clustering has been widely studied in the data mining and machine learning literature. The classical books [74, 284, 303] discuss most of the traditional clustering methods. These books present many of the classical algorithms, such as the partitioning and hierarchical algorithms, in great detail. Another book [219] discusses more recent methods for data clustering. An excellent survey on data clustering may be found in [285]. The most recent book [32] in the literature provides a very comprehensive overview of the different data clustering algorithms. A detailed discussion on feature selection methods is provided in [366]. The distance-based entropy measure is discussed in [169]. Various validity measures derived from spectral clustering and the cluster scatter matrix can be used for feature selection [262, 350, 550]. The second chapter in the clustering book [32] provides a detailed review of feature selection methods.
A classical survey [285] provides an excellent review of k-means algorithms. The problem of refining the initial data points for k-means type algorithms is discussed in [108]. The problem of discovering the correct number of clusters in a k-means algorithm is addressed in [423]. Other notable criteria for representative algorithms include the use of Bregman divergences [79].
The three main density-based algorithms presented in this chapter are STING [506], DBSCAN [197], and DENCLUE [267]. The faster update rule for DENCLUE appears in [269]. The faster update rule was independently discovered earlier in [148, 159] as mean-shift clustering. Among the grid-based algorithms, the most common ones include WaveCluster [464] and MAFIA [231]. The incremental version of DBSCAN is addressed in [198]. The OPTICS algorithm [76] performs density-based clustering based on ordering of the data points. It is also useful for hierarchical clustering and visualization. Another variation of the DBSCAN algorithm is the GDBSCAN method [444] that can work with more general kinds of data.
One of the most well-known graph-based algorithms is the Chameleon algorithm [300]. Shared nearest neighbor algorithms [195] are inherently graph-based algorithms, and adjust well to the varying density in different data localities. A well-known top-down hierarchical multilevel clustering algorithm is the METIS algorithm [301]. An excellent survey on spectral clustering methods may be found in [371]. Matrix factorization and its variations [288, 440, 456] are closely related to spectral clustering [185]. Methods for community detection in graphs are discussed in [212]. Any of these methods can be used for the last phase of graph-based clustering algorithms. Cluster validity methods are discussed in [247, 248]. In addition, the problem of cluster validity is studied in detail in [32].
6.12 Exercises
1. Consider the 1-dimensional data set with 10 data points {1, 2, 3, ..., 10}. Show three iterations of the k-means algorithm when k = 2, and the random seeds are initialized to {1, 2}.
2. Repeat Exercise 1 with an initial seed set of {2, 9}. How did the different choice of the seed set affect the quality of the results?
3. Write a computer program to implement the k-representative algorithm. Use a modular program structure, in which the distance function and centroid determination are separate subroutines. Instantiate these subroutines to the cases of (i) the k-means algorithm, and (ii) the k-medians algorithm.
4. Implement the Mahalanobis k-means algorithm.
5. Consider the 1-dimensional data set {1 ... 10}. Apply a hierarchical agglomerative approach, with the use of the minimum, maximum, and group average criteria for merging. Show the first six merges.