Sum of square distances to centroids: In this case, the centroids of the different clusters are determined, and the sum of squared (SSQ) distances are reported as the corre-sponding objective function. Smaller values of this measure are indicative of better cluster quality. This measure is obviously more optimized to distance-based algo-rithms, such as k-means, as opposed to a density-based method, such as DBSCAN. Another problem with SSQ is that the absolute distances provide no meaningful infor-mation to the user about the quality of the underlying clusters.
Intracluster to intercluster distance ratio: This measure is more detailed than the SSQ measure. The idea is to sample r pairs of data points from the underlying data. Of these, let P be the set of pairs that belong to the same cluster found by the algorithm. The remaining pairs are denoted by set Q. The average intercluster distance and intracluster distance are defined as follows:
Intra =
|
|
|
|
|
dist(
|
|
|
,
|
|
|
)/|P |
|
(6.43)
|
|
|
|
|
|
Xi
|
Xj
|
|
|
(Xi,Xj )∈P
|
|
|
Inter =
|
|
|
|
|
dist(
|
|
,
|
|
)/|Q|.
|
(6.44)
|
|
|
|
|
|
Xi
|
Xj
|
|
|
(Xi,Xj )∈Q
|
|
|
Then the ratio of the average intracluster distance to the intercluster distance is given by Intra/Inter. Small values of this measure indicate better clustering behavior.
Silhouette coefficient: Let Davgiin be the average distance of Xi to data points within the cluster of Xi. The average distance of data point Xi to the points in each cluster (other than its own) is also computed. Let Dminouti represent the minimum of these
6.9. CLUSTER VALIDATION
|
197
|
(average) distances, over the other clusters. Then, the silhouette coefficient Si specific to the ith object, is as follows:
Dminout − Davgin
Si = max{ i out i in} . (6.45)
Dmini , Davgi
The overall silhouette coefficient is the average of the data point-specific coefficients. The silhouette coefficient will be drawn from the range (−1, 1). Large positive values indicate highly separated clustering, and negative values are indicative of some level of “mixing” of data points from different clusters. This is because Dminouti will be less than Davgiin only in cases where data point Xi is closer to at least one other cluster than its own cluster. One advantage of this coefficient is that the absolute values provide a good intuitive feel of the quality of the clustering.
Probabilistic measure: In this case, the goal is to use a mixture model to estimate the quality of a particular clustering. The centroid of each mixture component is assumed to be the centroid of each discovered cluster, and the other parameters of each component (such as the covariance matrix) are computed from the discovered clustering using a method similar to the M-step of EM algorithms. The overall log-likelihood of the measure is reported. Such a measure is useful when it is known from domain-specific knowledge that the clusters ought to have a specific shape, as is suggested by the distribution of each component in the mixture.
The major problem with internal measures is that they are heavily biased toward particular clustering algorithms. For example, a distance-based measure, such as the silhouette coeffi-cient, will not work well for clusters of arbitrary shape. Consider the case of the clustering in Fig. 6.11. In this case, some of the point- specific coefficients might have a negative value for the correct clustering. Even the overall silhouette coefficient for the correct clustering might not be as high as an incorrect k-means clustering, which mixes points from different clusters. This is because the clusters in Fig. 6.11 are of arbitrary shape that do not conform to the quality metrics of distance-based measures. On the other hand, if a density-based criterion were designed, it would also be biased toward density-based algorithms. The major problem in relative comparison of different methodologies with internal criteria is that all criteria attempt to define a “prototype” model for goodness. The quality measure very often
only tells us how well the prototype validation model matches the model used for discovering clusters, rather than anything intrinsic about the underlying clustering. This can be viewed as a form of overfitting, which significantly affects such evaluations. At the very least, this phenomenon creates uncertainty about the reliability of the evaluation, which defeats the purpose of evaluation in the first place. This problem is fundamental to the unsupervised nature of data clustering, and there are no completely satisfactory solutions to this issue.
Internal validation measures do have utility in some practical scenarios. For example, they can be used to compare clusterings by a similar class of algorithms, or different runs of the same algorithm. Finally, these measures are also sensitive to the number of clusters found by the algorithm. For example, two different clusterings cannot be compared on a particular criterion when the number of clusters determined by different algorithms is different. A fine-grained clustering will typically be associated with superior values of many internal qualitative measures. Therefore, these measures should be used with great caution, because of their tendency to favor specific algorithms, or different settings of the same algorithm. Keep in mind that clustering is an unsupervised problem, which, by definition, implies that there is no well-defined notion of a “correct” model of clustering in the absence of external criteria.
198 CHAPTER 6. CLUSTER ANALYSIS
Figure 6.24: Inflection points in validity measures for parameter tuning
6.9.1.1 Parameter Tuning with Internal Measures
All clustering algorithms use a number of parameters as input, such as the number of clusters or the density. Although internal measures are inherently flawed, a limited amount of parameter tuning can be performed with these measures. The idea here is that the variation in the validity measure may show an inflection point (or “elbow”) at the correct choice of parameter. Of course, because these measures are flawed to begin with, such techniques should be used with great caution. Furthermore, the shape of the inflection point may vary significantly with the nature of the parameter being tuned, and the validation measure being used. Consider the case of k -means clustering where the parameter being tuned is the number of clusters k. In such a case, the SSQ measure will always reduce with the number of clusters, though it will reduce at a sharply lower rate after the inflection point. On the other hand, for a measure such as the ratio of the intra-cluster to inter-cluster distance, the measure will reduce until the inflection point and then may increase slightly. An example of these two kinds of inflections are illustrated in Fig. 6.24. The X-axis indicates the parameter being tuned (number of clusters), and the Y -axis illustrates the (relative) values of the validation measures. In many cases, if the validation model does not reflect either the natural shape of the clusters in the data, or the algorithmic model used to create the clusters very well, such inflection points may either be misleading, or not even be observed. However, plots such as those illustrated in Fig. 6.24 can be used in conjunction with visual inspection of the scatter plot of the data and the algorithm partitioning to determine the correct number of clusters in many cases. Such tuning techniques with internal measures should be used as an informal rule of thumb, rather than as a strict criterion.
6.9.2 External Validation Criteria
Such criteria are used when ground truth is available about the true clusters in the under-lying data. In general, this is not possible in most real data sets. However, when synthetic data is generated from known benchmarks, it is possible to associate cluster identifiers with the generated records. In the context of real data sets, these goals can be approximately achieved with the use of class labels when they are available. The major risk with the use of class labels is that these labels are based on application-specific properties of that data set and may not reflect the natural clusters in the underlying data. Nevertheless, such criteria
Dostları ilə paylaş: |