[Figure: (a) Varying cluster shape — cluster A (arbitrary shape) and cluster B form the two densely connected communities of the k-nearest neighbor graph; (b) Varying cluster density — cluster C (sparse), cluster D (dense), and cluster E (dense) form the three densely connected communities of the k-nearest neighbor graph]
Figure 6.21: The merits of the k-nearest neighbor graph for handling clusters of varying shape and density
neighborhood graph are implicitly determined by agglomerating locally dense regions. As discussed in the previous section on density-based clustering, the agglomeration of locally dense regions corresponds to arbitrarily shaped clusters. For example, in Fig. 6.21a, the data points in the arbitrarily shaped cluster A will be densely connected to one another in the k-nearest neighbor graph, but they will not be significantly connected to data points in cluster B. As a result, any community detection algorithm will be able to discover the two clusters A and B on the graph representation.
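For concreteness, the following sketch illustrates this idea on a synthetic two-moons data set, which stands in for the arbitrarily shaped clusters A and B of Fig. 6.21a. It builds a symmetrized k-nearest neighbor graph with scikit-learn and uses connected components as a simple stand-in for a full community detection algorithm; the data set, the value of k, and the choice of connected components are illustrative assumptions rather than part of the original discussion.

```python
from sklearn.datasets import make_moons
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import connected_components

# Two arbitrarily shaped (moon-shaped) clusters; parameters are illustrative.
X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

k = 10
A = kneighbors_graph(X, n_neighbors=k, mode='connectivity', include_self=False)
A = A.maximum(A.T)   # symmetrize: keep an edge if either endpoint lists the other as a neighbor

# Connected components of the sparse graph play the role of the
# "densely connected communities" in this simplified sketch.
n_components, labels = connected_components(A, directed=False)
print(n_components)  # ideally 2: one community per moon-shaped cluster
```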
Graph-based methods are also able to adjust much better to local variations in data density (see Fig. 6.14) when they use the k-nearest neighbors to construct the neighborhood graph rather than an absolute distance threshold. This is because the k-nearest neighbors of a node are chosen on the basis of a relative comparison of distances within the locality of a data point, whether those distances are large or small. For example, in Fig. 6.21b, even though clusters D and E are closer to each other than any pair of data points in the sparse cluster C, all three clusters should be considered distinct. Interestingly, a k-nearest neighbor graph will not create too many cross-connections between these clusters for small values of k. Consequently, all three clusters will be found by a community detection algorithm on the k-nearest neighbor graph in spite of their varying density. Graph-based methods can therefore provide better results than algorithms such as DBSCAN because of their ability to adjust to varying local density in addition to their ability to discover arbitrarily shaped clusters.

This desirable property of k-nearest neighbor graph algorithms is not restricted to the use of spectral clustering methods in the final phase. Many other graph-based algorithms have also been shown to discover arbitrarily shaped clusters in a locality-sensitive way. These desirable properties are therefore embedded within the k-nearest neighbor graph representation and are generalizable to other data mining problems such as outlier analysis. Note that the locality-sensitivity of the shared nearest neighbor similarity function (cf. Sect. 3.2.1.8 of Chap. 3) arises for the same reason. The locality-sensitivity of many classical clustering algorithms, such as k-medoids, bottom-up algorithms, and DBSCAN, can be improved by incorporating graph-based similarity functions such as the shared nearest neighbor method.
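The counting logic behind the shared nearest neighbor similarity can be sketched in a few lines. The function name snn_similarity and the default k = 10 below are illustrative assumptions, and the dense n × n output is only meant to expose the idea; a practical implementation would use sparse structures.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_similarity(X, k=10):
    """Shared nearest neighbor counts: entry (i, j) is the number of
    k-nearest neighbors that points i and j have in common."""
    n = X.shape[0]
    knn = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X, return_distance=False)
    membership = np.zeros((n, n), dtype=np.int64)
    membership[np.repeat(np.arange(n), k), knn.ravel()] = 1   # 1 if column j is a k-NN of row i
    return membership @ membership.T                          # dot products count shared neighbors

# A matrix like this can replace the raw distance matrix inside k-medoids,
# bottom-up (agglomerative) clustering, or DBSCAN to make them locality-sensitive.
```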
On the other hand, high computational costs are the major drawback of graph-based algorithms. It is often expensive to apply the approach to an n × n matrix of similarities. Nevertheless, because similarity graphs are sparse, many recent community detection methods can exploit this sparsity to provide more efficient solutions.
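As a rough illustration of the sparsity argument (the library and parameter choices below are assumptions), a k-nearest neighbor graph stores only about nk edges, whereas a dense similarity matrix requires n² entries:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

n, k = 10_000, 10
X = np.random.RandomState(0).rand(n, 2)                    # placeholder data set

A = kneighbors_graph(X, n_neighbors=k, mode='distance')    # returned as a sparse CSR matrix
print(A.nnz, "stored edges versus", n * n, "entries in a dense similarity matrix")
```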
Footnote: See [257], which is a graph-based alternative to the LOF algorithm for locality-sensitive outlier analysis.