Data Mining: The Textbook



[Figure 6.21: The merits of the k-nearest neighbor graph for handling clusters of varying shape and density. (a) Varying cluster shape: cluster A (arbitrary shape) and cluster B form the two densely connected communities of the k-nearest neighbor graph. (b) Varying cluster density: cluster C (sparse), cluster D (dense), and cluster E (dense) form the three densely connected communities of the k-nearest neighbor graph.]


Clusters in the neighborhood graph are implicitly determined by agglomerating locally dense regions. As discussed in the previous section on density-based clustering, the agglomeration of locally dense regions corresponds to arbitrarily shaped clusters. For example, in Fig. 6.21a, the data points in the arbitrarily shaped cluster A will be densely connected to one another in the k-nearest neighbor graph, but they will not be significantly connected to data points in cluster B. As a result, any community detection algorithm will be able to discover the two clusters A and B on the graph representation.
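This idea can be sketched in a few lines of code. The example below (with assumed toy data) builds a symmetrized k-nearest neighbor graph over 2-D points and then recovers clusters as connected components of the graph, the simplest possible notion of a "densely connected community":

```python
# Minimal sketch: clusters as connected components of a k-NN graph.
# The data set and the use of connected components (rather than a full
# community detection method) are illustrative assumptions.
import numpy as np

def knn_graph(X, k):
    """Adjacency sets of the symmetrized k-nearest neighbor graph."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    np.fill_diagonal(d, np.inf)                                # no self-edges
    adj = {i: set() for i in range(len(X))}
    for i in range(len(X)):
        for j in np.argsort(d[i])[:k]:                         # k closest points
            adj[i].add(int(j))
            adj[int(j)].add(i)                                 # symmetrize the edge
    return adj

def connected_components(adj):
    """Label each node by its connected component (a simple 'community')."""
    label, comp = {}, 0
    for s in adj:
        if s in label:
            continue
        stack = [s]
        while stack:
            u = stack.pop()
            if u in label:
                continue
            label[u] = comp
            stack.extend(adj[u] - label.keys())
        comp += 1
    return label

# Two well-separated groups: for small k, the k-NN graph has no cross-edges,
# so each group becomes its own component.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = connected_components(knn_graph(X, k=2))
```

In practice, a true community detection algorithm (e.g., spectral or modularity-based) replaces the connected-component step, since real k-NN graphs may contain a few cross-edges between clusters.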


Graph-based methods also adjust much better to local variations in data density (see Fig. 6.14) when they use the k-nearest neighbors to construct the neighborhood graph rather than an absolute distance threshold. This is because the k-nearest neighbors of a node are chosen by a relative comparison of distances within the locality of a data point, whether those distances are large or small. For example, in Fig. 6.21b, even though clusters D and E are closer to each other than any pair of data points in sparse cluster C, all three clusters should be considered distinct. Interestingly, a k-nearest neighbor graph will not create too many cross-connections between these clusters for small values of k. Therefore, all three clusters will be found by a community detection algorithm on the k-nearest neighbor graph in spite of their varying density. As a result, graph-based methods can provide better results than algorithms such as DBSCAN because of their ability to adjust to varying local density in addition to their ability to discover arbitrarily shaped clusters. This desirable property of k-nearest neighbor graph algorithms is not restricted to the use of spectral clustering methods in the final phase. Many other graph-based algorithms have also been shown to discover arbitrarily shaped clusters in a locality-sensitive way. These desirable properties are therefore embedded within the k-nearest neighbor graph representation and are generalizable⁵ to other data mining problems such as outlier analysis. Note that the locality-sensitivity of the shared nearest neighbor similarity function (cf. Sect. 3.2.1.8 of Chap. 3) is due to the same reason. The locality-sensitivity of many classical clustering algorithms, such as k-medoids, bottom-up algorithms, and DBSCAN, can be improved by incorporating graph-based similarity functions such as the shared nearest neighbor method.
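The shared nearest neighbor (SNN) similarity mentioned above can be sketched directly: two points are similar when they have many of their k nearest neighbors in common, which makes the similarity relative to each point's locality rather than to absolute distance. The toy data below is an illustrative assumption:

```python
# Sketch of shared nearest neighbor (SNN) similarity.
# S[i, j] counts how many of the k nearest neighbors points i and j share.
import numpy as np

def snn_similarity(X, k):
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    np.fill_diagonal(d, np.inf)                                # exclude self
    nbrs = [set(np.argsort(row)[:k]) for row in d]             # k-NN set per point
    n = len(X)
    S = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            S[i, j] = len(nbrs[i] & nbrs[j])                   # shared neighbors
    return S

# Two tight groups of four points each: within-group pairs share neighbors,
# cross-group pairs share none.
X = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2],
              [10.0, 10.0], [10.2, 10.0], [10.0, 10.2], [10.2, 10.2]])
S = snn_similarity(X, k=3)
```

Plugging such a similarity matrix into k-medoids or a bottom-up algorithm in place of raw Euclidean distance is one way to obtain the locality-sensitivity described above.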


On the other hand, high computational cost is the major drawback of graph-based algorithms. It is often expensive to construct and process an n × n matrix of similarities. Nevertheless, because similarity graphs are sparse, many recent community detection methods can exploit this sparsity to provide more efficient solutions.
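A back-of-envelope sketch (with illustrative numbers assumed) shows why this sparsity matters: the full similarity matrix grows quadratically in n, while the k-NN graph keeps at most n·k directed edges, which is precisely what sparse community detection methods exploit.

```python
# Illustrative storage comparison: dense n x n similarity matrix vs.
# sparse k-NN graph with at most k edges per node. Numbers are assumed.
n, k = 100_000, 10
dense_entries = n * n      # 10^10 similarities: impractical to store
sparse_entries = n * k     # 10^6 edges: easily held in memory
ratio = dense_entries // sparse_entries
```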








⁵ See [257], which is a graph-based alternative to the LOF algorithm for locality-sensitive outlier analysis.

