Data Mining: The Textbook

Date: 07.01.2024

$$\text{Sim}(T_i, T_j) = \frac{|T_i \cap T_j|}{|T_i \cup T_j|}. \tag{7.1}$$
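Equation 7.1 is the Jaccard coefficient between two item sets. A minimal sketch of it in Python follows, assuming each data point is represented as a set of items. The function name `jaccard` and the convention that two empty sets have similarity 0 are illustrative choices, not taken from the text.

```python
def jaccard(t_i, t_j):
    """Return |T_i ∩ T_j| / |T_i ∪ T_j| for two item sets (Equation 7.1)."""
    a, b = set(t_i), set(t_j)
    union = a | b
    if not union:
        # Assumption: two empty transactions are given similarity 0.
        return 0.0
    return len(a & b) / len(union)


# Two shared items out of four distinct items overall.
print(jaccard({"bread", "milk", "eggs"}, {"bread", "milk", "beer"}))  # 0.5
```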

Subsequently, two data points T_i and T_j are defined to be neighbors if the similarity Sim(T_i, T_j) between them is greater than a threshold θ. Thus, the concept of neighbors implicitly defines a graph structure on the data items, where the nodes correspond to the data items, and the links correspond to the neighborhood relations. The notation Link(T_i, T_j) denotes a shared nearest-neighbor similarity function, which is equal to the number of shared nearest neighbors between T_i and T_j.
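The neighbor graph and the shared-neighbor count can be sketched as follows. This is an illustrative sketch rather than a reference implementation: the names `neighbor_sets` and `link` are hypothetical, the toy transactions are invented for the example, and the sketch assumes that a point is not counted as its own neighbor.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient of two sets (0.0 if both are empty, by assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def neighbor_sets(transactions, theta):
    """For each point, collect the indices of points with similarity > theta."""
    nbrs = [set() for _ in transactions]
    for i, j in combinations(range(len(transactions)), 2):
        if jaccard(transactions[i], transactions[j]) > theta:
            nbrs[i].add(j)
            nbrs[j].add(i)
    return nbrs


def link(nbrs, i, j):
    """Number of neighbors shared by points i and j."""
    return len(nbrs[i] & nbrs[j])


# Toy data (invented): three overlapping transactions and one unrelated one.
transactions = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}, {"x", "y"}]
nbrs = neighbor_sets(transactions, theta=0.4)
print(nbrs)              # [{1, 2}, {0, 2}, {0, 1}, set()]
print(link(nbrs, 0, 1))  # 1 (point 2 is the only shared neighbor)
```

Each of the first three pairs has similarity 2/4 = 0.5, which exceeds θ = 0.4, so those points form a triangle in the neighbor graph, while the fourth point is isolated.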

The similarity function Link(T_i, T_j) provides a merging criterion for agglomerative algorithms. The algorithm starts with each data point (from the initially chosen sample) in its own cluster and then hierarchically merges clusters based on a similarity criterion between clusters. Intuitively, two clusters C₁ and C₂ should be merged if the cumulative number of shared nearest neighbors between objects in C₁ and C₂ is large. Therefore, it is possible to generalize the notion of link-based similarity using clusters as arguments, as opposed to individual data points:

$$\text{GroupLink}(C_i, C_j) = \sum_{T_u \in C_i,\, T_v \in C_j} \text{Link}(T_u, T_v). \tag{7.2}$$
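Equation 7.2 is a double sum over all cross-cluster pairs. A minimal sketch follows, assuming the pairwise link counts are already available in a symmetric matrix indexed by point. The name `group_link` and the matrix values are hypothetical and chosen only for illustration.

```python
def group_link(link_matrix, cluster_i, cluster_j):
    """Sum Link(T_u, T_v) over all T_u in cluster_i and T_v in cluster_j."""
    return sum(link_matrix[u][v] for u in cluster_i for v in cluster_j)


# Hypothetical precomputed link counts for four points (symmetric matrix).
link_matrix = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]

print(group_link(link_matrix, [0, 1], [2]))     # 2
print(group_link(link_matrix, [0, 1, 2], [3]))  # 0
```

Because the value is a raw sum, it tends to grow with the number of cross-cluster pairs, which is why the unnormalized measure favors larger clusters.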

Note that this criterion bears some resemblance to the group-average linkage criterion discussed in the previous chapter. However, this measure is not yet normalized: the expected number of cross-links grows with cluster size, so larger clusters tend to have more cross-links. One must therefore normalize by the expected number of cross-links between a pair of clusters, so that the merging of larger clusters is not unreasonably favored. The normalized linkage criterion V(C_i, C_j) is as follows:

V(C_i, C_j) =
