Sim(Ti, Tj ) =
|
|Ti ∩ Tj |
|
.
|
(7.1)
|
|
|
|
|Ti ∪ Tj |
|
|
|
Subsequently, two data points Ti and Tj are defined to be neighbors, if the similarity Sim(Ti, T j ) between them is greater than a threshold θ. Thus, the concept of neighbors implicitly defines a graph structure on the data items, where the nodes correspond to
210 CHAPTER 7. CLUSTER ANALYSIS: ADVANCED CONCEPTS
the data items, and the links correspond to the neighborhood relations. The notation Link(Ti , T j ) denotes a shared nearest-neighbor similarity function, which is equal to the number of shared nearest neighbors between Ti and Tj .
The similarity function Link(T i, Tj ) provides a merging criterion for agglomerative algo-rithms. The algorithm starts with each data point (from the initially chosen sample) in its own cluster and then hierarchically merges clusters based on a similarity criterion between clusters. Intuitively, two clusters C1 and C2 should be merged, if the cumulative number of shared nearest neighbors between objects in C1 and C2 is large. Therefore, it is possible to generalize the notion of link-based similarity using clusters as arguments, as opposed to individual data points:
GroupLink(Ci, Cj ) =
|
Link(Tu, Tv).
|
(7.2)
|
|
Tu∈Ci,Tv ∈Cj
|
|
Note that this criterion has a slight resemblance to the group-average linkage criterion discussed in the previous chapter. However, this measure is not yet normalized because the expected number of cross-links between larger clusters is greater. Therefore, one must normalize by the expected number of cross-links between a pair of clusters to ensure that the merging of larger clusters is not unreasonably favored. Therefore, the normalized linkage criterion V (Ci, Cj ) is as follows:
Dostları ilə paylaş: |