Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə58/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   54   55   56   57   58   59   60   61   ...   423
1-Data Mining tarjima

SP (s, j) = min{SP (s, j), SP (s, i) + cij }.

(3.22)

86 CHAPTER 3. SIMILARITY AND DISTANCES


This is the essence of the well-known Dijkstra algorithm. This approach is linear in the number of edges in the network, because it examines each node and its incident edges exactly once. The approach provides the distances from a single node to all other nodes in a single pass. The final value of SP (s, j ) provides a quantification of the structural distance between node s and node j. Structural distance -based measures do not leverage the multiplicity in paths between a pair of nodes because they focus only on the raw structural distances.


3.5.1.2 Random Walk-Based Similarity


The structural measure of the previous section does not work well when pairs of nodes have varying numbers of paths between them. For example, in Fig. 3.10, the shortest-path length between nodes A and B is 4, whereas that between A and C is 3. Yet, node B should be considered more similar to A because the two nodes are more tightly connected with a multiplicity of paths. The idea of random walk-based similarity is based on this principle.


In random walk-based similarity, the approach is as follows: Imagine a random walk that starts at source node s, and proceeds to an adjacent node with weighted probability proportional to wij . Furthermore, at any given node, it is allowed to “jump back” to the source node s with a probability referred to as the restart probability. This will result in a probability distribution that is heavily biased toward the source node s . Nodes that are more similar to s will have higher probability of visits. Such an approach will adjust very well to the scenario illustrated in Fig. 3.10 because the walk will visit B more frequently.


The intuition here is the following: If you were lost in a road network and drove randomly, while taking turns randomly, which location are you more likely to reach? You are more likely to reach a location that is close by and can be reached in multiple ways. The random-walk measure therefore provides a result that is different from that of the shortest-path measure because it also accounts for multiplicity in paths during similarity computation.


This similarity computation is closely related to concept of PageRank, which is used to rank pages on the Web by search engines. The corresponding modification for measuring similarity between nodes is also referred to as personalized PageRank, and a symmetric variant is referred to as SimRank. This chapter will not discuss the details of PageRank and SimRank computation, because it requires more background on the notion of ranking. Refer to Sect. 18.4 of Chap. 18, which provides a more complete discussion.


3.5.2 Similarity Between Two Graphs


In many applications, multiple graphs are available, and it is sometimes necessary to deter-mine the distances between multiple graphs. A complicating factor in similarity computation is that many nodes may have the same label, which makes them indistinguishable. Such cases arise often in domains such as chemical compound analysis. Chemical compounds can be represented as graphs where nodes are elements, and bonds are edges. Because an element may be repeated in a molecule, the labels on the nodes are not distinct. Deter-mining a similarity measure on graphs is extremely challenging in this scenario, because even the very special case of determining whether the two graphs are identical is hard. The latter problem is referred to as the graph isomorphism problem, and is known to the NP-hard [221]. Numerous measures, such as the graph-edit distance and substructure-based similarity, have been proposed to address this very difficult case. The core idea in each of these methods is as follows:




Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   54   55   56   57   58   59   60   61   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin