Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə382/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   378   379   380   381   382   383   384   385   ...   423
1-Data Mining tarjima

SAYANI

NICOLE













JOHN




























JIM

ALICE

SAYANI

JIM

























PREDICTED LINK




ALICE

BOB













JILL
PETER
TOM



MICHAEL



MARY
MARY
BOB



(a) Many common neighbors

between Alice and Bob


(b) Many indirect connections


between Alice and Bob





Figure 19.12: Examples of varying effectiveness of different link-prediction measures


19.5.2 Katz Measure


While the neighborhood-based measures provide a robust estimation of the likelihood of a link forming between a pair of nodes, they are not quite as effective when the number of shared neighbors between a pair of nodes is small. For example, in the case of Fig. 19.12b, Alice and Bob share one neighbor in common. Alice and Jim also share one neighbor in common. Therefore, neighborhood-based measures have difficulty in distinguishing between different pairwise prediction strengths in these cases. Nevertheless, there also seems to be a significant indirect connectivity in these cases through longer paths. In such cases, walk-based measures are more appropriate. A particular walk-based measure that is used commonly to measure the link-prediction strength is the Katz measure.


Definition 19.5.4 (Katz Measure) Let n(ijt) be the number of walks of length t between nodes i and j. Then, for a user-defined parameter β < 1, the Katz measure between nodes i and j is defined as follows:











Katz(i, j) =

βt · nij(t)

(19.50)




t=1




The value of β is a discount factor that de-emphasizes walks of longer length. For small enough values of β, the infinite summation of Eq. 19.50 will converge. If A is the symmetric adjacency matrix of an undirected network, then the n × n pairwise Katz coefficient matrix K can be computed as follows:










K = (βA)i = (I − βA)1 − I

(19.51)

i=1

The eigenvalues of Ak are the k th powers of the eigenvalues of A (cf. Eq. 19.33). The value of β should always be selected to be smaller than the inverse of the largest eigenvalue of A to ensure convergence of the infinite summation. A weighted version of the measure can



19.5. LINK PREDICTION

653

be computed by replacing A with the weight matrix of the graph. The Katz measure often provides prediction results of excellent quality.


It is noteworthy that the sum of the Katz coefficients of a node i with respect to other nodes is referred to as its Katz centrality. Other mechanisms for measuring centrality, such as closeness and PageRank, are also used for link prediction in a modified form. The reason for this connection between centrality and link-prediction measures is that highly central nodes have the propensity to form links with many nodes.

19.5.3 Random Walk-Based Measures


Random walk-based measures are a different way of defining connectivity between pairs of nodes. Two such measures are PageRank and SimRank. Because these methods are described in detail in Sect. 18.4.1.2 of Chap. 18, they will not be discussed in detail here.


The first way of computing the similarity between nodes i and j is with the use of the personalized PageRank of node j, where the restart is performed at node i. The idea is that if j is the structural proximity of i, it will have a very high personalized PageRank measure, when the restart is performed at node i. This is indicative of higher link prediction strength between nodes i and j. The personalized PageRank is an asymmetric measure between nodes i and j. Because the discussion in this section is for the case of undirected graphs, one can use the average of the values of P ersonalizedP ageRank (i, j) and P ersonalizedP ageRank (j, i). Another possibility is the SimRank measure that is already a symmetric measure. This measure computes an inverse function of the walk length required by two random surfers moving backwards to meet at the same point. The corresponding value is reported as the link prediction measure. Readers are advised to refer to Sect. 18.4.1.2 of Chap. 18 for details of the SimRank computation.


19.5.4 Link Prediction as a Classification Problem

The aforementioned measures are unsupervised heuristics. For a given network, one of these measures might be more effective, whereas another might be more effective for a different network. How can one resolve this dilemma and select the measures that are most effective for a given network?


The link prediction problem can be viewed as a classification problem by treating the presence or absence of a link between a pair of nodes as a binary class indicator. Thus, a multidimensional data record can be extracted for each pair of nodes . The features of this multidimensional record include all the different neighborhood-based, Katz-based, or walk-based similarities between nodes. In addition, a number of other preferential-attachment features, such as node-degrees of each node in the pair, are used. Thus, for each node pair, a multidimensional data record is constructed. The result is a positive-unlabeled classification problem, where node pairs with edges are the positive examples, and the remaining pairs are unlabeled examples. The unlabeled examples can be approximately treated as negative examples for training purposes. Because there are too many negative example pairs in large and sparse networks, only a sample of the negative examples is used. Therefore, the supervised link prediction algorithm works as follows:






  1. Yüklə 17,13 Mb.

    Dostları ilə paylaş:
1   ...   378   379   380   381   382   383   384   385   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin