Data Mining: The Textbook

Date: 07.01.2024 (page 382 of 423, 17.13 MB)


Figure 19.12: Examples of varying effectiveness of different link-prediction measures. Panel (a): many common neighbors between Alice and Bob. Panel (b): many indirect connections between Alice and Bob. Each panel marks a predicted link. [Figure not reproduced.]

19.5.2 Katz Measure

While the neighborhood-based measures provide a robust estimate of the likelihood of a link forming between a pair of nodes, they are less effective when the number of shared neighbors between the nodes is small. For example, in the case of Fig. 19.12b, Alice and Bob share one neighbor in common. Alice and Jim also share one neighbor in common. Therefore, neighborhood-based measures have difficulty distinguishing between the prediction strengths of these pairs. Nevertheless, there is significant indirect connectivity in such cases through longer paths, for which walk-based measures are more appropriate. A commonly used walk-based measure of link-prediction strength is the Katz measure.

Definition 19.5.4 (Katz Measure) Let $n_{ij}^{(t)}$ be the number of walks of length t between nodes i and j. Then, for a user-defined parameter β < 1, the Katz measure between nodes i and j is defined as follows:

$$\mathrm{Katz}(i, j) = \sum_{t=1}^{\infty} \beta^{t} \cdot n_{ij}^{(t)} \tag{19.50}$$

The value of β is a discount factor that de-emphasizes walks of longer length. For small enough values of β, the infinite summation of Eq. 19.50 will converge. If A is the symmetric adjacency matrix of an undirected network, then the n × n pairwise Katz coefficient matrix K can be computed as follows:

$$K = \sum_{i=1}^{\infty} (\beta A)^{i} = (I - \beta A)^{-1} - I \tag{19.51}$$

The eigenvalues of $A^k$ are the kth powers of the eigenvalues of A (cf. Eq. 19.33), so the terms $(\beta A)^k$ shrink to zero only if β times the largest eigenvalue of A is less than 1. Therefore, the value of β should always be selected to be smaller than the inverse of the largest eigenvalue of A to ensure convergence of the infinite summation. A weighted version of the measure can be computed by replacing A with the weight matrix of the graph. The Katz measure often provides prediction results of excellent quality.
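The closed form of Eq. 19.51 can be checked numerically against a truncated version of the summation in Eq. 19.50. The following is a minimal sketch, assuming NumPy is available; the function names `katz_matrix` and `katz_truncated`, the 4-node path graph, and the choice β = 0.3 are illustrative assumptions rather than part of the text.

```python
import numpy as np


def katz_matrix(A, beta):
    """Pairwise Katz matrix K = (I - beta*A)^{-1} - I (Eq. 19.51).

    A is assumed to be the symmetric, nonnegative adjacency (or weight)
    matrix of an undirected graph.
    """
    A = np.asarray(A, dtype=float)
    # eigvalsh is valid because A is symmetric; for a nonnegative matrix
    # the largest eigenvalue is also the largest in absolute value.
    lam_max = np.linalg.eigvalsh(A).max()
    if beta * lam_max >= 1.0:
        raise ValueError("beta must be smaller than 1 / (largest eigenvalue of A)")
    identity = np.eye(A.shape[0])
    return np.linalg.inv(identity - beta * A) - identity


def katz_truncated(A, beta, max_len):
    """Sum of (beta*A)^t for t = 1..max_len, i.e. Eq. 19.50 cut off at max_len."""
    A = np.asarray(A, dtype=float)
    term = np.eye(A.shape[0])
    total = np.zeros_like(A)
    for _ in range(max_len):
        term = term @ (beta * A)  # term now holds (beta*A)^t
        total += term
    return total


# Hypothetical toy graph: the path 0 - 1 - 2 - 3.
A = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
K = katz_matrix(A, beta=0.3)
print(np.round(K, 4))
print(np.allclose(K, katz_truncated(A, 0.3, 200)))
```

In this toy graph, the score between node 0 and each other node falls as the shortest walk between them gets longer, which is the effect of the discount factor β. For large sparse networks, one would typically solve a linear system or truncate the series rather than form the dense inverse.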

It is noteworthy that the sum of the Katz coefficients of a node i with respect to other nodes is referred to as its Katz centrality. Other mechanisms for measuring centrality, such as closeness and PageRank, are also used for link prediction in a modified form. The reason for this connection between centrality and link-prediction measures is that highly central nodes have the propensity to form links with many nodes.

19.5.3 Random Walk-Based Measures

Random walk-based measures are a different way of defining connectivity between pairs of nodes. Two such measures are PageRank and SimRank. Because these methods are described in detail in Sect. 18.4.1.2 of Chap. 18, they will not be discussed in detail here.

The first way of computing the similarity between nodes i and j is with the use of the personalized PageRank of node j, where the restart is performed at node i. The idea is that if j is in the structural proximity of i, it will have a very high personalized PageRank measure when the restart is performed at node i. This is indicative of higher link prediction strength between nodes i and j. The personalized PageRank is an asymmetric measure between nodes i and j. Because the discussion in this section is for the case of undirected graphs, one can use the average of the values of PersonalizedPageRank(i, j) and PersonalizedPageRank(j, i). Another possibility is the SimRank measure, which is already symmetric. This measure computes an inverse function of the walk length required by two random surfers moving backwards to meet at the same point. The corresponding value is reported as the link prediction measure. Readers are advised to refer to Sect. 18.4.1.2 of Chap. 18 for details of the SimRank computation.
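The averaging of the two personalized PageRank values can be sketched with a simple power iteration. This is an illustrative sketch only: the function names, the restart probability of 0.15, the fixed iteration count, the handling of isolated nodes, and the toy graph are all assumptions, and the full method is the one described in Sect. 18.4.1.2.

```python
def personalized_pagerank(adj, source, restart=0.15, iters=200):
    """Personalized PageRank vector with restarts at `source`.

    `adj` maps each node to the list of its neighbors in an undirected
    graph. `restart` and `iters` are assumed, illustrative defaults.
    """
    rank = {v: 0.0 for v in adj}
    rank[source] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        for u, nbrs in adj.items():
            if not nbrs:
                # Assumption: an isolated node sends its mass back to the source.
                nxt[source] += (1.0 - restart) * rank[u]
                continue
            share = (1.0 - restart) * rank[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        # The total mass is 1, so the restart contributes exactly `restart`.
        nxt[source] += restart
        rank = nxt
    return rank


def symmetric_ppr(adj, i, j, **kwargs):
    """Average of PersonalizedPageRank(i, j) and PersonalizedPageRank(j, i)."""
    ppr_from_i = personalized_pagerank(adj, i, **kwargs)
    ppr_from_j = personalized_pagerank(adj, j, **kwargs)
    return 0.5 * (ppr_from_i[j] + ppr_from_j[i])


# Hypothetical toy graph (not the graph of Fig. 19.12).
adj = {
    "alice": ["mary", "tom"],
    "bob": ["mary", "tom"],
    "mary": ["alice", "bob"],
    "tom": ["alice", "bob", "jim"],
    "jim": ["tom"],
}
print(symmetric_ppr(adj, "alice", "bob"))
print(symmetric_ppr(adj, "alice", "jim"))
```

In this toy graph, Alice and Bob share two neighbors while Alice and Jim share one, and the symmetric score ranks the Alice-Bob pair higher.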

19.5.4 Link Prediction as a Classification Problem

The aforementioned measures are unsupervised heuristics. For a given network, one of these measures might be more effective, whereas another might be more effective for a different network. How can one resolve this dilemma and select the measures that are most effective for a given network?

The link prediction problem can be viewed as a classification problem by treating the presence or absence of a link between a pair of nodes as a binary class indicator. Thus, a multidimensional data record can be extracted for each pair of nodes. The features of this record include all the different neighborhood-based, Katz-based, or walk-based similarities between the nodes. In addition, a number of other preferential-attachment features, such as the degree of each node in the pair, are used. The result is a positive-unlabeled classification problem, where node pairs with edges are the positive examples, and the remaining pairs are unlabeled examples. The unlabeled examples can be approximately treated as negative examples for training purposes. Because there are too many negative example pairs in large and sparse networks, only a sample of the negative examples is used. Therefore, the supervised link prediction algorithm works as follows:
