Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	367/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 363 364 365 366 367 368 369 370 ... 423

1-Data Mining tarjima

BETWEENNESS 16

7

7	6				HIGHEST		17							6
					BETWEENNESS				16			5


8						CENTRALITY					15
9											14	1		2
	1		2			3		4		5


10

11						HIGHEST					13	4		3
	HIGHEST DEGREE				12				INFLUENCE
							CLOSENESS
	CENTRALITY
						CENTRALITY						SET OF NODE 1

		(a)		centrality illustration									(b) proximity prestige

Figure 19.1: Illustration of centrality and prestige

The notion of closeness centrality is meaningfully defined with respect to undirected and connected networks. The average shortest path distance, starting from node i, is denoted by AvDist(i) and is defined in terms of the pairwise shortest path distances Dist(i, j), between nodes i and j as follows:

	n
AvDist(i) =	_j₌₁Dist(i, j)	(19.8)
AvDist(i) =	n − 1	(19.8)
	n − 1

The closeness centrality is simply the inverse of the average distance of other nodes to node i.

C_C (i) = 1/AvDist(i)

(19.9)

Because the value of AvDist(i) is at least 1, this measure ranges between 0 and 1. In the case of Fig. 19.1a, node 3 has the highest closeness centrality because it has the lowest average distance to other nodes.

A measure known as proximity prestige can be used to measure prestige in directed networks. To compute the proximity prestige of node i, the shortest path distance to node i from all other nodes is computed. Unlike undirected networks, a confounding factor in the computation is that directed paths may not exist from other nodes to node i. For example, no path exists to node 7 in Fig. 19.1b. Therefore, the first step is to determine the set of nodes Influence(i) that can reach node i with a directed path. For example, in the case of the Twitter network, Influence(i) corresponds to all recursively defined followers of node i. An example of an influence set of node 1 is illustrated in Fig. 19.1b. The value of AvDist(i) can now be computed only with respect to the influence set Influence(i).

AvDist(i) =	j∈Influence(i) ^Dist(^{j, i}⁾	(19.10)
	\|Influence(i)\|

Note that distances are computed from node j to i, and not vice versa, because we are computing a prestige measure, rather than a gregariousness measure.

Both the size of the influence set and average distance to the influence set play a role in defining the proximity prestige. While it is tempting to use the inverse of the average distance, as in the previous case, this would not be fair. Nodes that have less influence should be penalized. For example, in Fig. 19.1b, node 6 has the lowest possible distance value of 1 from node 7, which is also the only node it influences. While its low average distance to its influence set suggests high prestige, its small influence set suggests that it

626 CHAPTER 19. SOCIAL NETWORK ANALYSIS

cannot be considered a node with high prestige. To account for this, a multiplicative penalty factor is included in the measure that corresponds to the fractional size of the influence set of node i.

InfluenceFraction(i) = ^|^Influence(ⁱ⁾^| (19.11)
n − 1

Then, the proximity prestige P_P (i) is defined as follows:

P_P (i) =	InfluenceFraction(i)	(19.12)
	AvDist(i)

This value also lies between 0 and 1. Higher values indicate greater prestige. The highest possible proximity prestige value of 1 is realized at the central node of a perfectly star-structured network, with a single central actor and all other actors as its (in-linking) spokes.

In the case of Fig. 19.1b, node 1 has an influence fraction of 4/6, and an average distance of 5/4 from the four nodes that reach it. Therefore, its proximity prestige is 4 ∗ 4/(5 ∗ 6) = 16/30. On the other hand, node 6 has a better average distance of 1 to the only node that reaches it. However, because its influence fraction is only 1/6, its proximity prestige is 1/6 as well. This suggests that node 1 has better proximity prestige than node 6. This matches our earlier stated intuition that node 6 is not a very influential node.

19.2.5.3 Betweenness Centrality

While closeness centrality is based on notions of distances, it does not account for the criticality of the node in terms of the number of shortest paths that pass through it. Such notions of criticality are crucial in determining actors that have the greatest control of the flow of information between other actors in a social network. For example, while node 3 has the highest closeness centrality, it is not as critical to shortest paths between diﬀerent pairs of nodes as node 4 in Fig. 19.1a. Node 4 can be shown to be more critical because it also participates in shortest paths between the pairs of nodes directly incident on it, whereas node 3 does not participate in these pairs. The other pairs are approximately the same in the two cases. Therefore, node 4 controls the flow of information between nodes 12 and 17 that node 3 does not control.

Let q_jk denote the number of shortest paths between nodes j and k. For graphs that are not trees, there will often be more than one shortest path between pairs of nodes. Let q_jk(i) be the number of these pairs that pass through node i. Then, the fraction of pairs f_jk(i) that pass through node i is given by f_jk(i) = q_jk(i)/q_jk. Intuitively, f_jk(i) is a fraction that indicates the level of control that node i has over nodes j and k in terms of regulating the flow of information between them. Then, the betweenness centrality C_B (i) is the average value of this fraction over all ⁿ₂ pairs of nodes.

C_B (i) =	_j^fjk(ⁱ)	(19.13)
C_B (i) =	n	(19.13)
	2

The betweenness centrality also lies between 0 and 1, with higher values indicating better betweenness. Unlike closeness centrality, betweenness centrality can be defined for discon-nected networks as well.

While the aforementioned notion of betweenness centrality is designed for nodes, it can be generalized to edges by using the number of shortest paths passing through an edge (rather than a node). For example, the edges connected to the hub nodes in Fig. 19.2 have high betweenness. Edges that have high betweenness tend to connect nodes from diﬀerent

19.3. COMMUNITY DETECTION

627

clusters in the graph. Therefore, these betweenness concepts are used in many community detection algorithms, such as the Girvan–Newman algorithm. In fact, the computation of node- and edge-betweenness values is described in Sect. 19.4 on the Girvan–Newman algorithm.

19.2.5.4 Rank Centrality and Prestige

The concepts of rank centrality and prestige are defined by random surfer models. The PageRank score can be considered a rank centrality score in undirected networks and a rank prestige score in directed networks. Note that the PageRank scores are components of the largest left eigenvector of the random walk transition matrix of the social network. If the adjacency matrix is directly used instead of the transition matrix to compute the largest eigenvector, the resulting scores are referred to as eigenvector centrality scores. Eigenvector centrality scores are generally less desirable than PageRank scores because of the dispro-portionately large influence of high-degree nodes on the centrality scores of their neighbors.

Because the computation of these scores was already discussed in detail in Chap. 18, it will not be revisited here. The idea here is that a citation of a node by another node (such as a follower in Twitter) is indicative of prestige. Although this is also captured by degree prestige, the latter does not capture the prestige of the nodes incident on it. The PageRank computation can be considered a refined version of degree prestige, where the quality of the nodes incident on a particular node i are used in the computation of its prestige.

19.3 Community Detection

The term “community detection” is an approximate synonym for “clustering” in the context of social network analysis. The clustering of networks and graphs is also sometimes referred to as “graph partitioning” in the traditional work on network analysis. Therefore, the lit-erature in this area is rich and includes work from many diﬀerent fields. Much of the work on graph partitioning precedes the formal study of social network analysis. Nevertheless, it continues to be relevant to the domain of social networks. Community detection is one of the most fundamental problems in social network analysis. The summarization of closely related social groups is, after all, one of the most succinct and easily understandable ways of characterizing social structures.

In the social network domain, network clustering algorithms often have diﬃculty in cleanly separating out diﬀerent clusters because of some natural properties of typical social networks.

Multidimensional clustering methods, such as the distance-based k-means algorithm, cannot be easily generalized to networks. In small-world networks, the distances between diﬀerent pairs of nodes is a small number that cannot provide a suﬃciently fine-grained indicator of similarity. Rather, it is more important to use triadic closure properties of real networks, explicitly or implicitly, in the clustering process.

While social networks usually have distinct community structures, the high-degree hub nodes connect diﬀerent communities, thereby bringing them together. Examples of such hub nodes connecting up diﬀerent communities are illustrated in Fig. 19.2. In this case, the nodes A, B, and C are hubs that connect up diﬀerent communities. In real social networks, the structure may be even more complicated, with some of the high-degree nodes belonging to particular sets of overlapping communities.

628 CHAPTER 19. SOCIAL NETWORK ANALYSIS

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 363 364 365 366 367 368 369 370 ... 423