Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	370/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 366 367 368 369 370 371 372 373 ... 423

1-Data Mining tarjima

k

_i₌₁g(i);

if (G_k > 0) then exchange {x₁ . . . x_k} and
{y₁ . . . y_k} between N₁ and N₂;
until (G_k ≤ 0);
return(N₁, N₂);
end

Figure 19.3: The Kernighan–Lin algorithm

k-exchange is referred to as an epoch. The algorithm repeatedly executes such epochs of k-exchanges. If no such k-exchange with positive gain can be found, then the algorithm terminates. The overall algorithm is illustrated in Fig. 19.3.

The Kernighan–Lin algorithm converges rapidly to a local optimum. In fact, a very small number of epochs (fewer than five) may be required for the algorithm to terminate. Of course, there is no guarantee on the required number of epochs, considering that the problem is NP-hard. The running time of each epoch can be amortized to O (m·log(n)) time, where m is the number of edges and n is the number of nodes. Variants of the algorithm have been proposed to speed up the method significantly.

19.3.1.1 Speeding Up Kernighan–Lin

A fast variant of Kernighan–Lin is based on the modifications by Fiduccia and Mattheyses. This version can also handle weights associated with both nodes and edges. Furthermore, the approach allows the specification of the level of balance between the two partitions as a ratio. Instead of pairing nodes in an epoch to swap them, one can simply move a single node i from one partition to the other so that the gain D_i of Eq. 19.14 is as large as possible. Only nodes that can move without violating2 the balancing constraint are considered eligible for a move at each step. After moving node i, it is marked so that it will not be considered again in the current epoch. The values of D_j on the other vertices j ∈ N are updated to reflect this change. This process is repeated until either all nodes have been considered for a move in an epoch or the balancing criterion prevents further moves. The latter is possible

Moving a node from one partition to the other will frequently cause violations unless some flexibility is allowed in the balancing ratio. In practice, a slight relaxation (or small range) of required balancing ratios may be used to ensure feasible solutions.

19.3. COMMUNITY DETECTION

631

when the desired partition ratios are unbalanced, or the nodes do not have unit weights. Note that many potential moves in an epoch might have negative gain. Therefore, as in the original Kernighan–Lin algorithm, only the best partition created during an epoch is made final and the remaining moves are undone. A special data structure was also introduced by Fiduccia and Mattheyses to implement each epoch in O(m) time, where m is the number of edges. In practice, a small number of epochs is usually required for convergence in most real-world networks, although there is no guarantee on the required number of epochs.

While the original improvement of Fiduccia and Mattheyses moves as many vertices as possible in an epoch, it was observed by Karypis and Kumar that it is not necessary to do so. Rather, one can terminate an epoch, if the partitioning objective function does not improve in a predefined number n_p of moves. These n_p moves are then undone, and the epoch terminates. The typical value of n_p chosen is 50. Furthermore, it is not always necessary to move the vertex with the largest possible gain, as long as the gain is positive. Dropping the restriction of finding a vertex with the largest gain improves the per-move cost significantly. The improvements from these simple modifications are significant in many scenarios.

19.3.2 Girvan–Newman Algorithm

This algorithm uses edge lengths c _ij , rather than the edge weights w_ij . The edge lengths may be viewed as in the inverse of the edge weights. In cases, where edge weights are specified, one may heuristically transform them to edge lengths by using c_ij = 1/w_ij , or a suitable application-specific function.

The Girvan–Newman algorithm is based on the intuition that edges with high between-ness have a tendency to connect diﬀerent clusters. For example, the edges that are incident on the hub nodes in Fig. 19.2 have a high betweenness. Their high betweenness is a result of the large number of pairwise shortest paths between nodes of diﬀerent communities pass-ing through these edges. Therefore, the disconnection of these edges will result in a set of connected components that corresponds to the natural clusters in the original graph. This disconnection approach forms the basis of the Girvan–Newman algorithm.

The Girvan–Newman algorithm is a top-down hierarchical clustering algorithm that creates clusters by successively removing edges with the highest betweenness until the graph is disconnected into the required number of connected components. Because each edge removal impacts the betweenness values of some of the other edges, the betweenness values of these edges need to be recomputed after each removal. The Girvan–Newman algorithm is illustrated in Fig. 19.4.

The main challenge in the Girvan–Newman algorithm is the computation of the edge betweenness values. The computation of node betweenness values is an intermediary step in the edge-betweenness computation. Recall that all node and edge-betweenness centrality values are defined as a function of the exhaustive set of shortest paths between all source– sink pairs. These betweenness centrality values can, therefore, be decomposed into several additive components, where each component is defined by the subset of the shortest paths originating from a source node s. To compute these betweenness components, a two-step approach is used for each possible source node s:

The number of shortest paths from the source node s to every other node is computed.

The computations in the first step are used to compute the component B_s(i) of the node betweenness centrality of node i, and the component b_s(i, j) of the edge between-ness centrality of edge (i, j), that correspond to the subset of shortest paths originating from a particular source node s.

632 CHAPTER 19. SOCIAL NETWORK ANALYSIS

Algorithm GirvanNewman(Graph: G = (N, A), Number of Clusters: k,

Edge lengths: [c_ij ])
begin
Compute betweenness value of all edges in graph G;

repeat
Remove edge (i, j) from G with highest betweenness; Recompute betweenness of edges aﬀected by removal of (i, j);

until G has k components remaining;

return connected components of G;

end

Figure 19.4: The Girvan–Newman Algorithm

These source node-specific betweenness centrality components can then be added over all possible source nodes to compute the overall betweenness centrality values.

The first step in the betweenness centrality computation is to create a graph of edges that lie on at least one shortest path from node s to some other node. Such edges are referred to as tight edges for source node s. The betweenness value component of an edge for a particular source node s can be nonzero only if that edge is tight for that source node. The Dijkstra algorithm, described in Sect. 3.5.1.1 of Chap. 3, is used to determine the shortest path distances SP (j) from the source node s to node j. In order for an edge (i, j) to be tight, the following condition has to hold:

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 366 367 368 369 370 371 372 373 ... 423