Data Mining: The Textbook




Figure 19.6: Illustration of coarsening and partitioning inheritance in uncoarsening. (a) Original graph with inherited partition from the coarsened graph; (b) coarsened graph with partition.


The corresponding node weights and edge weights are also illustrated in the same figure. A good partitioning of this smaller coarsened graph maps to an approximate partitioning of the original graph. Therefore, one possible approach is to compress the original graph into a smaller one with a coarsening heuristic, partition this smaller graph efficiently with any off-the-shelf algorithm, and finally map the resulting partition back onto the original graph. An example of mapping a partition of the coarsened graph to the original graph is also illustrated in Fig. 19.6. The resulting partition can be refined with an algorithm such as the Kernighan–Lin algorithm. The multilevel scheme enhances this basic approach with multiple levels of coarsening and refinement to obtain a good trade-off between quality and efficiency. The multilevel partitioning scheme uses three phases:



  1. Coarsening phase: Carefully chosen sets of nodes in the original graph G = G0 are contracted to create a sequence of successively smaller graphs, G0, G1, G2, . . . , Gr. To perform a single step of coarsening from Gm−1 to Gm, small sets of nonoverlapping and tightly interconnected nodes are identified, and each such set is contracted into a single node. The heuristics for identifying these node sets are discussed in detail later. The final graph Gr typically contains fewer than 100 nodes. The small size of this final graph is important in the context of the second (partitioning) phase. The different levels of coarsening created by this phase also provide important reference points for the later uncoarsening phase.




  2. Partitioning phase: Any off-the-shelf algorithm can be used to create a high-quality balanced partitioning of the smallest graph Gr. Examples include the spectral approach of Sect. 19.3.4 and the Kernighan–Lin algorithm. It is much easier to obtain a high-quality partitioning on a small graph. This high-quality partitioning provides a good starting point for refinement during the uncoarsening phase. Even relatively poor partitionings of this coarsest graph often map to good partitionings of the uncontracted graph, because the edges collapsed during coarsening are not eligible to be cut during this phase.

Figure 19.7: The multilevel framework [301] of METIS (coarsening phase G0, G1, . . . , Gr−1, Gr; initial partitioning phase applying a black-box partitioning algorithm to Gr; uncoarsening phase refining the partition back through Gr−1, . . . , G1, G0)



  3. Uncoarsening phase (refinement): In this phase, the graphs are expanded back to their successively larger versions Gr, Gr−1, . . . , G0. Whenever the graph Gm is expanded to Gm−1, the latter inherits the partitioning from Gm. This inheritance is illustrated in Fig. 19.6. The fast variant of the Kernighan–Lin scheme, discussed in Sect. 19.3.1.1, is applied to this partitioning of Gm−1 to refine it further before Gm−1 is expanded to Gm−2. Therefore, graph Gm−2 inherits the refined partition from Gm−1. The refinement phase is usually extremely fast because the Kernighan–Lin algorithm starts with a very high-quality approximate partition of Gm−1.

A pictorial representation of the multilevel scheme, based on an illustration in [301], is provided in Fig. 19.7. Note that the second and third phases use off-the-shelf schemes that are discussed in other parts of this chapter. Therefore, the following discussion will focus only on the first phase of coarsening.
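Before turning to coarsening, the overall control flow of the three phases can be summarized in a short sketch. The following Python-style sketch assumes networkx-style graph objects and hypothetical helper functions coarsen_once, initial_partition, and refine_kl (these names are illustrative and are not part of METIS or any library); it is a sketch of the scheme under these assumptions, not the actual METIS implementation.

    def multilevel_partition(G, max_coarse_nodes=100):
        # Phase 1 (coarsening): contract matched node pairs to build the
        # hierarchy G0, G1, ..., Gr of successively smaller graphs.
        graphs, mappings = [G], []          # mappings[m]: node of G_m -> node of G_{m+1}
        while graphs[-1].number_of_nodes() > max_coarse_nodes:
            coarser, node_map = coarsen_once(graphs[-1])     # hypothetical helper
            graphs.append(coarser)
            mappings.append(node_map)

        # Phase 2 (initial partitioning): apply any off-the-shelf method,
        # e.g., spectral bisection or Kernighan-Lin, to the coarsest graph Gr.
        partition = initial_partition(graphs[-1])            # hypothetical helper

        # Phase 3 (uncoarsening): each G_{m-1} inherits the partition of G_m
        # (a node receives the label of the coarse node it was contracted into),
        # and the fast Kernighan-Lin variant refines it before the next expansion.
        for m in range(len(graphs) - 1, 0, -1):
            partition = {v: partition[mappings[m - 1][v]] for v in graphs[m - 1].nodes()}
            partition = refine_kl(graphs[m - 1], partition)  # hypothetical helper
        return partition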


A number of techniques with varying levels of complexity are used for coarsening. In the following, a few simple schemes are described that coarsen only by matching pairs of nodes in a given phase. For a pair of nodes to be matched, they must be connected by an edge. Consequently, the coarsened graph contains at least half as many nodes as the original graph. In spite of their simplicity, these matching-based schemes turn out to be surprisingly effective in the context of the overall clustering algorithm.





  1. Random edge matching: A node i is selected at random and matched to a randomly chosen unmatched node that is adjacent to it. If no such unmatched node exists, the node remains unmatched. Matching continues until no adjacent unmatched pair remains in the graph.




  2. Heavy edge matching: As in random edge matching, a node i is selected at random and matched to an adjacent unmatched node. The difference is that the unmatched node j is chosen so that the incident edge (i, j) has the largest weight. The intuition is that it is better to contract heavy edges because they are less likely to be part of an optimal partitioning.





  3. Heavy clique matching: The contraction of densely connected sets of nodes in the graph will maximize the number of collapsed edges. This method tracks the weight vi of node i, which corresponds to the number of contracted nodes it represents. Furthermore, the notation si denotes the sum of the weights of the edges collapsed into node i (or its precursors) in previous contraction phases. Note that if the contracted node i represents a clique of the original graph, then si approaches vi · (vi − 1)/2. Because it is desirable to contract dense components, one must try to ensure that the value of si resulting from a contraction approaches this upper limit. This is achieved by computing the edge density μij ∈ (0, 1) of edge (i, j):

μij = 2 · (si + sj + wij) / [(vi + vj) · (vi + vj − 1)]    (19.23)













When nodes across high-density edges are contracted, they typically correspond to cliques of the original graph G = G0, if it is unweighted. Even for weighted graphs, the use of high edge density is generally quite effective. The nodes of the graph are visited in random order, and each node is matched to its highest-density unmatched neighbor, as sketched below. Unlike heavy edge matching, the heavy clique matching approach is not myopic to the contractions that have occurred in previous phases of the algorithm.
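As an illustration, the sketch below performs a single heavy clique matching pass on a networkx graph and builds the corresponding coarsened graph. The attribute names used here ("v" for the number of contracted nodes, "s" for the collapsed edge weight, "weight" for the edge weight) are assumptions made for this example rather than part of any standard API, and the code is a simplified sketch, not the METIS coarsening routine.

    import random
    import networkx as nx

    def heavy_clique_coarsen(G):
        # One coarsening step: match each node to its highest edge-density
        # unmatched neighbor (Eq. 19.23) and contract matched pairs.
        matched, match_of = set(), {}
        nodes = list(G.nodes())
        random.shuffle(nodes)                         # visit nodes in random order
        for i in nodes:
            if i in matched:
                continue
            best, best_mu = None, -1.0
            for j in G.neighbors(i):
                if j in matched:
                    continue
                vi, vj = G.nodes[i]["v"], G.nodes[j]["v"]   # contracted node counts
                si, sj = G.nodes[i]["s"], G.nodes[j]["s"]   # collapsed edge weights
                wij = G[i][j]["weight"]
                mu = 2.0 * (si + sj + wij) / ((vi + vj) * (vi + vj - 1))
                if mu > best_mu:
                    best, best_mu = j, mu
            if best is not None:
                matched.update({i, best})
                match_of[i] = best                    # i is the pair's representative

        # Build the coarsened graph: each matched pair becomes one super-node.
        H, super_of = nx.Graph(), {}
        for i in G.nodes():
            if i in match_of:
                j = match_of[i]
                super_of[i] = super_of[j] = (i, j)
                H.add_node((i, j), v=G.nodes[i]["v"] + G.nodes[j]["v"],
                           s=G.nodes[i]["s"] + G.nodes[j]["s"] + G[i][j]["weight"])
            elif i not in matched:                    # unmatched nodes are copied over
                super_of[i] = (i,)
                H.add_node((i,), v=G.nodes[i]["v"], s=G.nodes[i]["s"])
        for i, j, data in G.edges(data=True):
            a, b = super_of[i], super_of[j]
            if a != b:                                # skip the contracted edges
                w = data["weight"] + (H[a][b]["weight"] if H.has_edge(a, b) else 0.0)
                H.add_edge(a, b, weight=w)
        return H, super_of

Random edge matching and heavy edge matching follow the same template, with the neighbor chosen at random or by the largest wij instead of by μij.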


The multilevel scheme is effective because of its hierarchical approach, in which the early clustering of coarsened graphs ensures a good initial global structure for the bisection. In other words, key components of the graph are assigned to the appropriate partitions early on, in the form of coarsened nodes. This partition is then successively improved in the refinement phases. Such an approach avoids local optima more effectively because of its “big picture” approach to clustering.


19.3.4 Spectral Clustering

It is assumed that the nodes are unweighted, though the edge (i, j) is associated with the weight wij. The n × n matrix of weights is denoted by W. The spectral method uses a graph embedding approach, so that the local clustering structure of the network is preserved by the embedding of the nodes into multidimensional space. The idea is to create a multidimensional representation of the graph so that a standard k-means algorithm can be used on the transformed representation.


The simpler problem of mapping the nodes onto a 1-dimensional space will be discussed first. The generalization to the k-dimensional case is relatively straightforward. We would like to map the nodes in N into a set of 1-dimensional real values y1 . . . yn on a line, so that the distances between these points reflect the connectivity among the nodes. Therefore, it is undesirable for nodes that are connected with high-weight edges to be mapped onto distant points on this line. This can be achieved by determining values of yi, for which the following objective function O is minimized:





O = Σ_{i=1}^n Σ_{j=1}^n wij · (yi − yj)^2    (19.24)

This objective function penalizes the distance between yi and yj with a weight proportional to wij. Therefore, when wij is very large, the points yi and yj are more likely to be close to one another in the embedded space. The objective function O can be rewritten in terms of the Laplacian matrix L of the weight matrix W. The Laplacian matrix L is defined as Λ − W, where Λ is a diagonal matrix satisfying Λii = Σ_{j=1}^n wij. Let the n-dimensional column vector of embedded values be denoted by y = (y1 . . . yn)^T. It can be shown, after some algebraic rearrangement of Eq. 19.24, that the objective function O can be rewritten in terms of the Laplacian matrix:



O = 2 · y^T L y    (19.25)
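Here, the rearrangement uses the symmetry wij = wji of the weight matrix:

O = Σ_{i=1}^n Σ_{j=1}^n wij · (yi^2 − 2 · yi · yj + yj^2)
  = 2 · Σ_{i=1}^n Λii · yi^2 − 2 · Σ_{i=1}^n Σ_{j=1}^n wij · yi · yj
  = 2 · y^T Λ y − 2 · y^T W y = 2 · y^T (Λ − W) y = 2 · y^T L y

The final expression also makes it clear that O is nonnegative, which is the basis of the positive semidefiniteness argument below.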




The matrix L is positive semidefinite with nonnegative eigenvalues because the sum-of-squares objective function O is always nonnegative. We need to incorporate a scaling constraint to ensure that the trivial solution yi = 0 for all i is not selected by the optimization. A possible scaling constraint is as follows:








y^T Λ y = 1    (19.26)




The matrix Λ is incorporated in the constraint of Eq. 19.26 to achieve a normalization that makes the resulting clusters more balanced across partitions. If Λ is not used in the constraint (i.e., it is replaced by the identity matrix), the result is referred to as unnormalized spectral clustering. In practice, the effect of this normalization is that low-degree nodes tend to clearly “pick sides” with either large positive or large negative values of yi, whereas very high-degree nodes, which might also be hub nodes, are embedded closer to central regions near the origin (see Exercise 7). Note that each diagonal entry Λii, which is the sum of the weights of the edges incident on node i, can be viewed as the local density of the network at node i. It can also be shown that incorporating Λ in the constraint approximates an unnormalized embedding in which the edge weights wij = wji have been divided by the geometric mean √(Λii · Λjj) of the local densities at their end points (see Exercise 8). As discussed in Chap. 3, normalizing distance or similarity values with local densities often helps in obtaining high-quality results that are more accurate in their local context.


This constrained optimization formulation can be solved by setting the gradient of its Lagrangian relaxation y^T L y − λ · (y^T Λ y − 1) to 0. It can be shown that the resulting optimality condition is Λ^{-1} L y = λ · y, where λ is the Lagrangian parameter. In other words, y is an eigenvector of Λ^{-1} L and λ is its eigenvalue. Furthermore, this condition can be used to show that the objective function O = 2 · y^T L y evaluates to twice the eigenvalue λ for an eigenvector y satisfying it. Therefore, among the alternative eigenvector solutions for y, the optimal solution is the eigenvector corresponding to the smallest nontrivial eigenvalue of the normalized Laplacian Λ^{-1} L. The smallest eigenvalue of Λ^{-1} L is always 0, and it corresponds to the trivial solution in which the node embedding y is proportional to (1, 1, . . . , 1)^T. Such a trivial 1-dimensional embedding maps every node to the same point. Because this trivial eigenvector is noninformative, it is discarded and not used in the analysis. The second smallest eigenvector then provides an optimal solution that is more informative.
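Explicitly, setting the gradient of the Lagrangian with respect to y to 0 yields the stated condition:

∇_y [ y^T L y − λ · (y^T Λ y − 1) ] = 2 · L y − 2 · λ · Λ y = 0   ⟹   L y = λ · Λ y   ⟹   Λ^{-1} L y = λ · y

Substituting L y = λ · Λ y into O = 2 · y^T L y and using the constraint y^T Λ y = 1 gives O = 2λ, which is why the eigenvector with the smallest nontrivial eigenvalue is optimal.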


This model can be generalized to a k-dimensional embedding by setting up an analogous optimization formulation whose decision variables form an n × k matrix Y with k column vectors Y = [y1 . . . yk], one for each dimension of the embedding. This optimization formulation minimizes the trace of the k × k matrix Y^T L Y subject to the normalization constraints Y^T Λ Y = I. Because of the presence of Λ in the constraint, the columns of Y will not necessarily be orthogonal. The optimal solutions for these k column vectors can be shown to be proportional to the successive (not necessarily orthogonal) right eigenvectors of the asymmetric matrix Λ^{-1} L with increasing eigenvalues. After discarding the first trivial eigenvector e1 with eigenvalue λ1 = 0, this results in a set of k eigenvectors e2, e3 . . . ek+1, with corresponding eigenvalues λ2 ≤ λ3 ≤ . . . ≤ λk+1. Because k eigenvectors were selected, this approach creates an n × k matrix Dk = Y, corresponding to a k-dimensional embedding of each of the n nodes. Note that even though the normalization constraint Y^T Λ Y = I will not result in the columns of Dk having an L2-norm of 1, each column (eigenvector) of Dk is scaled to an L2-norm of 1 as a post-processing step (see footnote 4). Because of this column scaling, the n × k matrix Dk does not exactly reflect the original optimization formulation in terms of Y. The resulting k-dimensional embedding preserves the clustering structure of the nodes because the optimization formulation of Y tries to minimize distances between highly connected nodes. Therefore, any multidimensional clustering algorithm discussed in Chap. 6, such as k-means, can be applied to this embedded representation to generate the clusters of nodes. This formulation is also sometimes referred to as the random walk version of spectral clustering because of an interpretation in terms of random walks. It is noteworthy that the small eigenvectors of the normalized Laplacian Λ^{-1} L are the same as the large eigenvectors of the stochastic transition matrix Λ^{-1} W (see Exercise 15).
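A compact sketch of this random walk version is shown below, using numpy, scipy, and scikit-learn. It assumes a dense, symmetric, nonnegative weight matrix W with no isolated nodes (so that Λ is invertible); it is meant to illustrate the formulation above, not to reproduce any particular library's implementation.

    import numpy as np
    from scipy.linalg import eigh            # dense generalized eigensolver
    from sklearn.cluster import KMeans

    def random_walk_spectral_clustering(W, k):
        # Local densities and Laplacian: Lam_ii = sum_j w_ij, L = Lam - W.
        Lam = np.diag(W.sum(axis=1))
        L = Lam - W
        # Solving L y = lambda * Lam y is equivalent to finding the small
        # eigenvectors of Lam^{-1} L; eigenvalues are returned in ascending order.
        vals, vecs = eigh(L, Lam)
        Dk = vecs[:, 1:k + 1]                # discard the trivial eigenvector
        Dk = Dk / np.linalg.norm(Dk, axis=0, keepdims=True)   # unit-norm columns
        # Any multidimensional clustering method can be applied to the embedding.
        return KMeans(n_clusters=k, n_init=10).fit_predict(Dk)

For example, calling random_walk_spectral_clustering(W, 2) on the weight matrix of a graph with two loosely connected communities typically assigns one label to each community.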

An equivalent way of setting up the spectral clustering model is to use the related vector of decision variables z = Λ^{1/2} y in the optimization formulation of Eqs. 19.25 and 19.26. This related version is referred to as the symmetric version of the spectral clustering model, although it differs from the random walk version only in the scaling of the decision variables. By setting z = Λ^{1/2} y, it can be shown that the earlier formulation is equivalent to optimizing z^T Λ^{-1/2} L Λ^{-1/2} z subject to z^T z = 1. We determine the smallest k (orthogonal) eigenvectors of the symmetric normalized Laplacian Λ^{-1/2} L Λ^{-1/2}, excluding the first. Each eigenvector of this matrix can also be (proportionally) obtained by pre-multiplying the aforementioned solution Y of the random walk formulation with the diagonal matrix Λ^{1/2}. This relationship also reflects the relationship between z and y. The eigenvalues are the same in both cases. For example, the first eigenvector with eigenvalue 0 will no longer be a vector of 1s, but its entries will be proportional to (√Λ11 . . . √Λnn)^T. Because of this differential scaling of the various nodes, high-degree nodes will tend to have larger (absolute) coordinate values in the symmetric version. By selecting the smallest k eigenvectors, one can generate an n × k multidimensional representation Dk of the entire set of n nodes. Just as the random walk version scales each column of Dk to unit norm in the final step, the symmetric version scales each row of Dk to unit norm. This final step of row scaling is a heuristic enhancement that adjusts for the differential scaling of nodes with various degrees, and it is not a part of the optimization formulation. Interestingly, even if the rows of the random walk solution Y had been scaled to unit norm (instead of the columns), exactly the same solution would be obtained as by scaling the rows of the symmetric solution Z to unit norm (see Exercise 13).
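Under the same assumptions as the previous sketch (and reusing its imports), the symmetric version differs only in the matrix that is decomposed and in the final row scaling:

    def symmetric_spectral_clustering(W, k):
        d = W.sum(axis=1)
        L = np.diag(d) - W
        Lam_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        L_sym = Lam_inv_sqrt @ L @ Lam_inv_sqrt               # Lam^{-1/2} L Lam^{-1/2}
        vals, vecs = np.linalg.eigh(L_sym)                    # ascending eigenvalues
        Dk = vecs[:, 1:k + 1]                                 # discard the trivial eigenvector
        Dk = Dk / np.linalg.norm(Dk, axis=1, keepdims=True)   # unit-norm rows
        return KMeans(n_clusters=k, n_init=10).fit_predict(Dk)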


Although the two different ways of performing spectral clustering are equivalent in terms of the optimization problem solved, they differ in their heuristic scaling adjustments. The scaling relationships are illustrated in Fig. 19.8. It is evident from Fig. 19.8 that the main practical difference between the two methods is regulated only by the heuristic scaling used in the final phase, rather than by their respective optimization models. Because of the scaling variations, the clusters obtained in the two cases will not be exactly the same, and the relative quality will depend on the data set at hand. These optimization problems can also be understood as linear programming relaxations of integer-programming formulations of balanced minimum cut problems. However, the minimum-cut explanation does not








Footnote 4: In practice, the unit eigenvectors of Λ^{-1} L can be directly computed, and therefore an explicit post-processing step is not required.







yükləyin