Data Mining: The Textbook




Figure 19.6: Illustration of coarsening and partitioning inheritance in uncoarsening. (a) Original graph with inherited partition from the coarsened graph; (b) coarsened graph with partition.


The corresponding node weights and edge weights are also illustrated in the same figure. A good partitioning of this smaller coarsened graph maps to an approximate partitioning of the original graph. Therefore, one possible approach is to compress the original graph into a smaller one with a coarsening heuristic, partition this smaller graph efficiently with any off-the-shelf algorithm, and finally map the resulting partition back onto the original graph. An example of mapping a partition of the coarsened graph to the original graph is also illustrated in Fig. 19.6. The resulting partition can be refined with an algorithm such as the Kernighan–Lin algorithm. The multilevel scheme enhances this basic approach with multiple levels of coarsening and refinement to obtain a good trade-off between quality and efficiency. The multilevel partitioning scheme uses three phases:



  1. Coarsening phase: Carefully chosen sets of nodes in the original graph G = G0 are contracted to create a sequence of successively smaller graphs, G0, G1, G2, . . . , Gr. To perform a single step of coarsening from Gm−1 to Gm, small sets of nonoverlapping and tightly interconnected nodes are identified, and each such set is contracted into a single node. The heuristics for identifying these node sets are discussed in detail later. The final graph Gr typically contains fewer than 100 nodes. The small size of this final graph is important in the context of the second (partitioning) phase. The different levels of coarsening created by this phase also provide important reference points for the later uncoarsening phase.




  2. Partitioning phase: Any off-the-shelf algorithm can be used to create a high-quality balanced partitioning of the smallest graph Gr. Examples include the spectral approach of Sect. 19.3.4 and the Kernighan–Lin algorithm. It is much easier to obtain a high-quality partitioning on a small graph. This high-quality partitioning provides a good starting point for refinement during the uncoarsening phase. Even relatively poor partitionings of this coarsest graph often map to good partitionings of the uncontracted graph, because the edges collapsed during coarsening are not eligible to be cut during this phase.

Figure 19.7: The multilevel framework [301] of METIS (coarsening phase G0, G1, . . . , Gr−1, Gr; initial partitioning phase applying a black-box partitioning algorithm to Gr; uncoarsening phase refining the partition back through Gr−1, . . . , G1, G0)



  3. Uncoarsening phase (refinement): In this phase, the graphs are expanded back to their successively larger versions Gr, Gr−1, . . . , G0. Whenever the graph Gm is expanded to Gm−1, the latter inherits the partitioning from Gm. This inheritance is illustrated in Fig. 19.6. The fast variant of the Kernighan–Lin scheme, discussed in Sect. 19.3.1.1, is applied to this partitioning of Gm−1 to refine it further before Gm−1 is expanded to Gm−2. Therefore, graph Gm−2 inherits the refined partition from Gm−1. The refinement phase is usually extremely fast because the Kernighan–Lin algorithm starts with a very high-quality approximate partition of Gm−1.

A pictorial representation of the multilevel scheme, based on an illustration in [301], is provided in Fig. 19.7. Note that the second and third phases use off-the-shelf schemes that are discussed in other parts of this chapter. Therefore, the following discussion will focus only on the first phase of coarsening.
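Before turning to coarsening, the overall control flow of the three phases can be summarized in a short sketch. The following Python-style sketch assumes networkx-style graph objects and hypothetical helper functions coarsen_once, initial_partition, and refine_kl (these names are illustrative and are not part of METIS or any library); it is a sketch of the scheme under these assumptions, not the actual METIS implementation.

    def multilevel_partition(G, max_coarse_nodes=100):
        # Phase 1 (coarsening): contract matched node pairs to build the
        # hierarchy G0, G1, ..., Gr of successively smaller graphs.
        graphs, mappings = [G], []          # mappings[m]: node of G_m -> node of G_{m+1}
        while graphs[-1].number_of_nodes() > max_coarse_nodes:
            coarser, node_map = coarsen_once(graphs[-1])     # hypothetical helper
            graphs.append(coarser)
            mappings.append(node_map)

        # Phase 2 (initial partitioning): apply any off-the-shelf method,
        # e.g., spectral bisection or Kernighan-Lin, to the coarsest graph Gr.
        partition = initial_partition(graphs[-1])            # hypothetical helper

        # Phase 3 (uncoarsening): each G_{m-1} inherits the partition of G_m
        # (a node receives the label of the coarse node it was contracted into),
        # and the fast Kernighan-Lin variant refines it before the next expansion.
        for m in range(len(graphs) - 1, 0, -1):
            partition = {v: partition[mappings[m - 1][v]] for v in graphs[m - 1].nodes()}
            partition = refine_kl(graphs[m - 1], partition)  # hypothetical helper
        return partition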


A number of techniques with varying levels of complexity are used for coarsening. In the following, a few simple schemes are described that coarsen only by matching pairs of nodes in a given phase. For a pair of nodes to be matched, they must be connected by an edge. Consequently, the coarsened graph contains at least half as many nodes as the original graph. In spite of their simplicity, these matching-based schemes turn out to be surprisingly effective in the context of the overall clustering algorithm.





  1. Random edge matching: A node i is selected at random and matched to a randomly chosen unmatched node that is adjacent to it. If no such unmatched node exists, the node remains unmatched. Matching continues until no adjacent unmatched pair remains in the graph.




  2. Heavy edge matching: As in random edge matching, a node i is selected at random and matched to an adjacent unmatched node. The difference is that the unmatched node j is chosen so that the incident edge (i, j) has the largest weight. The intuition is that it is better to contract heavy edges because they are less likely to be part of an optimal partitioning.





  3. Heavy clique matching: The contraction of densely connected sets of nodes in the graph will maximize the number of collapsed edges. This method tracks the weight vi of node i, which corresponds to the number of contracted nodes it represents. Furthermore, the notation si denotes the sum of the weights of the edges collapsed into node i (or its precursors) in previous contraction phases. Note that if the contracted node i represents a clique of the original graph, then si approaches vi · (vi − 1)/2. Because it is desirable to contract dense components, one must try to ensure that the value of si resulting from a contraction approaches this upper limit. This is achieved by computing the edge density μij ∈ (0, 1) of edge (i, j):

μij = 2 · (si + sj + wij) / [(vi + vj) · (vi + vj − 1)]    (19.23)













When nodes across high-density edges are contracted, they typically correspond to cliques of the original graph G = G0, if it is unweighted. Even for weighted graphs, the use of high edge density is generally quite effective. The nodes of the graph are visited in random order, and each node is matched to its highest-density unmatched neighbor, as sketched below. Unlike heavy edge matching, the heavy clique matching approach is not myopic to the contractions that have occurred in previous phases of the algorithm.
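As an illustration, the sketch below performs a single heavy clique matching pass on a networkx graph and builds the corresponding coarsened graph. The attribute names used here ("v" for the number of contracted nodes, "s" for the collapsed edge weight, "weight" for the edge weight) are assumptions made for this example rather than part of any standard API, and the code is a simplified sketch, not the METIS coarsening routine.

    import random
    import networkx as nx

    def heavy_clique_coarsen(G):
        # One coarsening step: match each node to its highest edge-density
        # unmatched neighbor (Eq. 19.23) and contract matched pairs.
        matched, match_of = set(), {}
        nodes = list(G.nodes())
        random.shuffle(nodes)                         # visit nodes in random order
        for i in nodes:
            if i in matched:
                continue
            best, best_mu = None, -1.0
            for j in G.neighbors(i):
                if j in matched:
                    continue
                vi, vj = G.nodes[i]["v"], G.nodes[j]["v"]   # contracted node counts
                si, sj = G.nodes[i]["s"], G.nodes[j]["s"]   # collapsed edge weights
                wij = G[i][j]["weight"]
                mu = 2.0 * (si + sj + wij) / ((vi + vj) * (vi + vj - 1))
                if mu > best_mu:
                    best, best_mu = j, mu
            if best is not None:
                matched.update({i, best})
                match_of[i] = best                    # i is the pair's representative

        # Build the coarsened graph: each matched pair becomes one super-node.
        H, super_of = nx.Graph(), {}
        for i in G.nodes():
            if i in match_of:
                j = match_of[i]
                super_of[i] = super_of[j] = (i, j)
                H.add_node((i, j), v=G.nodes[i]["v"] + G.nodes[j]["v"],
                           s=G.nodes[i]["s"] + G.nodes[j]["s"] + G[i][j]["weight"])
            elif i not in matched:                    # unmatched nodes are copied over
                super_of[i] = (i,)
                H.add_node((i,), v=G.nodes[i]["v"], s=G.nodes[i]["s"])
        for i, j, data in G.edges(data=True):
            a, b = super_of[i], super_of[j]
            if a != b:                                # skip the contracted edges
                w = data["weight"] + (H[a][b]["weight"] if H.has_edge(a, b) else 0.0)
                H.add_edge(a, b, weight=w)
        return H, super_of

Random edge matching and heavy edge matching follow the same template, with the neighbor chosen at random or by the largest wij instead of by μij.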


The multilevel scheme is effective because of its hierarchical approach, in which the early clustering of coarsened graphs ensures a good initial global structure for the bisection. In other words, key components of the graph are assigned to the appropriate partitions early on, in the form of coarsened nodes. This partition is then successively improved in the refinement phases. Such an approach avoids local optima more effectively because of its “big picture” approach to clustering.


19.3.4 Spectral Clustering

It is assumed that the nodes are unweighted, though the edge (i, j) is associated with the weight wij. The n × n matrix of weights is denoted by W. The spectral method uses a graph embedding approach, so that the local clustering structure of the network is preserved by the embedding of the nodes into multidimensional space. The idea is to create a multidimensional representation of the graph so that a standard k-means algorithm can be used on the transformed representation.


The simpler problem of mapping the nodes onto a 1-dimensional space will be discussed first. The generalization to the k-dimensional case is relatively straightforward. We would like to map the nodes in N into a set of 1-dimensional real values y1 . . . yn on a line, so that the distances between these points reflect the connectivity among the nodes. Therefore, it is undesirable for nodes that are connected with high-weight edges to be mapped onto distant points on this line. This can be achieved by determining values of yi, for which the following objective function O is minimized:





O = Σ_{i=1}^n Σ_{j=1}^n wij · (yi − yj)^2    (19.24)

This objective function penalizes the distance between yi and yj with a weight proportional to wij. Therefore, when wij is very large, the points yi and yj are more likely to be close to one another in the embedded space. The objective function O can be rewritten in terms of the Laplacian matrix L of the weight matrix W. The Laplacian matrix L is defined as Λ − W, where Λ is a diagonal matrix satisfying Λii = Σ_{j=1}^n wij. Let the n-dimensional column vector of embedded values be denoted by y = (y1 . . . yn)^T. It can be shown, after some algebraic rearrangement of Eq. 19.24, that the objective function O can be rewritten in terms of the Laplacian matrix:



O = 2 · y^T L y    (19.25)
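Here, the rearrangement uses the symmetry wij = wji of the weight matrix:

O = Σ_{i=1}^n Σ_{j=1}^n wij · (yi^2 − 2 · yi · yj + yj^2)
  = 2 · Σ_{i=1}^n Λii · yi^2 − 2 · Σ_{i=1}^n Σ_{j=1}^n wij · yi · yj
  = 2 · y^T Λ y − 2 · y^T W y = 2 · y^T (Λ − W) y = 2 · y^T L y

The final expression also makes it clear that O is nonnegative, which is the basis of the positive semidefiniteness argument below.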




The matrix L is positive semidefinite with nonnegative eigenvalues because the sum-of-squares objective function O is always nonnegative. We need to incorporate a scaling constraint to ensure that the trivial solution yi = 0 for all i is not selected by the optimization. A possible scaling constraint is as follows:








y^T Λ y = 1    (19.26)




The matrix Λ is incorporated in the constraint of Eq. 19.26 to achieve a normalization that makes the resulting clusters more balanced across partitions. If Λ is not used in the constraint (i.e., it is replaced by the identity matrix), the result is referred to as unnormalized spectral clustering. In practice, the effect of this normalization is that low-degree nodes tend to clearly “pick sides” with either large positive or large negative values of yi, whereas very high-degree nodes, which might also be hub nodes, are embedded closer to central regions near the origin (see Exercise 7). Note that each diagonal entry Λii, which is the sum of the weights of the edges incident on node i, can be viewed as the local density of the network at node i. It can also be shown that incorporating Λ in the constraint approximates an unnormalized embedding in which the edge weights wij = wji have been divided by the geometric mean √(Λii · Λjj) of the local densities at their end points (see Exercise 8). As discussed in Chap. 3, normalizing distance or similarity values with local densities often helps in obtaining high-quality results that are more accurate in their local context.


This constrained optimization formulation can be solved by setting the gradient of its Lagrangian relaxation y^T L y − λ · (y^T Λ y − 1) to 0. It can be shown that the resulting optimality condition is Λ^{-1} L y = λ · y, where λ is the Lagrangian parameter. In other words, y is an eigenvector of Λ^{-1} L and λ is its eigenvalue. Furthermore, this condition can be used to show that the objective function O = 2 · y^T L y evaluates to twice the eigenvalue λ for an eigenvector y satisfying it. Therefore, among the alternative eigenvector solutions for y, the optimal solution is the eigenvector corresponding to the smallest nontrivial eigenvalue of the normalized Laplacian Λ^{-1} L. The smallest eigenvalue of Λ^{-1} L is always 0, and it corresponds to the trivial solution in which the node embedding y is proportional to (1, 1, . . . , 1)^T. Such a trivial 1-dimensional embedding maps every node to the same point. Because this trivial eigenvector is noninformative, it is discarded and not used in the analysis. The second smallest eigenvector then provides an optimal solution that is more informative.
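Explicitly, setting the gradient of the Lagrangian with respect to y to 0 yields the stated condition:

∇_y [ y^T L y − λ · (y^T Λ y − 1) ] = 2 · L y − 2 · λ · Λ y = 0   ⟹   L y = λ · Λ y   ⟹   Λ^{-1} L y = λ · y

Substituting L y = λ · Λ y into O = 2 · y^T L y and using the constraint y^T Λ y = 1 gives O = 2λ, which is why the eigenvector with the smallest nontrivial eigenvalue is optimal.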


This model can be generalized to a k-dimensional embedding by setting up an analogous optimization formulation whose decision variables form an n × k matrix Y with k column vectors Y = [y1 . . . yk], one for each dimension of the embedding. This optimization formulation minimizes the trace of the k × k matrix Y^T L Y subject to the normalization constraints Y^T Λ Y = I. Because of the presence of Λ in the constraint, the columns of Y will not necessarily be orthogonal. The optimal solutions for these k column vectors can be shown to be proportional to the successive (not necessarily orthogonal) right eigenvectors of the asymmetric matrix Λ^{-1} L with increasing eigenvalues. After discarding the first trivial eigenvector e1 with eigenvalue λ1 = 0, this results in a set of k eigenvectors e2, e3 . . . ek+1, with corresponding eigenvalues λ2 ≤ λ3 ≤ . . . ≤ λk+1. Because k eigenvectors were selected, this approach creates an n × k matrix Dk = Y, corresponding to a k-dimensional embedding of each of the n nodes. Note that even though the normalization constraint Y^T Λ Y = I will not result in the columns of Dk having an L2-norm of 1, each column (eigenvector) of Dk is scaled to an L2-norm of 1 as a post-processing step (see footnote 4). Because of this column scaling, the n × k matrix Dk does not exactly reflect the original optimization formulation in terms of Y. The resulting k-dimensional embedding preserves the clustering structure of the nodes because the optimization formulation of Y tries to minimize distances between highly connected nodes. Therefore, any multidimensional clustering algorithm discussed in Chap. 6, such as k-means, can be applied to this embedded representation to generate the clusters of nodes. This formulation is also sometimes referred to as the random walk version of spectral clustering because of an interpretation in terms of random walks. It is noteworthy that the small eigenvectors of the normalized Laplacian Λ^{-1} L are the same as the large eigenvectors of the stochastic transition matrix Λ^{-1} W (see Exercise 15).
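A compact sketch of this random walk version is shown below, using numpy, scipy, and scikit-learn. It assumes a dense, symmetric, nonnegative weight matrix W with no isolated nodes (so that Λ is invertible); it is meant to illustrate the formulation above, not to reproduce any particular library's implementation.

    import numpy as np
    from scipy.linalg import eigh            # dense generalized eigensolver
    from sklearn.cluster import KMeans

    def random_walk_spectral_clustering(W, k):
        # Local densities and Laplacian: Lam_ii = sum_j w_ij, L = Lam - W.
        Lam = np.diag(W.sum(axis=1))
        L = Lam - W
        # Solving L y = lambda * Lam y is equivalent to finding the small
        # eigenvectors of Lam^{-1} L; eigenvalues are returned in ascending order.
        vals, vecs = eigh(L, Lam)
        Dk = vecs[:, 1:k + 1]                # discard the trivial eigenvector
        Dk = Dk / np.linalg.norm(Dk, axis=0, keepdims=True)   # unit-norm columns
        # Any multidimensional clustering method can be applied to the embedding.
        return KMeans(n_clusters=k, n_init=10).fit_predict(Dk)

For example, calling random_walk_spectral_clustering(W, 2) on the weight matrix of a graph with two loosely connected communities typically assigns one label to each community.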

An equivalent way of setting up the spectral clustering model is to use the related vector of decision variables z = Λ^{1/2} y in the optimization formulation of Eqs. 19.25 and 19.26. This related version is referred to as the symmetric version of the spectral clustering model, although it differs from the random walk version only in the scaling of the decision variables. By setting z = Λ^{1/2} y, it can be shown that the earlier formulation is equivalent to optimizing z^T Λ^{-1/2} L Λ^{-1/2} z subject to z^T z = 1. We determine the smallest k (orthogonal) eigenvectors of the symmetric normalized Laplacian Λ^{-1/2} L Λ^{-1/2}, excluding the first. Each eigenvector of this matrix can also be (proportionally) obtained by pre-multiplying the aforementioned solution Y of the random walk formulation with the diagonal matrix Λ^{1/2}. This relationship also reflects the relationship between z and y. The eigenvalues are the same in both cases. For example, the first eigenvector with eigenvalue 0 will no longer be a vector of 1s, but its entries will be proportional to (√Λ11 . . . √Λnn)^T. Because of this differential scaling of the various nodes, high-degree nodes will tend to have larger (absolute) coordinate values in the symmetric version. By selecting the smallest k eigenvectors, one can generate an n × k multidimensional representation Dk of the entire set of n nodes. Just as the random walk version scales each column of Dk to unit norm in the final step, the symmetric version scales each row of Dk to unit norm. This final step of row scaling is a heuristic enhancement that adjusts for the differential scaling of nodes with various degrees, and it is not a part of the optimization formulation. Interestingly, even if the rows of the random walk solution Y had been scaled to unit norm (instead of the columns), exactly the same solution would be obtained as by scaling the rows of the symmetric solution Z to unit norm (see Exercise 13).
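Under the same assumptions as the previous sketch (and reusing its imports), the symmetric version differs only in the matrix that is decomposed and in the final row scaling:

    def symmetric_spectral_clustering(W, k):
        d = W.sum(axis=1)
        L = np.diag(d) - W
        Lam_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        L_sym = Lam_inv_sqrt @ L @ Lam_inv_sqrt               # Lam^{-1/2} L Lam^{-1/2}
        vals, vecs = np.linalg.eigh(L_sym)                    # ascending eigenvalues
        Dk = vecs[:, 1:k + 1]                                 # discard the trivial eigenvector
        Dk = Dk / np.linalg.norm(Dk, axis=1, keepdims=True)   # unit-norm rows
        return KMeans(n_clusters=k, n_init=10).fit_predict(Dk)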


Although the two different ways of performing spectral clustering are equivalent in terms of the optimization problem solved, they differ in their heuristic scaling adjustments. The scaling relationships are illustrated in Fig. 19.8. It is evident from Fig. 19.8 that the main practical difference between the two methods is regulated only by the heuristic scaling used in the final phase, rather than by their respective optimization models. Because of the scaling variations, the clusters obtained in the two cases will not be exactly the same, and the relative quality will depend on the data set at hand. These optimization problems can also be understood as linear programming relaxations of integer-programming formulations of balanced minimum cut problems. However, the minimum-cut explanation does not








Footnote 4: In practice, the unit eigenvectors of Λ^{-1} L can be directly computed, and therefore an explicit post-processing step is not required.







yükləyin