Data Mining: The Textbook




U(G1, G2) = |G1| + |G2| − 2 · |MCS(G1, G2)|    (17.1)

This is equal to the number of non-matching nodes between the two graphs because it subtracts out the number of matching nodes |MCS(G1, G2)| from each of |G1| and |G2| and then adds them up. This measure is unnormalized because the value of the distance depends on the raw size of the underlying graphs. This is undesirable because it makes it more difficult to compare distances between pairs of graphs of varying size. This measure is more effective when the different graphs in the collection are of approximately similar size.


2. Union-normalized distance: This distance measure lies in the range [0, 1], and is also shown to be a metric. The union-normalized measure UDist(G1, G2) is defined as follows:

UDist(G1, G2) = 1 − |MCS(G1, G2)| / (|G1| + |G2| − |MCS(G1, G2)|)    (17.2)

This measure is referred to as the union-normalized distance because the denominator contains the number of nodes in the union of the two graphs. A different way of understanding this measure is that it normalizes the number of non-matching nodes U(G1, G2) between the two graphs (the unnormalized measure) with the number of nodes in the union of the two graphs:





UDist(G1, G2) = (Number of non-matching nodes between G1 and G2) / (Size of the union of G1 and G2)
One advantage of this measure is that it is intuitively easier to interpret. Two perfectly matching graphs will have a distance of 0 from one another, and two perfectly non-matching graphs will have a distance of 1.


3. Max-normalized distance: This distance measure also lies in the range [0, 1]. The max-normalized distance MDist(G1, G2) between two graphs G1 and G2 is defined as follows:

MDist(G1, G2) = 1 − |MCS(G1, G2)| / max{|G1|, |G2|}    (17.3)

The main difference from the union-normalized distance is that the denominator is normalized by the maximum size of the two graphs. This distance measure is a metric because it satisfies the triangle inequality. The measure is also relatively easy to interpret. Two perfectly matching graphs will have a distance of 0 from one another, and two perfectly non-matching graphs will have a distance of 1.


These distance measures can be computed effectively only for small graphs. For larger graphs, it becomes computationally too expensive to evaluate these measures because of the need to determine the maximum common subgraph between the two graphs.
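As a concrete illustration, all three measures can be computed directly once |G1|, |G2|, and |MCS(G1, G2)| are available. The following minimal Python sketch assumes that the MCS size has already been obtained by a separate routine (such as the recursive algorithm of Fig. 17.5); the example values are hypothetical.

    # MCS-based distance measures of Eqs. 17.1-17.3.
    # n1, n2: number of nodes in G1 and G2; mcs: number of nodes in MCS(G1, G2).

    def unnormalized_dist(n1, n2, mcs):
        return n1 + n2 - 2 * mcs                     # Eq. 17.1

    def union_normalized_dist(n1, n2, mcs):
        return 1.0 - mcs / (n1 + n2 - mcs)           # Eq. 17.2

    def max_normalized_dist(n1, n2, mcs):
        return 1.0 - mcs / max(n1, n2)               # Eq. 17.3

    # Hypothetical example: |G1| = 6, |G2| = 8, |MCS(G1, G2)| = 5.
    print(unnormalized_dist(6, 8, 5))                # 4 non-matching nodes
    print(round(union_normalized_dist(6, 8, 5), 3))  # 0.444
    print(round(max_normalized_dist(6, 8, 5), 3))    # 0.375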










Figure 17.6: Example of two possible edit paths between graphs G1 and G2. The upper path transforms the source graph G1 (with node labels A, B, and C) into the destination graph G2 using a node deletion followed by a node label substitution; the lower path uses a node deletion, an edge deletion, and an edge insertion.


17.2.3.2 Graph Edit Distance


The graph edit distance is analogous to the string edit distance, discussed in Chap. 3. The main difference is that the edit operations are specific to the graph domain. The edit distances can be applied to the nodes, the edges, or the labels. In the context of graphs, the admissible operations include (a) the insertion of nodes, (b) the deletion of nodes, (c) the label-substitution of nodes, (d) the insertion of edges, and (e) the deletion of edges. Note that the deletion of a node includes automatic deletion of all its incident edges. Each edit operation has an edit cost associated with it that is defined in an application-specific manner. In fact, the problem of learning edit costs is a challenging issue in its own right. For example, one way of learning edit costs is to use supervised distance function learning methods discussed in Chap. 3. The bibliographic notes contain pointers to some of these algorithms.


An example of two possible edit paths between graphs G1 and G2 is illustrated in Fig. 17.6. Note that the two paths will have different costs, depending on the costs of the constituent operations. For example, if the cost of label-substitution is very high compared to that of edge insertions and deletions, it may be more effective to use the second (lower) path in Fig. 17.6. For large and complex pairs of graphs, an exponential number of possible edit paths may exist. The edit distance Edit(G1, G2) between two graphs is equal to the minimum cost of transforming graph G1 to G2 with a series of edit operations.


Definition 17.2.5 (Graph Edit Distance) The graph edit distance Edit(G1, G2) is the minimum cost of the edit operations to be applied to graph G1 in order to transform it to graph G2.


Depending on the costs of the different operations, the edit distance is not necessarily symmetric. In other words, Edit(G1, G2) can be different from Edit(G2, G1). Interestingly, the edit distance is closely related to the problem of determining MCGs. In fact, for some special choices of costs, the edit distance can be shown to be equivalent to distance measures based on the maximum common subgraph. This implies that the edit-distance computation for graphs is NP-hard as well. The edit distance can be viewed as the cost of an error-tolerant graph isomorphism, where the “errors” are quantified in terms of the cost of edit operations. As discussed in Chap. 3, the edit-distance computation for strings and sequences can be solved polynomially using dynamic programming. The case of graphs is more difficult because it belongs to the class of NP-hard problems.




The close relationship between edit-distance computation and the MCG problem is reflected in the similar structure of their corresponding algorithms. As in the case of the maximum common subgraph problem, a recursive tree-search procedure can be used to compute the edit distance. In the following, the basic procedure for computing the edit distance will be described. The bibliographic notes contain pointers to various enhancements of this procedure.


An interesting property of the edit distance is that it can be computed by exploring only those edit sequences in which all node insertion operations (together with their incident edge insertions) are performed at the end of the edit sequence. Therefore, the edit-distance algorithm maintains a sequence of edits E containing the operations to be applied to graph G1 to transform it into a subgraph isomorphism G1' of the graph G2. By trivially adding the unmatched nodes of G2 and the corresponding incident edges to G1' as a final step, it is possible to create G2. Therefore, the initial part of the sequence E, without the last step, does not contain any node insertions at all. In other words, the initial part of the sequence E may contain node deletions, node label-substitutions, edge insertions, and edge deletions. An example of such an edit sequence is as follows:





E = Delete(i1), Insert(i2, i5), Label-Substitute(i4, A ⇒ C), Delete(i2, i6)

This edit sequence illustrates the deletion of a node, followed by the addition of the new edge (i2, i5). The label of node i4 is then substituted from A to C. Finally, the edge (i2, i6) is deleted. The total cost of an edit sequence E from G1 to a subgraph isomorphism G1' of G2 is equal to the sum of the edit costs of all the operations in E, together with the cost of the node insertions and incident edge insertions that need to be performed on G1' to create the final graph G2.
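To make the bookkeeping concrete, the following minimal Python sketch represents an edit sequence as a list of operations and sums their application-specific costs. The cost table is purely an assumption for illustration, since edit costs are defined in an application-specific manner.

    # Hypothetical, application-specific edit costs.
    EDIT_COSTS = {
        "node_delete": 1.0,
        "edge_insert": 0.5,
        "edge_delete": 0.5,
        "label_substitute": 2.0,
    }

    # The edit sequence E from the example above.
    edit_sequence = [
        ("node_delete", ("i1",)),
        ("edge_insert", ("i2", "i5")),
        ("label_substitute", ("i4", "A", "C")),
        ("edge_delete", ("i2", "i6")),
    ]

    def sequence_cost(edits, costs=EDIT_COSTS):
        """Total cost of an edit sequence: the sum of its per-operation costs."""
        return sum(costs[op] for op, _ in edits)

    print(sequence_cost(edit_sequence))  # 1.0 + 0.5 + 2.0 + 0.5 = 4.0

The full cost of transforming G1 into G2 would additionally include the trivial node and incident-edge insertions performed at the end, as described in the text.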


The correctness of such an approach relies on the fact that it is always possible to arrange the optimal edit path sequence so that the insertion of the nodes and their incident edges is performed after all other edge operations, node deletions, and label-substitutions that transform G1 to a subgraph isomorphism of G2. The proof of this property follows from the fact that any optimal edit sequence can be reordered to push the insertion of nodes (and their incident edges) to the end, as long as an inserted node is not associated with any other edit operations (node or incident edge deletions, or label-substitutions). It is also easy to show that any edit path in which newly added nodes or edges are deleted will be suboptimal. Furthermore, an inserted node never needs to be label-substituted in an optimal path because the correct label can be set at the time of node insertion.


The overall recursive procedure is illustrated in Fig. 17.7. The inputs to the algorithm are the source and target graphs G1 and G2, respectively. In addition, the current edit sequence E being examined for further extension, and the best (lowest cost) edit sequence Ebest found so far, are among the input parameters to the algorithm. These input parameters are useful for passing data between recursive calls. The value of E is initialized to null in the top-level call. While the value of E is null at the beginning of the algorithm, new edits are appended to it in each recursive call. Further recursive calls are executed with this extended sequence as the input parameter. The value of the parameter Ebest at the top-level call is set to a trivial sequence of edit operations in which all nodes of G1 are deleted and then all nodes and edges of G2 are added.

The recursive algorithm first discovers the sequence of edits E that transforms the graph G1 to a subgraph isomorphism G1' of G2. After this phase, the trivial sequence of node/edge insertion edits that converts G1' to G2 is padded at the end of E. This step is shown in Fig. 17.7 just before the return condition in the recursive call. Because of this final padding step, the cost of these trivial edits is always included in the cost of the edit sequence E, which is denoted by Cost(E).




Algorithm EditDistance(Graphs: G1, G2, Current Partial Edit Sequence: E,
                       Best Known Edit Sequence: Ebest)
begin
  if (G1 is a subgraph isomorphism of G2) then
  begin
    Add insertion edits to E that convert G1 to G2;
    return(E);
  end;
  C = Set of all possible edits to G1 excluding node insertions;
  Prune C using heuristic methods; (optional efficiency optimization)
  for each edit operation e ∈ C do
  begin
    Apply e to G1 to create G1';
    Append e to E to create E';
    Ecurrent = EditDistance(G1', G2, E', Ebest);
    if (Cost(Ecurrent) < Cost(Ebest)) then Ebest = Ecurrent;
  end for;
  return(Ebest);
end

Figure 17.7: Graph edit distance algorithm




The overall structure of the algorithm is similar to that of the MCG algorithm of Fig. 17.5. In each recursive call, it is first determined if G1 is a subgraph isomorphism of G2. If so, the algorithm immediately returns the current set of edits E after the incorporation of trivial node or edge insertions that can transform G1 to G2. If G1 is not a subgraph isomorphism of G2, then the algorithm proceeds to extend the partial edit path E. A set of candidate edits C is determined, which, when applied to G1, might reduce the distance to G2. In practice, these candidate edits C are determined heuristically because the problem of knowing the precise impact of an edit on the distance is almost as difficult as that of computing the edit distance. The simplest way of choosing the candidate edits is to consider all possible unit edits excluding node insertions. These candidate edits might be node deletions, label-substitutions, and edge operations (both insertions and deletions). For a graph with n nodes, the total number of node-based candidate operations is O(n), and the number of edge-based candidate operations is O(n²). It is possible to heuristically prune many of these candidate edits if it can be immediately determined that such edits can never be part of an optimal edit path. In fact, some of the pruning steps are essential to ensure finite termination of the algorithm. Some key pruning steps are as follows:





  1. An edge insertion cannot be appended to the current partial edit sequence E, if an edge deletion operation between the same pair of nodes already exists in the current partial edit path E. Similarly, an edge which was inserted earlier cannot be deleted. An optimal edit path can never include such pairs of edits with zero net effect. This pruning step is necessary to ensure finite termination.




2. The label of a node cannot be substituted, if the label-substitution of that node exists in the current partial edit path E. Repetitive label-substitutions of the same node are obviously suboptimal.






3. An edge can be inserted between a pair of nodes in G1, only if at least one edge exists in G2 between two nodes with the same labels.




4. A candidate edit should not be considered if adding it to E would immediately increase the cost of E beyond that of Ebest.

5. Many other sequencing optimizations are possible for prioritizing between candidate edits. For example, all node deletions can be performed before all label-substitutions. It can be shown that the optimal edit sequence can always be arranged in this way. Similarly, label-substitutions that move the overall distribution of labels closer to that of the target graph may be considered first. In general, it is possible to associate a “goodness function” with an edit, which heuristically quantifies its likelihood of leading to a good edit path when included in E. Finding good edit paths early ensures better pruning performance according to the aforementioned criterion (4).

The main difference among various recursive search algorithms lies in the use of different heuristics for candidate ordering and pruning. Readers are referred to the bibliographic notes at the end of this chapter for pointers to some of these methods. After the pruned candidate edits have been determined, each of these is applied to G1 to create G1'. The procedure is recursively called with the pair (G1', G2) and an augmented edit sequence E'. This procedure returns the best edit sequence Ecurrent that has E' as a prefix. If the cost of Ecurrent is better than that of Ebest (including the trivial post-processing insertion edits needed for a full matching), then Ebest is updated to Ecurrent. At the end of the procedure, Ebest is returned.


The procedure is guaranteed to terminate because repetitions in node label-substitutions and edge insertions/deletions are avoided in E by the pruning steps. Furthermore, the number of nodes in the edited graph is monotonically non-increasing as more edits are appended to E. This is because E does not contain node insertions except at the end of the recursion. For a graph with n nodes, there are at most n² non-repeating edge insertions and deletions and O(n) node deletions and label-substitutions that can be performed. Therefore, the recursion has a finite depth of O(n²), which is also equal to the maximum length of E. This approach has exponential complexity in the worst case. Edit distances are generally expensive to compute, unless the underlying graphs are small.
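For small graphs, the edit distance can be checked experimentally with an off-the-shelf implementation. The sketch below assumes that the networkx library is available; its internal search differs in detail from the procedure of Fig. 17.7, but it computes the same minimum-cost edit distance, here with unit operation costs and label-aware node matching. The two toy graphs are hypothetical.

    import networkx as nx

    # Source graph G1: a 4-cycle with node labels A, B, B, C.
    G1 = nx.Graph()
    G1.add_nodes_from([(1, {"label": "A"}), (2, {"label": "B"}),
                       (3, {"label": "B"}), (4, {"label": "C"})])
    G1.add_edges_from([(1, 2), (2, 3), (3, 4), (4, 1)])

    # Destination graph G2: a path with node labels A, B, C.
    G2 = nx.Graph()
    G2.add_nodes_from([(1, {"label": "A"}), (2, {"label": "B"}),
                       (3, {"label": "C"})])
    G2.add_edges_from([(1, 2), (2, 3)])

    # Unit costs for all edit operations; nodes match only if their labels agree.
    distance = nx.graph_edit_distance(
        G1, G2, node_match=lambda a, b: a["label"] == b["label"])
    print(distance)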


17.3 Transformation-Based Distance Computation

The main problem with the distance measures of the previous section is that they are computationally impractical for larger graphs. A number of heuristic and kernel-based methods are used to transform the graphs into a space in which distance computations are more efficient. Interestingly, some of these methods are also qualitatively more effective because of their ability to focus on the relevant portions of the graphs.


17.3.1 Frequent Substructure-Based Transformation and Distance Computation


The intuition underlying this approach is that frequent graph patterns encode key properties of the graph. This is true of many applications. For example, the presence of a benzene ring (see Fig. 17.1) in a chemical compound will typically result in specific properties. Therefore, the properties of a graph can often be described by the presence of specific families of structures in it. This intuition suggests that a meaningful way of semantically describing the graph is in terms of its family of frequent substructures. Therefore, a transformation approach is used in which a text-like vector-space representation is created from each graph. The steps are as follows:





  1. Apply frequent subgraph mining methods discussed in Sect. 17.4 to discover frequent subgraph patterns in the underlying graphs. This results in a “lexicon” in terms of which the graphs are represented. Unfortunately, the size of this lexicon is rather large, and many subgraphs may be redundant because of similarity to one another.




2. Select a subset of subgraphs from the subgraphs found in the first step to reduce the overlap among the frequent subgraph patterns. Different algorithms may vary in this step by using only frequent maximal subgraphs, or selecting a subset of graphs that are sufficiently nonoverlapping with one another. Create a new feature fi for each frequent subgraph Si that is finally selected. Let d be the total number of frequent subgraphs (features). This is the lexicon size in terms of which a text-like representation will be constructed.




3. For each graph Gi, create a vector-space representation in terms of the features f1 . . . fd. Each graph contains the features corresponding to the subgraphs that it contains. The frequency of each feature is the number of occurrences of the corresponding subgraph in the graph Gi. It is also possible to use a binary representation by only considering the presence or absence of subgraphs, rather than the frequency of presence. The tf-idf normalization may be used on the vector-space representation, as discussed in Chap. 13.

After the transformation has been performed, any of the text similarity functions can be used to compute distances between graph objects. One advantage of using this approach is that it can be paired up with a conventional text index, such as the inverted index, for efficient retrieval. The bibliographic notes contain pointers to some of these methods.
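A minimal sketch of this transformation is shown below. It assumes that the frequent-subgraph lexicon and the per-graph occurrence counts have already been produced by a frequent subgraph miner (Sect. 17.4); the feature names and counts are hypothetical. Any text similarity function can then be applied to the resulting vectors, as with the cosine measure here.

    import math

    lexicon = ["S1", "S2", "S3", "S4"]   # selected frequent subgraphs (features)

    # counts[g][f]: number of occurrences of subgraph f in graph g (hypothetical).
    counts = {
        "G1": {"S1": 2, "S2": 0, "S3": 1, "S4": 3},
        "G2": {"S1": 1, "S2": 1, "S3": 0, "S4": 2},
    }

    def to_vector(graph_id):
        """Vector-space representation of a graph over the subgraph lexicon."""
        return [counts[graph_id].get(f, 0) for f in lexicon]

    def cosine_similarity(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm > 0 else 0.0

    print(cosine_similarity(to_vector("G1"), to_vector("G2")))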


This broader approach can also be used for feature transformation. Therefore, any data mining algorithm from the text domain can be applied to graphs using this approach. Later, it will be discussed how this transformation approach can be used in a more direct way by graph mining algorithms such as clustering. The main disadvantage of this approach is that subgraph isomorphism is an intermediate step in frequent substructure discovery. Therefore, the approach has exponential complexity in the worst case. Nevertheless, many fast approximations are often used to provide more efficient results without a significant loss in accuracy.


17.3.2 Topological Descriptors

Topological descriptors convert structural graphs to multidimensional data by using quantitative measures of important structural characteristics as dimensions. After the conversion has been performed, multidimensional data mining algorithms can be used on the transformed representation. This approach enables the use of a wide variety of multidimensional data mining algorithms in graph-based applications. The drawback of the approach is that the structural information is lost. Nevertheless, topological descriptors have been shown to retain important properties of graphs in the chemical domain, and are therefore used quite frequently. In general, the utility of topological descriptors in graph mining is highly domain specific. It should be pointed out that topological descriptors share a number of conceptual similarities with the frequent subgraph approach in the previous section. The main difference is that carefully chosen topological parameters are used to define the new feature space instead of frequent subgraphs.

Figure 17.8: The Hosoya index for a clique of four nodes


Most topological descriptors are graph-specific, whereas a few are node-specific. The vector of node-specific descriptors can sometimes describe the graph quite well. Node-specific descriptors can also be used for enriching the labels of the nodes. Some common examples of topological descriptors are as follows:





  1. Morgan index: This is a node-specific index that is equal to the kth order degree of a node. In other words, the descriptor is equal to the number of nodes reachable from the node within a distance of k. This is one of the few descriptors that describes nodes, rather than the complete graph. The node-specific descriptors can also be converted to a graph-specific descriptor by using the frequency histogram of the Morgan index over different nodes.




2. Wiener index: The Wiener index is equal to the sum of the pairwise shortest-path distances between all pairs of nodes. It is therefore required to compute the all-pairs shortest-path distances between the different pairs of nodes:

W(G) = Σ_{i,j ∈ G} d(i, j)    (17.4)

The Wiener index has known relationships with the chemical properties of compounds. The motivating reason for this index was the fact that it was known to be closely correlated with the boiling points of alkane molecules [511]. Later, the relationship was also shown for other properties of some families of molecules, such as their density, surface tension, viscosity, and van der Waals surface area. Subsequently, the index has also been used for applications beyond the chemical domain.





3. Hosoya index: The Hosoya index is equal to the number of valid pairwise node matchings in the graph. Note that the word “matching” refers to node–node matching within the same graph, rather than graph–graph matching. The matchings do not need to be maximal matchings, and even the empty matching is counted as one of the possibilities. The determination of the Hosoya index is #P-complete because an exponential number of possible matchings may exist in a graph, especially when it is dense. For example, as illustrated in Fig. 17.8, the Hosoya index for a complete graph (clique) of only four nodes is 10. The Hosoya index is also referred to as the Z-index.




4. Estrada index: This index is particularly useful in chemical applications for measuring the degree of protein folding. If λ1 . . . λn are the eigenvalues of the adjacency matrix of graph G, then the Estrada index E(G) is defined as follows:

E(G) = Σ_{i=1}^{n} e^{λi}    (17.5)





5. Circuit rank: The circuit rank C(G) is equal to the minimum number of edges that need to be removed from a graph in order to remove all cycles. For a graph with m edges, n nodes, and k connected components, this number is equal to (m − n + k). The circuit rank is also referred to as the cyclomatic number. The cyclomatic number provides insights into the connectivity level of the graph.




6. Randic index: The Randic index is equal to the pairwise sum of bond contributions. If νi is the degree of vertex i, then the Randic index R(G) is defined as follows:

R(G) = Σ_{i,j ∈ G} 1/√(νi · νj)    (17.6)

The Randic index is also known as the molecular connectivity index. This index is often used in the context of larger organic chemical compounds in order to evaluate their connectivity. The Randic index can be combined with the circuit rank C(G) to yield the Balaban index B(G):

B(G) = m · R(G) / (C(G) + 1)    (17.7)

Here, m is the number of edges in the network.


Most of these indices have been used quite frequently in the chemical domain because of their ability to capture different properties of chemical compounds.
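The following minimal Python sketch computes several of the descriptors above for a toy graph. It assumes that the networkx and numpy libraries are available; for large graphs, specialized implementations would be preferable.

    import math
    import networkx as nx
    import numpy as np

    G = nx.cycle_graph(5)   # toy graph: a cycle on 5 nodes

    # Morgan index of order k: number of nodes reachable within distance k of each node.
    sp = dict(nx.all_pairs_shortest_path_length(G))
    morgan_2 = {v: sum(1 for d in sp[v].values() if 0 < d <= 2) for v in G}

    # Wiener index (Eq. 17.4): sum of shortest-path distances over all node pairs.
    wiener = sum(sp[i][j] for i in G for j in G if i < j)

    # Circuit rank: m - n + k for m edges, n nodes, and k connected components.
    circuit_rank = (G.number_of_edges() - G.number_of_nodes()
                    + nx.number_connected_components(G))

    # Randic index (Eq. 17.6): bond contributions summed over the edges.
    randic = sum(1.0 / math.sqrt(G.degree(i) * G.degree(j)) for i, j in G.edges())

    # Balaban index (Eq. 17.7).
    balaban = G.number_of_edges() * randic / (circuit_rank + 1)

    # Estrada index (Eq. 17.5): sum of exp(eigenvalue) over the adjacency spectrum.
    eigenvalues = np.linalg.eigvalsh(nx.to_numpy_array(G))
    estrada = float(np.exp(eigenvalues).sum())

    print(morgan_2, wiener, circuit_rank, randic, balaban, estrada)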


17.3.3 Kernel-Based Transformations and Computation


Kernel-based methods can be used for faster similarity computation than is possible with methods such as MCG-based or edit-based measures. Furthermore, these similarity computation methods can be used directly with support vector machine (SVM) classifiers. This is one of the reasons that kernel methods are very popular in graph classification.


Several kernels are used frequently in the context of graph mining. The following contains a discussion of the more common ones. The kernel similarity K(Gi, Gj) between a pair of graphs Gi and Gj is the dot product of the two graphs after hypothetically transforming them to a new space, defined by the function Φ(·).





K(Gi, Gj) = Φ(Gi) · Φ(Gj)    (17.8)

In practice, the value of Φ(·) is not defined directly. Rather, it is defined indirectly in terms of the kernel similarity function K(·, ·). There are various ways of defining the kernel similarity.


17.3.3.1 Random Walk Kernels

In random walk kernels, the idea is to compare the label sequences induced by random walks in the two graphs. Intuitively, two graphs are similar if many sequences of labels created by random walks between pairs of nodes are similar as well. The main computational challenge is that there is an exponential number of possible random walks between pairs of nodes. Therefore, the first step is to define a primitive kernel function k(s1, s2) between a pair of node sequences s1 (from G1) and s2 (from G2). The simplest kernel is the identity kernel:





k(s1, s2) = I(s1 = s2)    (17.9)

Here, I(·) is an indicator function that takes on the value of 1 when the two label sequences are identical, and 0 otherwise.
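The following toy Monte Carlo sketch illustrates the idea by sampling short random walks from two small labeled graphs and averaging the identity kernel of Eq. 17.9 over the sampled pairs. This is only an illustration of the intuition; practical random walk kernels avoid explicit enumeration or sampling of walks. The graphs, walk length, and sample count are assumptions.

    import random
    import networkx as nx

    def random_walk_labels(G, length, rng):
        """Label sequence of a random walk of the given length, from a random start node."""
        node = rng.choice(list(G.nodes()))
        labels = [G.nodes[node]["label"]]
        for _ in range(length):
            neighbors = list(G.neighbors(node))
            if not neighbors:
                break
            node = rng.choice(neighbors)
            labels.append(G.nodes[node]["label"])
        return tuple(labels)

    def sampled_walk_kernel(G1, G2, length=3, samples=2000, seed=0):
        """Fraction of sampled walk pairs with identical label sequences (Eq. 17.9)."""
        rng = random.Random(seed)
        matches = sum(
            random_walk_labels(G1, length, rng) == random_walk_labels(G2, length, rng)
            for _ in range(samples))
        return matches / samples

    G1 = nx.Graph()
    G1.add_nodes_from([(1, {"label": "A"}), (2, {"label": "B"}), (3, {"label": "C"})])
    G1.add_edges_from([(1, 2), (2, 3)])

    G2 = nx.Graph()
    G2.add_nodes_from([(1, {"label": "A"}), (2, {"label": "B"}), (3, {"label": "B"})])
    G2.add_edges_from([(1, 2), (2, 3), (3, 1)])

    print(sampled_walk_kernel(G1, G2))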



