Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	336/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 332 333 334 335 336 337 338 339 ... 423

1-Data Mining tarjima

CORRESPONDENCE BETWEEN NODES E
ISOMORPHISM OF G 1

G₁	DASHED LINES INDICATE			G₁	DASHED LINES INDICATE

CORRESPONDENCE BETWEEN NODES			CORRESPONDENCE BETWEEN NODES
E			G₂	E			G₂

D	A		A	D	A		A
	B		B		B		B
C	A		A	C	A		A
	G₂ IS A SUBGRAPH				G₂ IS A SUBGRAPH
		ISOMORPHISM OF G₁				ISOMORPHISM OF G₁
	(a)				(b)

Figure 17.3: Two possible subgraph isomorphisms between a pair of graphs

The definition of subgraph isomorphism in this section assumes that all edges of the node-induced subgraph of the data graph are present in the query graph. In some applications, such as frequent subgraph mining, a more general definition is used, in which any subset of edges of the node-induced subgraph is also considered a subgraph isomorphism. The more general case can be handled with minor changes to the algorithm in this section. Note that the aforementioned definition allows the subgraph G_q (or G) to be disconnected. However, for practical applications, one is usually interested only in connected subgraph isomorphisms. Examples of two possible subgraph matchings between a pair of nodes are illustrated in Fig. 17.3. The figure also illustrates that there are two diﬀerent ways for one graph to be a subgraph of the other. The problem of exact matching is a special case of subgraph matching. Therefore, the problem of subgraph matching is NP-hard as well.

Lemma 17.2.2 The problem of subgraph matching is NP-hard.

Subgraph matching is often used as a subroutine in applications such as frequent pattern mining. While the subgraph matching problem is a generalization of exact matching, the problem can be generalized even further to that of finding the maximum common subgraph (MCG) between a pair of graphs. This is because the MCG between two graphs is at most equal to the smaller of the two graphs when it is a subgraph of the larger one. The MCG or maximum common isomorphism between a pair of graphs is defined as follows.

Definition 17.2.4 (Maximum Common Subgraph) A MCG between a pair of graphs G₁ = (N₁, A₁) and G₂ = (N₂, A₂) is a graph G₀ = (N₀, A₀) that is a subgraph isomorphism of both G₁ and G₂, and for which the size of the node set N₀ is as large as possible.

Because the MCG problem is a generalization of the graph isomorphism problem, it is NP-hard as well. In this section, algorithms for discovering subgraph isomorphisms and maximum common subgraphs will be presented. Subsequently, the relationship of these algorithms to that of distance computation between graphs will be discussed. Subgraph isomorphism algorithms can be designed to determine either all the subgraph isomorphisms between a query graph and a data graph, or a fast algorithm can be designed to determine whether or not at least one isomorphism exists.

562 CHAPTER 17. MINING GRAPH DATA

17.2.1 Ullman’s Algorithm for Subgraph Isomorphism

Ullman’s algorithm is designed to determine all possible subgraph isomorphisms between a query graph and a data graph. It can also be used for the decision problem of determining whether or not a query graph is a subgraph isomorphism of a data graph by using an early termination criterion. Interestingly, the majority of the later graph matching algorithms are refinements of Ullman’s algorithm. Therefore, this section will first present a very simplified version of Ullman’s algorithm without any refinements. Subsequently, the diﬀerent variations and refinements of this basic algorithm will be discussed in a separate subsection. Although the definition of subgraph isomorphisms allows the query (or data) graph to be disconnected, it is often practical and computationally expedient to focus on cases where the query and data graph are connected. Typically, small changes to the algorithm can accommodate both cases (see Exercise 14).

It will be assumed that the query graph is denoted by G_q = (N_q, A_q), and the data graph is denoted by G = (N, A). The first step in Ullman’s algorithm is to match all possible pairs of nodes across the two graphs so that each node in the pair has the same label as the other. For each such matching pair, the algorithm expands it a node at a time with the use of a recursive search procedure. Each recursive call expands the matching subgraphs in G_q and G by one node. Therefore, one of the parameters of the recursive call is the current matching set M of node -pairs. Each element of M is a pair of matching nodes between G_q and G. Therefore, when a subgraph of m nodes has been matched between the two graphs, the set M contains m matched node-pairs as follows:

= {(i^q₁, i₁), (i^q₂, i₂), . . . (i^q_m, i_m)}

Here, it is assumed that the node iq_r belongs to the query graph G_q and that the node i_r belongs to the data graph G. The value of the matching set parameter M is initialized to the empty set at the top-level recursive call. The number of matched nodes in M is exactly equal to the depth of the recursive call. The recursion backtracks when either the subgraphs cannot be further matched or when G_q has been fully matched. In the latter case, the matching set M is reported, and the recursion backtracks to the next higher level to discover other matchings. In cases where it is not essential to determine all possible matchings between the pair of graphs, it is also possible to terminate the algorithm at this point. This particular exposition, however, assumes that all possible matchings need to be determined.

A simplified version of Ullman’s algorithm is illustrated in Fig. 17.4. The algorithm is structured as a recursive approach that explores the space of all possible matchings between the two graphs. The input to the algorithm is the query graph G_q and the data graph G. An additional parameter M of this recursive call is a set containing the current matching node-pairs. While the set M is empty at the top-level call made by the analyst, this is not the case at lower levels of the recursion. The cardinality of M is exactly equal to the depth of the recursion. This is because one matching node-pair is added to M in each recursive call. Strictly speaking, the recursive call returns all the subgraph isomorphisms under the constraint that the matching corresponding to M must be respected.

The first step of the recursive procedure is to check whether the size of M is equal to the number of nodes in the query graph G_q. If this is indeed the case, then the algorithm reports M as a successful subgraph matching and backtracks out of the recursion to the next higher level to explore other matchings. Otherwise, the algorithm tries to determine further matching node-pairs to add to M. This is the candidate generation step. In this

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 332 333 334 335 336 337 338 339 ... 423