Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	346/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 342 343 344 345 346 347 348 349 ... 423

1-Data Mining tarjima

17.6. GRAPH CLASSIFICATION

583

17.6.1 Distance-Based Methods

Distance-based methods are most appropriate when the sizes of the underlying graphs are small, and the distances can be computed eﬃciently. Nearest neighbor methods and collective classification methods are two of the distance-based methods commonly used for classification. The latter method is a transductive semi- supervised method, in which the training and test instances need to be available at the same time for the classification process. These methods are described in detail below:

Nearest neighbor methods: For each test instance, the k-nearest neighbors are deter-mined. The dominant label from these nearest neighbors is reported as the relevant label. The nearest neighbor method for multidimensional data is described in detail in Sect. 10.8 of Chap. 10. The only modification to the method is the use of a diﬀerent distance function, suited for the graph data type.

Graph-based methods: This is a semi-supervised method, discussed in Sect. 11.6.3 of Chap. 11. In graph-based methods, a higher level neighborhood graph is constructed from the training and test graph objects. It is important not to confuse the notion of a neighborhood graph with that of the original graph objects. The original graph objects correspond to nodes in the neighborhood graph. Each node is connected to its k nearest neighbor objects based on the distance values. This results in a graph containing both labeled and unlabeled nodes. This is the collective classification problem, for which various algorithms are described in Sect. 19.4 of Chap. 19. Collective classification algorithms can be used to derive labels of the nodes in the neighborhood graphs. These derived labels can then be mapped back to the unlabeled graph objects.

Distance-based methods are generally eﬀective when the underlying graph objects are small. For larger graph objects, the computation of distances becomes too expensive. Furthermore, distance computations no longer remain eﬀective from an accuracy perspective, when mul-tiple common substructures are present in the two graphs.

17.6.2 Frequent Substructure-Based Methods

Pattern-based methods extract frequent subgraphs from the data, and use their membership in diﬀerent graphs, in order to build classification models. As in the case of clustering, the main assumption is that the frequently occurring portions of graphs can related to application-specific properties of the graphs. For example, the phenolic acids of Fig. 17.10 are characterized by the two frequent substructures corresponding to the carboxyl group and the phenol group. These substructures therefore characterize important properties of a family or a class of compounds. This is generally true across many diﬀerent applications beyond the chemical domain. As discussed in Sect. 10.4 of Chap. 10, frequent patterns are often used for rule-based classification, even in the “flat” multidimensional domain. As in the case of clustering, either a generic transformational approach or a more direct rule-based method can be used.

17.6.2.1 Generic Transformational Approach

This approach is generally similar to the transformational approach discussed in the previous section on clustering. However, there are a few diﬀerences that account for the impact of supervision. The broad approach may be described as follows:

584 CHAPTER 17. MINING GRAPH DATA

Apply frequent subgraph mining methods discussed in Sect. 17.4 to discover frequent subgraph patterns in the underlying graphs. Select a subset of subgraphs to reduce overlap among the diﬀerent subgraphs. For example, feature selection algorithms that minimize redundancy and maximize the relevance of the features may be used. Such feature selection algorithms are discussed in Sect. 10.2 of Chap. 10. Let d be the total number of frequent subgraphs (features). This is the “lexicon” size in terms of which a text-like representation will be constructed.

For each graph G_i, create a vector-space representation in terms of the d features found. Each graph contains the features corresponding to the subgraphs that it con-tains. The frequency of each feature is equal to the number of occurrences of the corresponding subgraph in graph G_i. It is also possible to use a binary representa-tion by only considering presence or absence of subgraphs, rather than frequency of presence. Use tf-idf normalization on the vector-space representation, as discussed in Chap. 13.

Select any text classification algorithm discussed in Sect. 13.5 of Chap. 13 to build a classification model. Use the model to classify test instances.

This approach provides a flexible framework. After the transformation has been performed, a wide variety of algorithms may be used. It also allows the use of diﬀerent types of supervised feature selection methods to ensure that the most discriminative structures are used for classification.

17.6.2.2 XRules: A Rule-Based Approach

The XRules method was proposed in the context of XML data, but it can be used in the context of any graph database. This is a rule-based approach that relates frequent substructures to the diﬀerent classes. The training phase contains three steps:

In the first phase, frequent substructures with suﬃcient support and confidence are determined. Each rule is of the form:

Yüklə 17,13 Mb.

Dostları ilə paylaş: