Data Mining: The Textbook




Fg ⇒ c

The notation Fg denotes a frequent substructure, and c is a class label. Many measures other than the confidence can be used to quantify the strength of the rule. Examples include the likelihood ratio, or the cost-weighted confidence in the rare-class scenario. The likelihood ratio of Fg ⇒ c is the ratio of the fractional support of Fg in the examples belonging to class c to the fractional support of Fg in the examples not belonging to class c. A likelihood ratio greater than one indicates that the substructure is more likely to occur in examples of class c than in the remaining classes. The generic term for these different ways of measuring the class-specific relevance is the rule strength.
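The two strength measures above can be sketched in a few lines of code. This is an illustrative sketch, not part of XRules itself: each graph is reduced to a boolean flag indicating whether it contains the substructure Fg (in practice, a subgraph-isomorphism test would supply this predicate), and the function names are hypothetical.

```python
def confidence(contains_fg, labels, c):
    # Fraction of the graphs containing Fg that carry the class label c.
    covered = [y for has, y in zip(contains_fg, labels) if has]
    return sum(1 for y in covered if y == c) / len(covered)

def likelihood_ratio(contains_fg, labels, c):
    # Fractional support of Fg among class-c graphs, divided by its
    # fractional support among the remaining graphs.
    in_c = [has for has, y in zip(contains_fg, labels) if y == c]
    rest = [has for has, y in zip(contains_fg, labels) if y != c]
    return (sum(in_c) / len(in_c)) / (sum(rest) / len(rest))

# Toy data: 5 graphs, flags indicating containment of Fg, and class labels.
contains = [True, True, True, False, True]
labels = ['c', 'c', 'c', 'd', 'd']
print(confidence(contains, labels, 'c'))        # 0.75
print(likelihood_ratio(contains, labels, 'c'))  # 2.0
```

A likelihood ratio of 2.0 here reflects that Fg occurs in all class-c graphs but in only half of the remaining graphs.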





  2. In the second phase, the rules are ordered and pruned. The rules are ordered by decreasing strength. Statistical thresholds on the rule strength may be used for pruning rules with low strength. This yields a compact set R of ordered rules that are used for classification.




  3. In the final phase, a default class is set that can be used to classify test instances not covered by any rule in R. The default class is set to the dominant class of the set of training instances not covered by the rule set R. A graph is covered by a rule if the left-hand side of the rule is a substructure of the graph. In the event that all training instances are covered by the rule set R, the default class is set to the dominant class in the entire training data. In cases where classes are associated with costs, the cost-sensitive weight is used in determining the majority class.
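The last two training phases can be sketched as follows. This is a hypothetical simplification: rules are plain dictionaries with an assumed strength field, and a graph is modeled as a set of edges so that set containment stands in for the substructure test.

```python
from collections import Counter

def train_rule_model(rules, graphs, labels, min_strength=1.0):
    # Phase 2: order rules by decreasing strength and prune the weak ones.
    ordered = sorted((r for r in rules if r['strength'] >= min_strength),
                     key=lambda r: -r['strength'])
    # Phase 3: default class = dominant class among training graphs covered
    # by no rule. A rule covers a graph if its antecedent is a substructure;
    # here, set containment over edge sets stands in for that test.
    uncovered = [y for g, y in zip(graphs, labels)
                 if not any(r['antecedent'] <= g for r in ordered)]
    pool = uncovered if uncovered else list(labels)
    default = Counter(pool).most_common(1)[0][0]
    return ordered, default

rules = [{'antecedent': {('a', 'b')}, 'strength': 2.0, 'label': 'c1'},
         {'antecedent': {('x', 'y')}, 'strength': 0.5, 'label': 'c2'}]
graphs = [{('a', 'b'), ('b', 'c')}, {('d', 'e')}, {('d', 'e'), ('e', 'f')}]
labels = ['c1', 'c2', 'c2']
ordered, default = train_rule_model(rules, graphs, labels)
print(default)  # c2: the dominant class among the two uncovered graphs
```

The weak rule is pruned by the strength threshold, and since the two remaining training graphs are covered by no surviving rule, the default class is their dominant label.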


After the training model has been constructed, it can be used for classification as follows. For a given test graph G, the rules that are fired by G are determined. If no rules are fired, then the default class is reported. Let Rc(G) be the set of rules fired by G. Note that these different rules may not yield the same prediction for G. Therefore, the conflicting predictions of different rules need to be combined meaningfully. The different criteria that are used to combine the predictions are as follows:





  1. Average strength: The average strength of the rules predicting each class is determined. The class with the highest average strength is reported.




  2. Best rule: The top rule is determined on the basis of the priority order discussed earlier. The class label of this rule is reported.




  3. Top-k average strength: This can be considered a combination of the previous two methods. The average strength of the top-k rules of each class is used to determine the predicted label.
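The three combination criteria above can be sketched as follows, assuming that each fired rule in Rc(G) is represented by a (class label, strength) pair and that the fired rules arrive in decreasing priority order, as produced by the training phase. The function names are illustrative.

```python
from collections import defaultdict

def average_strength(fired):
    # Criterion 1: class with the highest mean strength over its fired rules.
    by_class = defaultdict(list)
    for label, s in fired:
        by_class[label].append(s)
    return max(by_class, key=lambda c: sum(by_class[c]) / len(by_class[c]))

def best_rule(fired):
    # Criterion 2: the single highest-priority rule wins.
    return fired[0][0]

def top_k_average_strength(fired, k):
    # Criterion 3: mean strength of each class's top-k rules.
    by_class = defaultdict(list)
    for label, s in fired:            # fired is priority-ordered, so the
        if len(by_class[label]) < k:  # first k rules per class are its top-k
            by_class[label].append(s)
    return max(by_class, key=lambda c: sum(by_class[c]) / len(by_class[c]))

fired = [('c1', 0.9), ('c2', 0.8), ('c2', 0.8), ('c1', 0.1)]
print(best_rule(fired))                  # c1
print(average_strength(fired))           # c2: mean 0.8 beats mean 0.5
print(top_k_average_strength(fired, 1))  # c1: top strength 0.9 beats 0.8
```

The toy example shows how the criteria can disagree: averaging over all fired rules penalizes class c1 for its weak second rule, while the best-rule and top-1 criteria reward its single strong rule.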

The XRules procedure uses an efficient algorithm for frequent substructure discovery, together with several variations of the rule-quantification criteria. Refer to the bibliographic notes for pointers.


17.6.3 Kernel SVMs


Kernel SVMs can construct classifiers with the use of kernel similarity between training and test instances. As discussed in Sect. 10.6.4 of Chap. 10, kernel SVMs do not actually need the feature representation of the data, as long as the kernel-based similarity K(Gi, Gj) between any pair of graph objects is available. Therefore, the approach is agnostic to the specific data type that is used. The different kinds of graph kernels are discussed in Sect. 17.3.3 of this chapter. Any of these kernels can be used in conjunction with the SVM approach. Refer to Sect. 10.6.4 of Chap. 10 for details on how kernels may be used in conjunction with SVM classifiers.
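A minimal sketch of this usage with scikit-learn's SVC and a precomputed kernel is shown below. The kernel here is a stand-in (an RBF kernel over random toy vectors); in practice, any of the graph kernels of Sect. 17.3.3 would supply the same-shaped Gram matrices, and no explicit feature representation of the graphs would be needed.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 5))  # stand-ins for 20 training graphs
X_test = rng.normal(size=(5, 5))    # stand-ins for 5 test graphs
# Toy labels split around the mean, guaranteeing both classes are present.
y_train = (X_train[:, 0] > X_train[:, 0].mean()).astype(int)

def kernel(A, B, gamma=0.5):
    # Stand-in for a graph kernel K(Gi, Gj); only the resulting matrix of
    # pairwise similarities is needed by the SVM.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

clf = SVC(kernel='precomputed')
clf.fit(kernel(X_train, X_train), y_train)    # n_train x n_train Gram matrix
preds = clf.predict(kernel(X_test, X_train))  # n_test x n_train similarities
```

Note that prediction requires the kernel values between each test object and every training object, not a test-test Gram matrix.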


17.7 Summary

This chapter studies the problem of mining graph data sets. Graph data are a challenging domain for analysis because of the difficulty in matching two graphs when there are repetitions in the underlying labels. This is referred to as the graph isomorphism problem. Most methods for graph matching require exponential time in the worst case. The maximum common subgraph (MCG) between a pair of graphs can be used to define distance measures between graphs. The edit-distance measure also uses an algorithm that is closely related to the MCG algorithm. Because of the complexity of matching algorithms in graphs, a different approach is to transform the graph database into a simpler text-like representation, in terms of which distance functions are defined. An important class of graph distance functions is graph kernels. They can be used for clustering and classification.


The frequent substructure discovery algorithm is an important building block because it can be leveraged for other graph mining problems such as clustering and classification.




The Apriori-like algorithms use either a node-growth strategy or an edge-growth strategy to generate the candidates and corresponding frequent substructures. Most of the clustering and classification algorithms for graph data are based either on distances or on frequent substructures. The distance-based methods include the k-medoids and spectral methods for clustering. For classification, the distance-based methods include the k-nearest neighbor method and graph-based semi-supervised methods. Kernel-based SVMs can also be considered specialized distance-based methods, in which SVMs are leveraged in conjunction with the similarity between data objects.


Frequent substructure-based methods are used frequently for graph clustering and classification. A generic approach is to transform the graphs into a new feature representation that is similar to text data. Any of the text clustering or classification algorithms can be applied to this representation. Another approach is to directly mine the frequent substructures and use them as representative sets of clusters, or as antecedents of discriminative rules. The XProj and XRules algorithms are based on this principle.


17.8 Bibliographic Notes


The problem of graph matching is addressed in surveys in [26]. The Ullman algorithm for graph matching was proposed in [164]. Two other well-known methods for graph matching are VF2 [162] and QuickSI [163]. Other approximate matching methods are discussed in [313, 314, 521]. The proof of NP-hardness of the graph matching problem may be found in [221, 164]. The use of the MCG for defining distance functions was studied in [120]. The relationship between the graph-edit distance and the maximum common subgraph problem is studied in detail in [119]. The graph edit-distance algorithm discussed in the chapter is a simplification of the algorithm presented in [384]. A number of fast algorithms for computing the graph edit distance are discussed in [409]. The problem of learning edit costs is studied in [408]. The survey by Bunke in [26] also discusses methods for computing the graph edit costs. A description of the use of topological descriptors in drug design may be found in [236]. The random walk kernel is discussed in [225, 298], and the shortest-path kernel is discussed in [103]. The work in [225] also provides a generic discussion of graph kernels. The work in [42] shows that frequent substructure-based similarity computation can provide robust results in data mining applications.


The node-growth strategy for frequent subgraph mining was proposed by Inokuchi, Washio, and Motoda [282]. The edge-growth strategy was proposed by Kuramochi and Karypis [331]. The gSpan algorithm was proposed by Yan and Han [519] and uses a depth-first approach to build the candidate tree of graph patterns. A method that uses the vertical representation for graph pattern mining is discussed in [276]. The problem of mining frequent trees in a forest was addressed in [536]. Surveys on graph clustering and classification may be found in [26]. The XProj algorithm is discussed in [42], and the XRules algorithm is discussed in [540]. Methods for kernel SVM-based classification are discussed in the graph classification chapter by Tsuda in [26].


17.9 Exercises








  1. Consider two graphs that are cliques containing an even number 2 · n of nodes each. Let exactly half the nodes in each graph carry the label A and the other half the label B. What is the total number of isomorphic matchings between the two graphs?

