Data Mining: The Textbook
15.6.3 Rule-Based Methods A major challenge in sequence classification is that many parts of the sequence may be noisy and not very relevant to the class label. In some cases, a short pattern of two symbols may be relevant to classification, whereas in other cases, a longer pattern of many symbols may be discriminative for classification. In some cases, the discriminative patterns may not even occur contiguously. This issue was discussed in the context of timeseries classification in Sect. 14.7.2.1 of Chap. 14. However, discrete sequences can be converted into binary timeseries sequences, with the use of binarization. These binary timeseries can be converted to multidimensional wavelet representations. This is described in detail in Sect. 2.2.2.6, and the description is repeated here for completeness. The first step is to convert the discrete sequence to a set of (binary) time series, where the number of time series in this set is equal to the number of distinct symbols. The second step is to map each of these time series into a multidimensional vector using the wavelet transform. Finally, the features from the different series are combined to create a single multidimensional record. A rule-based classifier is constructed on this multidimensional representation. To convert a sequence to a binary time series, one can create a binary string, in which each position value denotes whether or not a particular symbol is present at a position. For example, consider the following nucleotide sequence drawn on four symbols: ACACACTGTGACTG This series can be converted into the following set of four binary time series corresponding to the symbols A, C, T, and G, respectively. 10101000001000 01010100000100 00000010100010 00000001010001 A wavelet transformation can be applied to each of these series to create a multidimensional set of features. The features from the four different series can be appended to create a single numeric multidimensional record. After a multidimensional representation has been obtained, any rule -based classifier can be utilized. Therefore, the overall approach for data classification is as follows: Generate wavelet representation of each of the N sequences to create N numeric multidimensional representations, as discussed above. Discretize wavelet representation to create categorical representations of the timeseries wavelet transformation. Thus, each categorical attribute value represents a range of numeric values of the wavelet coefficients. Generate a set of rules using any rule-based classifier described in Sect. 10.4 of Chap. 10. The patterns on the left-hand represent the patterns of different granu-larities defined by the combination of wavelet coefficients on the left-hand side. When the rule set has been generated, it can be used to classify arbitrary test sequences by first transforming the test sequence to the same wavelet-based numeric multidimensional representation. This representation is used with the fired rules to perform the classification. 524 CHAPTER 15. MINING DISCRETE SEQUENCES Such methods are discussed in Sect. 10.4 of Chap. 10. It is not difficult to see that this approach is a discrete version of the rule-based classification of time series, as presented in Sect. 14.7.2.1 of Chap. 14. 15.6.4 Kernel Support Vector Machines Kernel support vector machines can construct classifiers with the use of kernel similarity between training and test instances. As discussed in Sect. 10.6.4 of Chap. 10, kernel support vector machines do not need the feature values of the records, as long as the kernel-based similarity K(Yi, Yj ) between any pair of data objects are available. In this case, these data objects are strings. Different kinds of kernels are very popular for string classification. 15.6.4.1 Bag-of-Words Kernel In the bag-of-words kernel, the string is treated as a bag of alphabets, with a frequency equal to the number of alphabets of each type in the string. This can be viewed as the vector-space representation of a string. Note that a text document is also a string, with an alphabet size equal to the lexicon. Therefore, the transformation Φ(·) can be vie wed as almost equivalent to the vector -space transformation for a text document. If V (Yi ) be the vector-space representation of a string, then the kernel similarity is equal to the dot product between the corresponding vector space representations. Φ(Yi) = V (Yi) Yüklə 17,13 Mb. Dostları ilə paylaş: |