K(Yi, Yj) = Φ(Yi) · Φ(Yj) = V(Yi) · V(Yj)
The main disadvantage of this kernel is that it discards all positional information about the symbols in the string. This can still be an effective approach when the alphabet size is large. An example is text, where the alphabet (lexicon) size is a few hundred thousand words. For smaller alphabet sizes, however, the information loss is often too significant for the resulting classifier to be useful.
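As a concrete illustration, the following minimal Python sketch computes this kernel from symbol counts; the function name and the nucleotide alphabet are illustrative choices rather than part of the original description. Note how an anagram pair scores the same as an identical pair, reflecting the loss of positional information:

```python
from collections import Counter

def composition_kernel(y_i, y_j, alphabet="ACGT"):
    """Bag-of-words kernel: dot product of the symbol-frequency
    vectors V(Yi) and V(Yj)."""
    c_i, c_j = Counter(y_i), Counter(y_j)
    return sum(c_i[a] * c_j[a] for a in alphabet)

# Positional information is lost: an anagram scores the same
# as an exact copy of the string.
print(composition_kernel("ATGC", "CGTA"))  # 4
print(composition_kernel("ATGC", "ATGC"))  # 4
```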
15.6.4.2 Spectrum Kernel
The bag-of-words kernel loses all the sequential information in the strings. The spectrum kernel addresses this issue by extracting k-mers from the strings and using them to construct the vector-space representation. The simplest spectrum kernel pulls out all k-mers from the strings and builds a vector-space representation from them. For example, consider the string ATGCGATGG constructed on the alphabet Σ = {A, C, T, G}. Then, the corresponding spectrum representation for k = 3 is as follows:
ATG(2), TGC(1), GCG(1), CGA(1), GAT(1), TGG(1)
The values in the brackets correspond to the frequencies in the vector-space representation. This corresponds to the feature map Φ(·) used to define the kernel similarity.
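A minimal Python sketch of this feature map and the resulting kernel follows. It uses plain hash maps for the sparse vectors, which suffices for illustration even though more efficient data structures exist (see below); the function names are illustrative:

```python
from collections import Counter

def spectrum_features(y, k):
    """Feature map Phi(.): frequencies of all k-mers in the string."""
    return Counter(y[r:r + k] for r in range(len(y) - k + 1))

def spectrum_kernel(y_i, y_j, k):
    """Dot product of the two k-mer frequency vectors."""
    phi_i, phi_j = spectrum_features(y_i, k), spectrum_features(y_j, k)
    return sum(count * phi_j[kmer] for kmer, count in phi_i.items())

# The example from the text: ATG appears twice, the rest once.
print(spectrum_features("ATGCGATGG", 3))
# Counter({'ATG': 2, 'TGC': 1, 'GCG': 1, 'CGA': 1, 'GAT': 1, 'TGG': 1})
print(spectrum_kernel("ATGCGATGG", "ATGCGATGG", 3))  # 9
```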
It is possible to enhance the spectrum kernel further by adding a mismatch neighborhood to the kernel. Thus, instead of adding only the extracted k-mers to the feature map, we add
all the k-mers that are at most m mismatches away from it. For example, at a mismatch level of m = 1, the following k-mers are added to the feature map for each instance of ATG:
CTG, GTG, TTG, ACG, AAG, AGG, ATC, ATA, ATT
This procedure is repeated for each position in the k-mer, and each of the resulting neighborhood elements is added to the feature map. The dot product is performed on this expanded feature map Φ(·). The rationale for adding mismatches is to allow for some noise in the similarity computation. The bag-of-words kernel can be viewed as a special case of the spectrum kernel with k = 1 and no mismatches. The spectrum kernel can be computed efficiently with the use of either the trie or the suffix tree data structure. Pointers to such efficient computational methods are provided in the bibliographic notes. One advantage of spectrum kernels is that they can compute the similarity between two strings in an intuitively appealing way, even when the lengths of the two strings vary widely.
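The following Python sketch illustrates the expanded feature map for the special case m = 1; handling a general mismatch level m would require recursive neighborhood generation, which is omitted here. The function names are illustrative:

```python
from collections import Counter

def mismatch_neighborhood(kmer, alphabet="ACGT"):
    """All k-mers exactly one substitution away from kmer (the m = 1 case)."""
    return [kmer[:pos] + a + kmer[pos + 1:]
            for pos, symbol in enumerate(kmer)
            for a in alphabet if a != symbol]

def mismatch_features(y, k, alphabet="ACGT"):
    """Expanded feature map: each extracted k-mer contributes itself
    plus its one-mismatch neighborhood."""
    phi = Counter()
    for r in range(len(y) - k + 1):
        kmer = y[r:r + k]
        phi[kmer] += 1
        phi.update(mismatch_neighborhood(kmer, alphabet))
    return phi

def mismatch_kernel(y_i, y_j, k):
    """Dot product over the expanded feature maps."""
    phi_i, phi_j = mismatch_features(y_i, k), mismatch_features(y_j, k)
    return sum(c * phi_j[kmer] for kmer, c in phi_i.items())

# The nine neighbors of ATG listed in the text:
print(sorted(mismatch_neighborhood("ATG")))
# ['AAG', 'ACG', 'AGG', 'ATA', 'ATC', 'ATT', 'CTG', 'GTG', 'TTG']
```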
15.6.4.3 Weighted Degree Kernel
The previous two kernel methods explicitly define a feature map Φ(·) that largely ignores the ordering between the different k-mers. The weighted degree kernel directly defines K(Yi, Yj), without explicitly defining a feature map Φ(·). This approach is in the spirit of exploiting the full power of kernel methods. Consider two strings Yi and Yj of the same length n. Let KMER(Yi, r, k) represent the k-mer extracted from Yi starting at position r. Then, the weighted degree kernel computes the kernel similarity as the number of times that k-mers of up to a maximum specified length, at exactly corresponding positions in the two strings, match perfectly. Thus, unlike spectrum kernels, k-mers of varying lengths are used, and the contribution of a particular length s is weighted by the coefficient βs. In other words, the weighted degree kernel of order k is defined as follows:
K(Yi, Yj) = Σ_{s=1}^{k} βs · Σ_{r=1}^{n−s+1} I(KMER(Yi, r, s) = KMER(Yj, r, s))

Here, I(·) is the indicator function, which takes on the value of 1 when the two k-mers match exactly, and 0 otherwise.
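A minimal Python sketch of this computation follows. It assumes the common convention βs = 2(k − s + 1)/(k(k + 1)) for the length weights, which gives shorter k-mers larger weight; the text leaves the choice of βs open, so this default is an assumption:

```python
def weighted_degree_kernel(y_i, y_j, k, betas=None):
    """Weighted degree kernel of order k for two strings of equal length n.

    For each length s <= k, counts the positionally aligned s-mers that
    match exactly, weighting length s by beta_s. The default weights
    beta_s = 2(k - s + 1) / (k(k + 1)) are a common convention, not a
    choice fixed by the text.
    """
    n = len(y_i)
    assert len(y_j) == n, "strings must have the same length"
    if betas is None:
        betas = {s: 2 * (k - s + 1) / (k * (k + 1)) for s in range(1, k + 1)}
    total = 0.0
    for s in range(1, k + 1):
        matches = sum(y_i[r:r + s] == y_j[r:r + s]
                      for r in range(n - s + 1))
        total += betas[s] * matches
    return total

# 4 matching 1-mers, 2 matching 2-mers, 1 matching 3-mer:
# 0.5*4 + (1/3)*2 + (1/6)*1 = 17/6
print(weighted_degree_kernel("ATGCG", "ATGGG", k=3))  # ≈ 2.83
```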