Data Mining: The Textbook




K(Yi, Yj) = Φ(Yi) · Φ(Yj) = V(Yi) · V(Yj)

The main disadvantage of this kernel is that it loses all information about the positions of the alphabet symbols in the sequence. This can nevertheless be an effective approach when the alphabet size is large. An example is text, where the alphabet (lexicon) size is a few hundred thousand words. For smaller alphabet sizes, however, the information loss can be too significant for the resulting classifier to be useful.
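A minimal sketch of the bag-of-words kernel makes the definition above concrete (the function name is illustrative, not from the text): the count vector of symbols plays the role of V(·), and the kernel is the dot product of the two count vectors.

```python
from collections import Counter

def bag_of_words_kernel(Yi, Yj):
    """Kernel similarity as the dot product of per-symbol frequency vectors.

    Positional information within the strings is ignored entirely, which is
    exactly the information loss discussed in the text.
    """
    ci, cj = Counter(Yi), Counter(Yj)
    return sum(ci[sym] * cj[sym] for sym in ci)

# On the nucleotide alphabet {A, C, T, G}:
# "ATG" has counts A:1, T:1, G:1 and "ATGG" has counts A:1, T:1, G:2,
# so the kernel value is 1*1 + 1*1 + 1*2 = 4.
```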


15.6.4.2 Spectrum Kernel


The bag-of-words kernel loses all the sequential information in the strings. The spectrum kernel addresses this issue by extracting k-mers from the strings and using them to construct the vector-space representation. The simplest spectrum kernel pulls out all k-mers from the strings and builds a vector-space representation from them. For example, consider the string ATGCGATGG constructed on the alphabet Σ = {A, C, T, G}. Then, the corresponding spectrum representation for k = 3 is as follows:


ATG(2), TGC(1), GCG(1), CGA(1), GAT(1), TGG(1)


The values in the brackets correspond to the frequencies in the vector-space representation. This corresponds to the feature map Φ(·) used to define the kernel similarity.
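The feature map and the resulting kernel can be sketched in a few lines (function names are illustrative, not from the text): slide a window of length k over each string, count the k-mers, and take the dot product of the two count vectors.

```python
from collections import Counter

def spectrum_features(Y, k):
    """Feature map Phi(.) of the spectrum kernel: k-mer frequency counts."""
    return Counter(Y[r:r + k] for r in range(len(Y) - k + 1))

def spectrum_kernel(Yi, Yj, k):
    """Dot product of the two k-mer frequency vectors."""
    fi, fj = spectrum_features(Yi, k), spectrum_features(Yj, k)
    return sum(fi[kmer] * fj[kmer] for kmer in fi)

# For "ATGCGATGG" and k = 3, the counts match the text:
# ATG(2), TGC(1), GCG(1), CGA(1), GAT(1), TGG(1).
```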


It is possible to enhance the spectrum kernel further by adding a mismatch neighborhood to the kernel. Thus, instead of adding only the extracted k-mers to the feature map, we add all the k-mers that are m mismatches away from the extracted k-mer. For example, at a mismatch level of m = 1, the following k-mers are added to the feature map for each instance of ATG:


CTG, GTG, TTG, ACG, AAG, AGG, ATC, ATA, ATT


This procedure is repeated for each element in the k-mer, and each of the neighborhood elements is added to the feature map. The dot product is performed on this expanded feature map Φ(·). The rationale for adding mismatches is to allow for some noise in the similarity computation. The bag-of-words kernel can be viewed as a special case of the spectrum kernel with k = 1 and no mismatches. The spectrum kernel can be computed efficiently with the use of either the trie or the suffix tree data structure. Pointers to such efficient computational methods are provided in the bibliographic notes. One advantage of spectrum kernels is that they can compute the similarity between two strings in an intuitively appealing way, even when the lengths of the two strings vary widely.
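The m = 1 mismatch neighborhood used above can be generated with a simple sketch (function name illustrative): substitute each position of the k-mer with every other alphabet symbol, which yields the nine 3-mers listed for ATG.

```python
def mismatch_neighborhood(kmer, alphabet="ACGT"):
    """All k-mers exactly one substitution away from the given k-mer (m = 1)."""
    neighbors = []
    for pos, sym in enumerate(kmer):
        for alt in alphabet:
            if alt != sym:
                # Replace the symbol at this position with a different one.
                neighbors.append(kmer[:pos] + alt + kmer[pos + 1:])
    return neighbors

# For ATG over {A, C, G, T}: 3 positions x 3 substitutions = 9 neighbors,
# the same set as in the text (CTG, GTG, TTG, ACG, AAG, AGG, ATC, ATA, ATT).
```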


15.6.4.3 Weighted Degree Kernel

The previous two kernel methods explicitly define a feature map Φ(·) that largely ignores the ordering between the different k-mers. The weighted degree kernel directly defines K(Yi, Yj), without explicitly defining a feature map Φ(·). This approach is in the spirit of exploiting the full power of kernel methods. Consider two strings Yi and Yj of the same length n. Let KMER(Yi, r, k) represent the k-mer extracted from Yi starting at position r. Then, the weighted degree kernel computes the kernel similarity as the number of times that k-mers of a maximum specified length, at exactly corresponding positions in the two strings, match perfectly. Thus, unlike spectrum kernels, k-mers of varying lengths are used, and the contribution of a particular length s is weighted by the coefficient βs. In other words, the weighted degree kernel of order k is defined as follows:





K(Yi, Yj) = Σ_{s=1}^{k} βs Σ_{r=1}^{n−s+1} I(KMER(Yi, r, s) = KMER(Yj, r, s))

Here, I(·) is an indicator function that takes the value 1 when its two arguments match exactly, and 0 otherwise.
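A direct sketch of the order-k weighted degree kernel follows (function name illustrative). The weighting βs = 2(k − s + 1)/(k(k + 1)) is a commonly used choice for this kernel, assumed here since the text leaves the coefficients unspecified.

```python
def weighted_degree_kernel(Yi, Yj, k):
    """Weighted degree kernel of order k for two strings of equal length n.

    For every length s = 1..k, counts the position-aligned s-mers of the two
    strings that match perfectly, weighting length s by
    beta_s = 2*(k - s + 1)/(k*(k + 1)) (an assumed, commonly used choice).
    """
    n = len(Yi)
    assert len(Yj) == n, "the kernel is defined for strings of equal length"
    total = 0.0
    for s in range(1, k + 1):
        beta_s = 2.0 * (k - s + 1) / (k * (k + 1))
        # Indicator sum over all starting positions r = 0..n-s.
        matches = sum(Yi[r:r + s] == Yj[r:r + s] for r in range(n - s + 1))
        total += beta_s * matches
    return total
```

For k = 1 the weight β1 is 1, so the kernel simply counts matching symbols at corresponding positions.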

