K(Yi, Yj) = Φ(Yi) · Φ(Yj) = V(Yi) · V(Yj)
The main disadvantage of this kernel is that it discards all positional information about the symbols in the string. This can still be an effective approach when the alphabet size is large. An example is text, where the alphabet (lexicon) size is a few hundred thousand words. For smaller alphabet sizes, however, the information loss is often too significant for the resulting classifier to be useful.
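As a concrete illustration, the following minimal Python sketch computes this kernel from symbol counts; the function name and the nucleotide alphabet are illustrative choices rather than part of the original description. Note how an anagram pair scores the same as an identical pair, reflecting the loss of positional information:

```python
from collections import Counter

def composition_kernel(y_i, y_j, alphabet="ACGT"):
    """Bag-of-words kernel: dot product of the symbol-frequency
    vectors V(Yi) and V(Yj)."""
    c_i, c_j = Counter(y_i), Counter(y_j)
    return sum(c_i[a] * c_j[a] for a in alphabet)

# Positional information is lost: an anagram scores the same
# as an exact copy of the string.
print(composition_kernel("ATGC", "CGTA"))  # 4
print(composition_kernel("ATGC", "ATGC"))  # 4
```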
15.6.4.2 Spectrum Kernel
The bag-of-words kernel loses all the sequential information in the strings. The spectrum kernel addresses this issue by extracting k-mers from the strings and using them to construct the vector-space representation. The simplest spectrum kernel pulls out all k-mers from the strings and builds a vector-space representation from them. For example, consider the string ATGCGATGG constructed on the alphabet Σ = {A, C, T, G}. Then, the corresponding spectrum representation for k = 3 is as follows:
ATG(2), TGC(1), GCG(1), CGA(1), GAT(1), TGG(1)
The values in the brackets correspond to the frequencies in the vector-space representation. This corresponds to the feature map Φ(·) used to define the kernel similarity.
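A minimal Python sketch of this feature map and the resulting kernel follows. It uses plain hash maps for the sparse vectors, which suffices for illustration even though more efficient data structures exist (see below); the function names are illustrative:

```python
from collections import Counter

def spectrum_features(y, k):
    """Feature map Phi(.): frequencies of all k-mers in the string."""
    return Counter(y[r:r + k] for r in range(len(y) - k + 1))

def spectrum_kernel(y_i, y_j, k):
    """Dot product of the two k-mer frequency vectors."""
    phi_i, phi_j = spectrum_features(y_i, k), spectrum_features(y_j, k)
    return sum(count * phi_j[kmer] for kmer, count in phi_i.items())

# The example from the text: ATG appears twice, the rest once.
print(spectrum_features("ATGCGATGG", 3))
# Counter({'ATG': 2, 'TGC': 1, 'GCG': 1, 'CGA': 1, 'GAT': 1, 'TGG': 1})
print(spectrum_kernel("ATGCGATGG", "ATGCGATGG", 3))  # 9
```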
It is possible to enhance the spectrum kernel further by adding a mismatch neighborhood to the kernel. Thus, instead of adding only the extracted k-mers to the feature map, we add
all the k-mers that are at most m mismatches away from it. For example, at a mismatch level of m = 1, the following k-mers are added to the feature map for each instance of ATG:
CTG, GTG, TTG, ACG, AAG, AGG, ATC, ATA, ATT
This procedure is repeated for each position in the k-mer, and each of the resulting neighborhood elements is added to the feature map. The dot product is performed on this expanded feature map Φ(·). The rationale for adding mismatches is to allow for some noise in the similarity computation. The bag-of-words kernel can be viewed as a special case of the spectrum kernel with k = 1 and no mismatches. The spectrum kernel can be computed efficiently with the use of either the trie or the suffix tree data structure. Pointers to such efficient computational methods are provided in the bibliographic notes. One advantage of spectrum kernels is that they can compute the similarity between two strings in an intuitively appealing way, even when the lengths of the two strings vary widely.
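The following Python sketch illustrates the expanded feature map for the special case m = 1; handling a general mismatch level m would require recursive neighborhood generation, which is omitted here. The function names are illustrative:

```python
from collections import Counter

def mismatch_neighborhood(kmer, alphabet="ACGT"):
    """All k-mers exactly one substitution away from kmer (the m = 1 case)."""
    return [kmer[:pos] + a + kmer[pos + 1:]
            for pos, symbol in enumerate(kmer)
            for a in alphabet if a != symbol]

def mismatch_features(y, k, alphabet="ACGT"):
    """Expanded feature map: each extracted k-mer contributes itself
    plus its one-mismatch neighborhood."""
    phi = Counter()
    for r in range(len(y) - k + 1):
        kmer = y[r:r + k]
        phi[kmer] += 1
        phi.update(mismatch_neighborhood(kmer, alphabet))
    return phi

def mismatch_kernel(y_i, y_j, k):
    """Dot product over the expanded feature maps."""
    phi_i, phi_j = mismatch_features(y_i, k), mismatch_features(y_j, k)
    return sum(c * phi_j[kmer] for kmer, c in phi_i.items())

# The nine neighbors of ATG listed in the text:
print(sorted(mismatch_neighborhood("ATG")))
# ['AAG', 'ACG', 'AGG', 'ATA', 'ATC', 'ATT', 'CTG', 'GTG', 'TTG']
```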
15.6.4.3 Weighted Degree Kernel
The previous two kernel methods explicitly define a feature map Φ(·) that largely ignores the ordering between the different k-mers. The weighted degree kernel directly defines K(Yi, Yj), without explicitly defining a feature map Φ(·). This approach is in the spirit of exploiting the full power of kernel methods. Consider two strings Yi and Yj of the same length n. Let KMER(Yi, r, k) represent the k-mer extracted from Yi starting at position r. Then, the weighted degree kernel computes the kernel similarity as the number of times that k-mers of up to a maximum specified length, at exactly corresponding positions in the two strings, match perfectly. Thus, unlike spectrum kernels, k-mers of varying lengths are used, and the contribution of a particular length s is weighted by the coefficient βs. In other words, the weighted degree kernel of order k is defined as follows:
K(Yi, Yj) = Σ_{s=1}^{k} βs · Σ_{r=1}^{n−s+1} I(KMER(Yi, r, s) = KMER(Yj, r, s))

Here, I(·) is the indicator function, which takes on the value of 1 when the two k-mers match exactly, and 0 otherwise.
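A minimal Python sketch of this computation follows. It assumes the common convention βs = 2(k − s + 1)/(k(k + 1)) for the length weights, which gives shorter k-mers larger weight; the text leaves the choice of βs open, so this default is an assumption:

```python
def weighted_degree_kernel(y_i, y_j, k, betas=None):
    """Weighted degree kernel of order k for two strings of equal length n.

    For each length s <= k, counts the positionally aligned s-mers that
    match exactly, weighting length s by beta_s. The default weights
    beta_s = 2(k - s + 1) / (k(k + 1)) are a common convention, not a
    choice fixed by the text.
    """
    n = len(y_i)
    assert len(y_j) == n, "strings must have the same length"
    if betas is None:
        betas = {s: 2 * (k - s + 1) / (k * (k + 1)) for s in range(1, k + 1)}
    total = 0.0
    for s in range(1, k + 1):
        matches = sum(y_i[r:r + s] == y_j[r:r + s]
                      for r in range(n - s + 1))
        total += betas[s] * matches
    return total

# 4 matching 1-mers, 2 matching 2-mers, 1 matching 3-mer:
# 0.5*4 + (1/3)*2 + (1/6)*1 = 17/6
print(weighted_degree_kernel("ATGCG", "ATGGG", k=3))  # ≈ 2.83
```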