Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	55/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 51 52 53 54 55 56 57 58 ... 423

1-Data Mining tarjima

r

Sim(X, Y ) =M atch(X_i, Y_i).	(3.19)

i=1

variety of measures discussed in this section may be used to instantiate the value of M atch(X_i, Y_i). It is tricky to determine the proper value of M atch(X_i, Y_i) because a contiguous match along a long window is more unusual than many short segments of the same length. The proper choice of M atch(X_i, Y_i) may depend on the application at hand. Another problem is that the optimal decomposition of the series into windows may be a diﬃcult task. These methods are not discussed in detail here, but the interested reader is referred to the bibliographic notes for pointers to relevant methods.

3.4.2 Discrete Sequence Similarity Measures

Discrete sequence similarity measures are based on the same general principles as time-series similarity measures. As in the case of time-series data, discrete sequence data may or may not have a one-to-one mapping between the positions. When a one-to-one mapping does exist, many of the multidimensional categorical distance measures can be adapted to this domain, just as the L_p-norm can be adapted to continuous time series. However, the application domains of discrete sequence data are most often such that a one-to -one mapping does not exist. Aside from the DTW approach, a number of other dynamic programming methods are commonly used.

3.4.2.1 Edit Distance

The edit distance defines the distance between two strings as the least amount of “eﬀort” (or cost) required to transform one sequence into another by using a series of transformation operations, referred to as “edits.” The edit distance is also referred to as the Levenshtein distance. The edit operations include the use of symbol insertions, deletions, and replace-ments with specific costs. In many models, replacements are assumed to have higher cost than insertions or deletions, though insertions and deletions are usually assumed to have the same cost. Consider the sequences ababababab and bababababa, which are drawn on the alphabet {a, b}. The first string can be transformed to the second in several ways. For example, if every alphabet in the first string was replaced by the other alphabet, it would result in the second string. The cost of doing so is that of ten replacements. However, a more cost-eﬃcient way of achieving the same goal is to delete the leftmost element of the string, and insert the symbol “a” as the rightmost element. The cost of this sequence of operations is only one insertion and one deletion. The edit distance is defined as the optimal cost to

3.4. TEMPORAL SIMILARITY MEASURES

transform one string to another with a sequence of insertions, deletions, and replacements.

The computation of the optimal cost requires a dynamic programming recursion.

For two sequences X = (x₁ . . . x_m ) and Y = (y₁ . . . y_n), let the edits be performed on sequence X to transform to Y . Note that this distance function is asymmetric because of the directionality to the edit. For example, Edit(X, Y ) may not be the same as Edit(Y , X) if the insertion and deletion costs are not identical. In practice, however, the insertion and deletion costs are assumed to be the same.

Let I_ij be a binary indicator that is 0 when the ith symbol of X and jth symbols of Y are the same. Otherwise, the value of this indicator is 1. Then, consider the first i symbols of X and the first j symbols of Y . Assume that these segments are represented by X_i and Y_j , respectively. Let Edit(i, j) represent the optimal matching cost between these segments. The goal is to determine what operation to perform on the last element of X_i so that it either matches an element in Y_j , or it is deleted. Three possibilities arise:

The last element of X_i is deleted, and the cost of this is [Edit(i−1, j)+Deletion Cost]. The last element of the truncated segment X_i−₁ may or may not match the last element of Y_j at this point.

An element is inserted at the end of X_i to match the last element of Y_j , and the cost of this is [Edit(i, j − 1) + Insertion Cost]. The indices of the edit term Edit(i, j − 1) reflect the fact that the matched elements of both series can now be removed.

The last element of X_i is flipped to that of Y_j if it is diﬀerent, and the cost of this is [Edit(i − 1, j − 1) + I_ij · (Replacement Cost)]. In cases where the last elements are the same, the additional replacement cost is not incurred, but progress is nevertheless made in matching. This is because the matched elements (x_i, y_j ) of both series need not be considered further, and residual matching cost is Edit(i − 1, j − 1).

Clearly, it is desirable to pick the minimum of these costs for the optimal matching. There-fore, the optimal matching is defined by the following recursion:

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 51 52 53 54 55 56 57 58 ... 423