Edit(i − 1, j) + Deletion Cost
|
|
|
|
|
−
|
|
|
|
Edit(i, j) = min
|
Edit(i, j
|
|
1) + Insertion Cost
|
.
|
(3.20)
|
|
Edit(i − 1, j − 1) + Iij · (Replacement Cost)
|
|
|
|
|
|
|
|
|
Furthermore, the bottom of the recursion also needs to be set up. The value of Edit(i, 0) is equal to the cost of i deletions for any value of i, and that of Edit (0, j) is equal to the cost of j insertions for any value of j. This nicely sets up the dynamic programming approach. It is possible to write the corresponding computer program either as a nonrecursive nested loop (as in DTW) or as a recursive computer program that directly uses the aforementioned cases.
The aforementioned discussion assumes general insertion, deletion, and replacement costs. In practice, however, the insertion and deletion costs are usually assumed to be the same. In such a case, the edit function is symmetric because it does not matter which of the two strings is edited to the other. For any sequence of edits from one string to the other, a reverse sequence of edits, with the same cost, will exist from the other string to the first.
The edit distance can be extended to numeric data by changing the primitive operations of insert, delete, and replace to transformation rules that are designed for time series. Such transformation rules can include making basic changes to the shape of the time series in
84 CHAPTER 3. SIMILARITY AND DISTANCES
window segments. This is more complex because it requires one to design the base set of allowed time-series shape transformations and their costs. Such an approach has not found much popularity for time-series distance computation.
3.4.2.2 Longest Common Subsequence
A subsequence of a sequence is a set of symbols drawn from the sequence in the same order as the original sequence. A subsequence is different from a substring in that the values of the subsequence need not be contiguous, whereas the values in the substring need to be contiguous. Consider the sequences agbf cgdhei and af bgchdiei. In this case, ei is a substring of both sequences and also a subsequence. However, abcde and f gi are subsequences of both strings but not substrings. Clearly, subsequences of longer length are indicative of a greater level of matching between the strings. Unlike the edit distance, the longest common subsequence (LCSS) is a similarity function because higher values indicate greater similarity. The number of possible subsequences is exponentially related to the length of a string. However, the LCSS can be computed in polynomial time with a dynamic programming approach.
For two sequences X = (x1 . . . xm) and Y = (y1 . . . yn), consider the first i symbols of
and the first j symbols of Y . Assume that these segments are represented by Xi and Yj , respectively. Let LCSS(i, j) represent the optimal LCSS values between these segments. The goal here is to either match the last element of Xi and Yj , or delete the last element in one of the two sequences. Two possibilities arise:
The last element of Xi matches Yj , in which case, it cannot hurt to instantiate the matching on the last element and then delete the last element of both sequences. The similarity value LCSS(i, j) can be expressed recursively as this is LCSS(i−1, j−1)+1.
The last element does not match. In such a case, the last element of at least one of the two strings needs to be deleted under the assumption that it cannot occur in the matching. In this case, the value of LCSS(i, j) is either LCSS(i, j − 1) or LCSS(i − 1, j), depending on which string is selected for deletion.
Therefore, the optimal matching can be expressed by enumerating these cases:
Dostları ilə paylaş: |