Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə289/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   285   286   287   288   289   290   291   292   ...   423
1-Data Mining tarjima

Single series versus database of many series: In the first case, a single series is available, and the frequently occurring shapes in specific windows of the series are determined. For example, in Fig. 14.6, the highlighted shape appears three times in the same series and therefore has a count of 3. A different formulation is one in which we have N different series, and the occurrence of a shape at least once in a particular series is given a credit of exactly 1. The frequency is therefore computed in terms of the number of series in which the pattern occurs. The second formulation is much closer to sequential pattern mining in discrete data. Different applications may require different definitions of motif discovery.




  1. Contiguous versus noncontiguous motifs: Contiguous motifs require that the shapes are discovered over a contiguous window of the time series. Noncontiguous motifs may allow gaps between different elements of the motif. Much of the work in time series analysis assumes that the motifs are defined over contiguous windows. Non-contiguous motifs are more common in discrete sequence analysis. Nevertheless, noncontiguous motifs may have utility in some applications.




  1. Multigranularity motifs: Many formulations fix the window size in which the motifs are discovered. However, in practice, the frequent motifs may occur over windows of

14.4. TIME SERIES MOTIFS

473

different sizes. Such motifs are very useful in many application-specific scenarios. For example, in the financial-market series of Fig. 14.11a, an important motif is caused by a “flash crash” event over the course of a day. On the other hand, in Fig. 14.11b, the (recessionary) trend occurs over several months. In the second case, it may be needed to smooth out the local variations to discover motifs. Thus, different techniques are required to discover different types of motifs.


When does a motif belong to a time series? Two methods are typically used by different applications.





  1. Distance-based support: A particular segment of a sequence is said to support a motif when the distance between the segment and the motif is less than a particular thresh-old.




  1. Transformation to sequential pattern mining: A variety of discretizations can be used to convert time series into discrete sequences. After the conversion, motifs can be defined as discrete subsequences of the sequences.

The latter method lends itself to richer classes of algorithms from sequential pattern min-ing. Furthermore, the approach used for discretization can be varied to discover motifs of different kinds. It also allows the discovery of noncontiguous patterns because sequential pattern mining algorithms do not assume contiguity by default. This section will discuss both kinds of methods. In addition, the notion of periodic patterns will be introduced.


14.4.1 Distance-Based Motifs


Distance-based motifs are always defined on contiguous segments of the time series. First, the concept of approximate distance match between a motif and a contiguous segment in a time series needs to be defined.


Definition 14.4.1 (Approximate Distance Match) A sequence (or motif ) S =


s1 . . . sw of real values is said to approximately match a contiguous subsequence of length

  • in the time series (y1, . . . yn) (w ≤ n) starting at position i, if the distance between

(s1, . . . sw) and (yi, . . . yi+w−1) is at most .


A wide variety of distance functions may be used, and the Euclidean distance function is a common choice. The aforementioned definition assumes that the two subsequences being matched have the same length. This is a conservative assumption that allows the use of distance functions such as the Euclidean function. However, if other distance functions, such as dynamic time warping, are used, it may not be necessary for both the matched motifs to have the same length.


The number of occurrences of the motif in a single long series is used to quantify the frequency of the motif. In addition to the series itself, the window length w and the approx-imation threshold are the two main inputs to the algorithm.


Definition 14.4.2 (Motif Count) The number of matches of a time series window S = s1 . . . sw to the time series (y1 . . . yn) at threshold level , is equal to the number of windows of length w in (y1 . . . yn), for which the distance between the corresponding subsequences is at most .


The goal is to discover the top k motifs for a user-defined parameter k . Furthermore, to ensure that the k motifs discovered are very different from one another, a constraint is


474 CHAPTER 14. MINING TIME SERIES DATA


Algorithm FindBestMotif(Time Series: y1 . . . yn, Window: w


Distance Threshold: )

begin
for i = 1 to n − w + 1 do begin


Candidate-Motif= (yi, . . . yi+w−1);

for j = 1 to n − w + 1 do begin


Comparison-Motif= (yj . . . yj+w−1);

  • = ComputeDistance(Candidate-Motif, Comparison-Motif); if (D ≤ ) and (non-trivial match)

then increment count of Candidate-Motif by 1;


endfor
if Candidate-Motif has highest count found so far then update Best-Candidate to Candidate-Motif;


endfor

return Best-Candidate;


end

Figure 14.7: Determining the most frequent motif
imposed; the distances between any pairs of motifs discovered among the top-k motifs must be at least 2 · . In the following, the discovery of the most frequent occurring single motif will be described. The generalization to the case of top k motifs is relatively straightforward. The overall approach [356] uses a nested-loop algorithm to discover the most frequent motif. The approach is described in Fig. 14.7.
The approach extracts all of the candidate motifs of length w from a time series and computes the distances to all of the windows of length w. The number of windows over which the match occurs is counted. Care is taken to exclude trivial matches in the count. Trivial matches are defined as those matches where approximately the same (overlapping) window is being matched. For example, the case when i = j is a trivial match. Furthermore, in the case where i < j, if the window starting at i matches with all windows starting at i + 1, i + 2 . . . j, then the match at j is trivial as well. In the case where i > j, if the window starting at i matches with all windows starting at i − 1, i − 2 . . . j, then the match at j is trivial. Therefore, this condition is explicitly checked in the counting. The best candidate is tracked over the course of the algorithm, and reported at termination. As evident from Fig. 14.7, the approach requires a nested loop, and the number of iterations in each loop is almost equal to the size of the series n. Thus, the approach requires O(n2) distance computations. In principle, any time series distance function, such as DTW, can be used for the computation, although it is generally more expensive.

The majority of the time is spent in distance computations. In many cases, a fast com-putation of the lower bound on the distance can be used to speed up the approach. If the computed lower bound between a pair of windows is greater than , then the pair is guar-anteed to be irrelevant for adding to the candidate motif count. Therefore, the distance computation does not need to be explicitly performed. The piecewise aggregate approxima-tion (PAA) can be used to speed up the distance computations. Consider a scenario where the PAA has been performed over windows of length m. The resulting series has been com-pressed by a factor of m, and therefore the distance computations are much faster. If the




Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   285   286   287   288   289   290   291   292   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin