Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə299/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   295   296   297   298   299   300   301   302   ...   423
1-Data Mining tarjima

System diagnosis: Many automated systems generate discrete sequences containing information about the system state. Examples of system state are UNIX system calls, aircraft system states, mechanical system states, or network intrusion states.




  1. Biological data: Biological data typically contains sequences of amino acids. Specific patterns in these sequences may define important properties of the data.




  1. User-action sequences: Various sequences are generated by user actions in different domains.




    1. Web logs contain long sequences of visits to Websites by different individuals.




    1. Customer transactions may contain sequences of buying behavior. The individual elements of the sequence may correspond to the identifiers of the different items, or sets of identifiers of different items that are bought.




    1. User actions on Websites, such as online banking sites, are frequently logged. This case is similar to that of Web logs, except that the logs of banking sites often contain more detailed information for security purposes.

Biological data is a special kind of sequence data, in which the contextual data is not temporal but relates to the placement of the different attributes. Methods for temporal






C. C. Aggarwal, Data Mining: The Textbook, DOI 10.1007/978-3-319-14142-8 15

493

c Springer International Publishing Switzerland 2015



494 CHAPTER 15. MINING DISCRETE SEQUENCES

sequence data can be leveraged for biological sequence data and vice versa. A discrete sequence is formally defined as follows:


Definition 15.1.1 (Discrete Sequence Data) A discrete sequence Y1 . . . Yn of length n and dimensionality d, contains d discrete feature values at each of n different timestamps t1 . . . tn. Each of the n components Yi contains d discrete behavioral attributes (yi1 . . . yid), collected at the ith timestamp.


In many practical scenarios, the timestamps t1 . . . tn may simply be tick values indexed from 1 through n. This is especially true in cases such as biological data, in which the contextual attribute represents placement. In general, the actual timestamps are rarely used in most sequence mining applications, and the discrete sequence values are assumed to be equally spaced in time. Furthermore, most of the analytical techniques are designed for the case where d = 1. Such discrete sequences are also referred to as strings. This chapter will, therefore, use these terms interchangeably. Most of the discussion in this chapter focuses on these more common and simpler cases.


In some applications, such as sequential pattern mining, each Yi is not a vector but a set of unordered values. This is a variation from Definition 15.1.1. Therefore, the notation Yi (without an overline) will be used to denote a set rather than a vector. For example, in a supermarket application, the set Yi may represent a set of items bought by the customer at a given time. There is no temporal ordering among the items in Yi . In a Web log analysis application, the set Yi represents the Web pages browsed by a given user in a single session. Thus, discrete sequences can be defined in a wider variety of ways than timeseries data. This is because of the ability to define sets on discrete items. Each position in the sequence is also referred to as an element and is composed of individual items in the set. Throughout this chapter, the word “element” will refer to one of the sets of items within the sequence, including a 1-itemset.


These variations in definitions arise out of a natural variation in the different kinds of application scenarios. This chapter will study the different problem definitions relevant to discrete sequence mining. The four major problems of pattern mining, clustering, outlier analysis, and classification, are each defined differently for discrete sequence mining than for multidimensional data. These different definitions will be studied in detail. A few models, such as Hidden Markov Models, are used widely across many different application domains. These commonly used models will also be studied in this chapter.


This chapter is organized as follows. Section 15.2 introduces the problem of sequential pattern mining. Sequence clustering algorithms are discussed in Sect. 15.3. The problem of sequence outlier detection is discussed in Sect. 15.4. Section 15.5 introduces Hidden Markov Models (HMM) that can be used for either clustering, classification, or outlier detection. The problem of sequence classification is addressed in Sect. 15.6. Section 15.7 gives a summary.


15.2 Sequential Pattern Mining


The problem of sequential pattern mining can be considered the temporal analog of fre-quent pattern mining. In fact, most algorithms for frequent pattern mining can be directly adapted to sequential pattern mining with a systematic approach, although the latter prob-lem is more complex. As in frequent pattern mining, the original motivating application for sequential pattern mining was market basket analysis, although the problem is now used in a wider variety of temporal application domains, such as computer systems, Web logs, and telecommunication applications.




Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   295   296   297   298   299   300   301   302   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin