Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	299/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 295 296 297 298 299 300 301 302 ... 423

1-Data Mining tarjima

System diagnosis: Many automated systems generate discrete sequences containing information about the system state. Examples of system state are UNIX system calls, aircraft system states, mechanical system states, or network intrusion states.

Biological data: Biological data typically contains sequences of amino acids. Specific patterns in these sequences may define important properties of the data.

User-action sequences: Various sequences are generated by user actions in diﬀerent domains.

Web logs contain long sequences of visits to Websites by diﬀerent individuals.

Customer transactions may contain sequences of buying behavior. The individual elements of the sequence may correspond to the identifiers of the diﬀerent items, or sets of identifiers of diﬀerent items that are bought.

User actions on Websites, such as online banking sites, are frequently logged. This case is similar to that of Web logs, except that the logs of banking sites often contain more detailed information for security purposes.

Biological data is a special kind of sequence data, in which the contextual data is not temporal but relates to the placement of the diﬀerent attributes. Methods for temporal

C. C. Aggarwal, Data Mining: The Textbook, DOI 10.1007/978-3-319-14142-8 15

493

c Springer International Publishing Switzerland 2015

494 CHAPTER 15. MINING DISCRETE SEQUENCES

sequence data can be leveraged for biological sequence data and vice versa. A discrete sequence is formally defined as follows:

Definition 15.1.1 (Discrete Sequence Data) A discrete sequence Y₁ . . . Y_n of length n and dimensionality d, contains d discrete feature values at each of n diﬀerent timestamps t₁ . . . t_n. Each of the n components Y_i contains d discrete behavioral attributes (y_i¹ . . . y_i^d), collected at the ith timestamp.

In many practical scenarios, the timestamps t₁ . . . t_n may simply be tick values indexed from 1 through n. This is especially true in cases such as biological data, in which the contextual attribute represents placement. In general, the actual timestamps are rarely used in most sequence mining applications, and the discrete sequence values are assumed to be equally spaced in time. Furthermore, most of the analytical techniques are designed for the case where d = 1. Such discrete sequences are also referred to as strings. This chapter will, therefore, use these terms interchangeably. Most of the discussion in this chapter focuses on these more common and simpler cases.

In some applications, such as sequential pattern mining, each Y_i is not a vector but a set of unordered values. This is a variation from Definition 15.1.1. Therefore, the notation Y_i (without an overline) will be used to denote a set rather than a vector. For example, in a supermarket application, the set Y_i may represent a set of items bought by the customer at a given time. There is no temporal ordering among the items in Y_i . In a Web log analysis application, the set Y_i represents the Web pages browsed by a given user in a single session. Thus, discrete sequences can be defined in a wider variety of ways than timeseries data. This is because of the ability to define sets on discrete items. Each position in the sequence is also referred to as an element and is composed of individual items in the set. Throughout this chapter, the word “element” will refer to one of the sets of items within the sequence, including a 1-itemset.

These variations in definitions arise out of a natural variation in the diﬀerent kinds of application scenarios. This chapter will study the diﬀerent problem definitions relevant to discrete sequence mining. The four major problems of pattern mining, clustering, outlier analysis, and classification, are each defined diﬀerently for discrete sequence mining than for multidimensional data. These diﬀerent definitions will be studied in detail. A few models, such as Hidden Markov Models, are used widely across many diﬀerent application domains. These commonly used models will also be studied in this chapter.

This chapter is organized as follows. Section 15.2 introduces the problem of sequential pattern mining. Sequence clustering algorithms are discussed in Sect. 15.3. The problem of sequence outlier detection is discussed in Sect. 15.4. Section 15.5 introduces Hidden Markov Models (HMM) that can be used for either clustering, classification, or outlier detection. The problem of sequence classification is addressed in Sect. 15.6. Section 15.7 gives a summary.

15.2 Sequential Pattern Mining

The problem of sequential pattern mining can be considered the temporal analog of fre-quent pattern mining. In fact, most algorithms for frequent pattern mining can be directly adapted to sequential pattern mining with a systematic approach, although the latter prob-lem is more complex. As in frequent pattern mining, the original motivating application for sequential pattern mining was market basket analysis, although the problem is now used in a wider variety of temporal application domains, such as computer systems, Web logs, and telecommunication applications.

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 295 296 297 298 299 300 301 302 ... 423