Data Mining: The Textbook
This chapter is organized as follows. Section 12.2 introduces various types of synop-sis construction methods for data streams. Section 12.3 discusses frequent pattern mining methods for data streams. Clustering methods are discussed in Sect. 12.4, and outlier anal-ysis methods are discussed in Sect. 12.5. Classification methods are introduced in Sect. 12.6. Section 12.7 gives a summary. 12.2 Synopsis Data Structures for Streams A wide variety of synopsis data structures have been designed for different applications. Synopsis data structures are of two types: Generic: In this case, the synopsis can be used for most applications directly. The only such synopsis is a random sample of the data points, although it cannot be used for some applications such as distinct element counting. In the context of data streams, the process of maintaining a random sample from the data is also referred to as reservoir sampling. Specific: In this case, the synopsis is designed for a specific task, such as frequent ele-ment counting or distinct element counting. Examples of such data structures include the Flajolet–Martin data structure for distinct element counting, and sketches for frequent element counting or moment computation. In the following, different types of synopsis structures will be discussed. 12.2.1 Reservoir Sampling Sampling is one of the most flexible methods for stream summarization. The main advantage of sampling over other synopsis data structures is that it can be used for an arbitrary application. After a sample of points has been drawn from the data, virtually any offline algorithm can be applied to the sample. By default, sampling should be considered the method of choice in streaming scenarios, although it does have limitations for a small number of applications such as distinct-element counting. In the context of data streams, the methodology used to maintain a dynamic sample from the data is referred to as reservoir sampling. The resulting sample is referred to as a reservoir sample. The method of reservoir sampling is introduced briefly in Sect. 2.4.1.2 of Chap. 2. The streaming scenario creates some interesting challenges for a simple problem such as sampling. The challenge arises from the fact that one cannot store the entire stream on disk for sampling. In reservoir sampling, the goal is to continuously maintain a dynamically updated sample of k points from a data stream without explicitly storing the stream on disk at any given point in time. Therefore, for each incoming data point in the stream, one must use a set of efficiently implementable operations to maintain the sample. In the static case, the probability of including a data point in the sample is k/n, where k is the sample size, and n is the number of points in the data set. In the streaming scenario, the “data set” is not static, and the value of n continually increases with time. Furthermore, previously arrived data points, which are not included in the sample, have been irrevocably lost. Thus, the sampling approach works with incomplete knowledge about the previous history of the stream at any given moment in time. In other words, for each incoming data point in the stream, one needs to make two simple admission control decisions dynamically: What sampling rule should be used to decide whether to include the incoming data point in the sample? 392 CHAPTER 12. MINING DATA STREAMS What rule should be used to decide how to eject a data point from the sample to “make room” for the newly inserted data point? The reservoir sampling algorithm proceeds as follows. For a reservoir of size k , the first k data points in the stream are always included in the reservoir. Subsequently, for the nth incoming stream data point, the following two admission control decisions are applied. Insert the nth incoming stream data point in the reservoir with probability k/n. If the newly incoming data point was inserted, then eject one of the old k data points in the reservoir at random to make room for the newly arriving point. It can be shown that the aforementioned rule maintains an unbiased reservoir sample from the data stream. Lemma 12.2.1 After n stream points have arrived, the probability of any stream point being included in the reservoir is the same, and is equal to k/n. Proof: This result is easy to show by induction. At initialization of the first k data points, the theorem is trivially true. Let us (inductively) assume that it is also true after (n − 1) data points have been received. Therefore, the probability of each point being included in the reservoir is k/(n − 1). The lemma is trivially true for the arriving data point because the probability of its being included in the stream is k/n. It remains to prove the result for the remaining points in the data stream. There are two disjoint case events that can arise for an incoming data point, and the final probability of a point being included in the reservoir is the sum of these two cases: The incoming data point is not inserted into the reservoir. The probability of this is (n−k)/n. Because the original probability of any point being included in the reservoir by the inductive assumption, is k/(n − 1), the overall probability of a point being included in the reservoir and the occurrence of the Case I event, is the multiplicative value of p1 = k(n−k) . Yüklə 17,13 Mb. Dostları ilə paylaş: |