Data Mining: The Textbook




The intuition behind this result is quite simple. For a uniformly distributed hash function, the probability that a stream element hashes to a value with exactly R trailing zeros in its binary representation is 2^(−R−1). Therefore, for n distinct elements and a fixed value of R, the expected number of times that exactly R trailing zeros are achieved is 2^(−R−1) · n. Thus, for values of R larger than log2(n), the expected number of such bitstrings falls off exponentially and drops below 1. Of course, in our application, the value of R is not fixed; it is a random variable generated by the outcome of the hash function. It has been rigorously shown that the expected value E[Rmax] of the maximum value of R over all stream elements is logarithmically related to the number of distinct elements as follows:





E[Rmax] = log2(φn), φ = 0.77351.        (12.30)

The standard deviation is σ(Rmax) = 1.12. Therefore, the value of 2^(Rmax)/φ provides an estimate for the number of distinct elements n. To further improve the estimate of Rmax, the following techniques can be used:





  1. Multiple hash functions can be used, and the value of Rmax is averaged over the different hash functions.




  2. The averages are still somewhat susceptible to large variations. Therefore, the “mean–median trick” may be used: the medians of a set of averages are reported. Note that this is similar to the trick used in the AMS sketch. As in that case, a combination of the Chebyshev inequality and the Chernoff bound can be used to establish qualitative guarantees. Both ideas are combined in the sketch below.
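The following is a minimal Python sketch of this estimator, combining multiple hash functions with the mean–median trick. The linear hash family and the parameter choices are illustrative assumptions rather than prescriptions from the text.

import random
import statistics

PHI = 0.77351  # correction constant from Equation 12.30

def trailing_zeros(x, max_bits=64):
    # Number of trailing zeros in the binary representation of x.
    if x == 0:
        return max_bits
    count = 0
    while x & 1 == 0:
        x >>= 1
        count += 1
    return count

def make_hash(seed):
    # A random linear hash function h(x) = (a*x + b) mod p (illustrative).
    p = (1 << 61) - 1
    rng = random.Random(seed)
    a, b = rng.randrange(1, p), rng.randrange(p)
    return lambda x: (a * x + b) % p

def fm_estimate(stream, num_hashes=32, groups=4):
    # Track Rmax for each hash function over the whole stream.
    hashes = [make_hash(seed) for seed in range(num_hashes)]
    r_max = [0] * num_hashes
    for element in stream:
        for i, h in enumerate(hashes):
            r_max[i] = max(r_max[i], trailing_zeros(h(element)))
    # Mean-median trick: average within groups, then take the median.
    size = num_hashes // groups
    averages = [sum(r_max[g * size:(g + 1) * size]) / size
                for g in range(groups)]
    return 2 ** statistics.median(averages) / PHI

# Example: a stream of 5000 integers containing 1000 distinct values.
print(fm_estimate(x % 1000 for x in range(5000)))  # roughly 1000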

It should be pointed out that the Bloom filter can also be used to estimate the number of distinct elements. However, the Bloom filter is not a space-efficient way to count the number of distinct elements when set-membership queries are not required.


12.3 Frequent Pattern Mining in Data Streams


The frequent pattern mining problem in data streams is studied in the context of two different scenarios. The first scenario is the massive-domain scenario, in which the number of possible items is very large. In such cases, even the problem of finding frequent items becomes difficult. Frequent items are also referred to as heavy hitters. The second case is the conventional scenario of a large (but manageable) number of items that fit in main memory. In this case, the frequent item problem is no longer quite as interesting, because the item counts can be maintained directly in an array. One is then more interested in determining frequent patterns. This is a difficult problem, because most frequent pattern mining algorithms require multiple passes over the entire data set, which the one-pass constraint of the streaming scenario does not permit. In the following, two different approaches will be described. The first approach leverages generic synopsis structures in conjunction with traditional frequent pattern mining algorithms, and the second designs streaming versions of frequent pattern mining algorithms.


12.3.1 Leveraging Synopsis Structures


Synopsis structures can be used effectively in most streaming data mining problems, including frequent pattern mining. In the context of frequent pattern mining methods, synopsis structures are particularly attractive because they make it possible to use a wider array of algorithms and to incorporate temporal decay into the frequent pattern mining process.




12.3.1.1 Reservoir Sampling


Reservoir sampling is the most flexible approach for frequent pattern mining in data streams. It can be used either for frequent item mining (in the massive-domain scenario) or for frequent pattern mining. The basic idea in using reservoir sampling is simple:





  1. Maintain a reservoir sample S from the data stream.




  2. Apply a frequent pattern mining algorithm to the reservoir sample S and report the patterns.

It is possible to derive qualitative guarantees on the frequent patterns mined as a function of the sample size. The probability of a pattern being a false positive can be bounded by using the Chernoff bound. By using modestly lower support thresholds, it is also possible to obtain a guaranteed reduction in the number of false negatives. The bibliographic notes contain pointers to such guarantees. Reservoir sampling has several flexibility advantages because of its clean separation of the sampling and the mining processes. Virtually any efficient frequent pattern mining algorithm can be used on the memory-resident reservoir sample. Furthermore, different variations of pattern mining algorithms, such as constrained pattern mining or interesting pattern mining, can be applied as well. Concept drift is also relatively easy to address: the use of a decay-biased reservoir sample with off-the-shelf frequent pattern mining methods translates to a decay-weighted definition of the support.
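As a concrete illustration of this two-step template, the following sketch maintains a reservoir of transactions and then mines the in-memory sample. A brute-force subset count stands in for the off-the-shelf frequent pattern mining algorithm, and all thresholds and names are illustrative.

import random
from collections import Counter
from itertools import combinations

def reservoir_sample(stream, k, seed=0):
    # Classic reservoir sampling: a uniform random sample of k transactions.
    rng = random.Random(seed)
    reservoir = []
    for t, transaction in enumerate(stream):
        if t < k:
            reservoir.append(transaction)
        else:
            j = rng.randrange(t + 1)  # replace a slot with probability k/(t+1)
            if j < k:
                reservoir[j] = transaction
    return reservoir

def frequent_itemsets(sample, min_support, max_size=2):
    # Brute-force support counting over the in-memory sample; any
    # off-the-shelf frequent pattern miner could be used here instead.
    counts = Counter()
    for transaction in sample:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    n = len(sample)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

# Usage: mine a 500-transaction reservoir drawn from the stream.
stream = [("bread", "milk"), ("bread", "butter"), ("milk",)] * 1000
sample = reservoir_sample(iter(stream), k=500)
print(frequent_itemsets(sample, min_support=0.3))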


12.3.1.2 Sketches


Sketches can be used for determining frequent items, though they cannot be used for determining frequent itemsets quite as easily. The core idea is that sketches are generally much better at estimating the counts of more frequent items accurately on a relative basis. This is because the bound on the frequency estimation of any item is an absolute one, in which the error depends on the aggregate frequency of the stream items rather than that of the item itself. This is evident from Lemma 12.2.3. As a result, the frequencies of heavy hitters can generally be estimated more accurately on a relative basis. Both the AMS sketch and the count-min sketch can be used to determine the heavy hitters. The bibliographic notes contain pointers to some of these algorithms.
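The following toy implementation sketches how a count-min sketch can surface heavy hitters, under the assumption that a candidate set of observed items is available for querying; a genuine massive-domain implementation would instead maintain only a small heap of current heavy hitters. The table width and depth and the hash family are illustrative choices.

import random

class CountMinSketch:
    # w counters per row, d rows, one hash function per row; the minimum
    # over the rows gives a one-sided overestimate of the true count.
    def __init__(self, w=272, d=5, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.p = (1 << 61) - 1
        self.hashes = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(d)]
        self.table = [[0] * w for _ in range(d)]

    def _cells(self, x):
        h = hash(x)
        for row, (a, b) in enumerate(self.hashes):
            yield row, ((a * h + b) % self.p) % self.w

    def add(self, x):
        for row, col in self._cells(x):
            self.table[row][col] += 1

    def estimate(self, x):
        return min(self.table[row][col] for row, col in self._cells(x))

# Report items whose estimated frequency exceeds 20 percent of the stream.
sketch, seen, n = CountMinSketch(), set(), 0
for item in ["a", "b", "a", "c", "a", "b"] * 1000:
    sketch.add(item)
    seen.add(item)
    n += 1
print({x for x in seen if sketch.estimate(x) >= 0.2 * n})  # likely {'a', 'b'}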


12.3.2 Lossy Counting Algorithm

The lossy counting algorithm can be used either for frequent item counting or for frequent itemset counting. The approach divides the stream into segments S1 . . . Si . . . such that each segment Si has size w = ⌈1/ε⌉. The parameter ε is a user-defined tolerance on the required accuracy.


First, the easier problem of frequent item mining will be described. The algorithm maintains the frequencies of all the items in an array and increments them as new items arrive. If the number of distinct items is not very large, then one can maintain all the counts and report the frequent ones. The problem arises when the total available space is less than that required to maintain the counts of the distinct items. In such cases, whenever the boundary of a segment Si is reached, infrequent items are dropped. This results in the removal of many items because the vast majority of the items in the stream are infrequent in practice. How does one decide which items should be dropped, to retain a quality bound on the approximation? For this purpose, a decremental trick is used.


Whenever the boundary of a segment Si is reached, the frequency count of every item in the array is decreased by 1. After the decrease, items with zero frequencies are pruned from the array.
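The following is a minimal Python sketch of the frequent-item version of lossy counting as described above. The final reporting rule, which compares each retained count against (s − ε) · n for a support threshold s, follows the standard algorithm; all parameter names and the example stream are illustrative.

import math
from collections import defaultdict

def lossy_count(stream, epsilon=0.01, min_support=0.05):
    # Segment width w = ceil(1/epsilon); any true count is underestimated
    # by at most epsilon * n, where n is the number of items seen so far.
    w = math.ceil(1 / epsilon)
    counts = defaultdict(int)
    n = 0
    for item in stream:
        counts[item] += 1
        n += 1
        if n % w == 0:  # segment boundary: the decremental trick
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    # Standard reporting rule: keep items clearing the lowered threshold.
    return {x: c for x, c in counts.items()
            if c >= (min_support - epsilon) * n}

# Usage: one genuinely frequent item in a long tail of rare items.
stream = ["hot"] * 2000 + [f"rare{i}" for i in range(8000)]
print(lossy_count(iter(stream)))  # the rare items are pruned; 'hot' survives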



