Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə251/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   247   248   249   250   251   252   253   254   ...   423
1-Data Mining tarjima

P (Y > q/2) = P (Y > (1 + 3) · q/8) = P (Y > (1 + 3)E[Y ]) ≤ e32·ln(1)/8 = δ9/8 ≤ δ.

The median can violate the -bound only when more than half the averages violate the bound. The probability of this event is exactly P (Y > q/2). Therefore, the median violates the -bound with probability at most δ.


The AMS sketch can be used to estimate many other values in a similar way, with corre-sponding quality bounds. For example, consider two streams with the sketch components Qi and Ri.





  1. The dot product between the frequency counts of the items in a pair of streams is estimated as the product of the corresponding sketch components Qi and Ri. By using

408 CHAPTER 12. MINING DATA STREAMS


the median of O(ln(1)) averages of different sets of O(1/ 2) values of Qi · R i, it is possible to bound the approximation within 1 ± with probability at least 1 − δ. This estimation can be performed using the count-min sketch as well, though with a different bound.





  1. The Euclidean distance between the frequency counts of a pair of streams can be estimated as Q2i + Ri2 2Qi · Ri. The Euclidean distance can be viewed as a linear combination of three different dot products (including self-products) between the fre-quency counts of the two streams. Because each dot product is itself bounded using the “mean–median trick” discussed above, the approach can be used to determine similar quality bounds in this case as well.




  1. Like the count-min sketch, the AMS sketch can be used to estimate frequency values. For the jth distinct stream element with frequency fj , the product of the random variable rj and Qi provides an estimate of the frequency.




E[fj ] = rj · Qi.

(12.29)

The mean, median, or mean–median combination of these values over different sketch components Qi can be reported as a robust estimate. The AMS sketch can also be used to identify heavy hitters from the data stream.


Some of the queries resolved by the AMS and count-min sketch are similar, although others are different. The bounds provided by the two techniques are also different, although none of them is strictly better than the other in all scenarios. The count-min sketch does have the advantage of being intuitively easy to interpret because of its natural hash-table data structure. As a result, it can be more naturally integrated in data mining applications such as clustering and classification in a seamless way.


12.2.2.4 Flajolet–Martin Algorithm for Distinct Element Counting


Sketches are designed for determining stream statistics that are dominated by large aggregate signals of frequent items. However, they are not optimized for estimating stream statistics that are dominated by infrequently occurring items. Problems such as distinct element counting are more directly influenced by the much larger number of infrequent items in a data stream. Distinct element counting can be performed efficiently with the Flajolet– Martin algorithm.


The Flajolet–Martin algorithm uses a hash function h(·) to render a mapping from a given element x in the data stream to an integer in the range [0, 2L − 1]. The value of L is selected to be large enough, so that 2L is an upper bound on the number of distinct elements. Usually, the value L is selected to be 64 for implementation convenience, and because the value of 264 is large enough for most practical applications. Therefore, the binary representation of the integer h(x) will have length L. The position4 R of the rightmost 1 bit of the binary representation of the integer h(x) is determined. Thus, the value of R represents the number of trailing zeros in this binary representation. Let Rmax be the maximum value of R over all stream elements. The value of Rmax can be maintained incrementally in the streaming scenario by applying the hash function to each incoming stream element, determining its rightmost bit, and then updating Rmax. The key idea in the Flajolet–Martin algorithm is that the dynamically maintained value of R max is logarithmically related to the number of distinct elements encountered so far in the stream.








  • The position of the least significant bit is 0, the next most significant bit is 1, and so on.


Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   247   248   249   250   251   252   253   254   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin