G(y) +
|
3nf · e
|
.
|
(12.24)
|
|
|
m
|
|
|
|
|
|
|
m
|
|
|
In this case, nf represents the sum of the absolute frequencies of the incoming items in the data stream. The bounds in this case are much weaker than those for nonnegative elements.
A useful application is to determine the dot product of the frequency counts of the discrete attribute values in two data streams. This has a useful application in estimating the join size on the massive-domain attribute in two data streams. The dot product between the frequency counts of the items in a pair of nonnegative data streams can be estimated by first constructing a count-min sketch representation for each of the two data streams in a separate w × m count-min data structure. The same hash functions are used for both sketches. The dot product of their corresponding count-min arrays for each hash function is computed. The minimum value of the dot product over the w different arrays is reported as the estimation. As in the previous case, this is an overestimate, and an upper bound on the estimate may be obtained with a probability of at least 1 − e−w. The corresponding error tolerance for the upper bound is n1f ·n2f · e/m, where n1f and n2f are the aggregate frequencies of the items in each of the two streams. Other useful queries with the use of the count-min sketch include the determination of quantiles and frequent elements. Frequent elements are also referred to as heavy hitters. The bibliographic notes contain pointers to various queries and applications that can be addressed with the use of the count-min sketch.
It is exactly equal to ns/m, where ns is the frequency of all items other than y. However, ns is less than nf by the frequency of y.
406 CHAPTER 12. MINING DATA STREAMS
12.2.2.3 AMS Sketch
As discussed at the beginning of this section, different synopsis structures are designed for different kinds of queries. While the bloom filter and count-min sketch provide good esti-mations of various queries, some queries, such as second moments, can be better addressed with the Alon–Matias–Szegedy (AMS) sketch. In the AMS sketch, a random binary value is generated from {−1, 1} for each stream element by applying a hash function to the stream element. These binary values are assumed to be 4-wise independent. This means that, if at most four values generated from the same hash function are sampled, they will be statisti-cally independent of one another. It is easier to design a 4-wise independent hash function than a fully independent hash function. The details of 4-wise independent hash functions may be found in the bibliographic notes.
Consider a stream in which the ith stream element is associated with the aggregate frequency fi . The second-order moment F2 of the data stream, for a stream with n distinct elements, is defined as follows:
Dostları ilə paylaş: |