E(y)
|
≤
|
G(y) +
|
nf · e
|
.
|
(12.23)
|
|
|
|
m
|
|
|
Here, e represents the base of the natural logarithm.
12.2. SYNOPSIS DATA STRUCTURES FOR STREAMS
|
405
|
Proof: The expected number of spurious items hashed to the cells belonging to item y is about3 nf /m, if all spurious items are hashed uniformly at random to the different cells. This result uses pairwise independence of hash functions because it relies on the fact that the mapping of y to a cell does not affect the distribution of another spurious item in its cells. The probability of the number of spurious items exceeding nf · e/m in any of the w cells belonging to y is given by at most e−1 by the Markov inequality. For E(y) to exceed the upper bound of Eq. 12.23, this violation needs to be repeated for all the w cells to which y is mapped. The probability of a violation of Eq. 12.23 is therefore e−w. The result follows.
In many cases, it is desirable to directly control the error level and the error probability
δ. By setting m = e/ and w = ln(1/δ), it is possible to bound the error with a user-defined tolerance nf · and probability at least 1 − δ. Two natural generalizations of the point query can be implemented as follows:
If the stream elements have arbitrary positive frequencies associated with them, the only change required is to the update operation, where the counts are incremented by the relevant frequency. The frequency bound is identical to Eq. 12.23, with nf representing the sum of the frequencies of the stream items.
If the stream elements have either positive or negative frequencies associated with them, then a further change is required to the query procedure. In this case, the median of the counts is reported. The corresponding error bound of Eq. 12.23 now needs to be modified. With a probability of at least 1−e−w/4, the estimated frequency E(y) of item y lies in the following ranges:
Dostları ilə paylaş: |