Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə247/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   243   244   245   246   247   248   249   250   ...   423
1-Data Mining tarjima

E(y)



G(y) +

nf · e

.

(12.23)










m










Here, e represents the base of the natural logarithm.



12.2. SYNOPSIS DATA STRUCTURES FOR STREAMS

405

Proof: The expected number of spurious items hashed to the cells belonging to item y is about3 nf /m, if all spurious items are hashed uniformly at random to the different cells. This result uses pairwise independence of hash functions because it relies on the fact that the mapping of y to a cell does not affect the distribution of another spurious item in its cells. The probability of the number of spurious items exceeding nf · e/m in any of the w cells belonging to y is given by at most e−1 by the Markov inequality. For E(y) to exceed the upper bound of Eq. 12.23, this violation needs to be repeated for all the w cells to which y is mapped. The probability of a violation of Eq. 12.23 is therefore e−w. The result follows.


In many cases, it is desirable to directly control the error level and the error probability




δ. By setting m = e/ and w = ln(1), it is possible to bound the error with a user-defined tolerance nf · and probability at least 1 − δ. Two natural generalizations of the point query can be implemented as follows:



    1. If the stream elements have arbitrary positive frequencies associated with them, the only change required is to the update operation, where the counts are incremented by the relevant frequency. The frequency bound is identical to Eq. 12.23, with nf representing the sum of the frequencies of the stream items.




    1. If the stream elements have either positive or negative frequencies associated with them, then a further change is required to the query procedure. In this case, the median of the counts is reported. The corresponding error bound of Eq. 12.23 now needs to be modified. With a probability of at least 1−e−w/4, the estimated frequency E(y) of item y lies in the following ranges:




G(y)



3nf

· e



E(y)




Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   243   244   245   246   247   248   249   250   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin