Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə250/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   246   247   248   249   250   251   252   253   ...   423
1-Data Mining tarjima

n




i=1 fi2ri2

i=1

j=1 fi · fj · ri · rj . For any pair




= 1 and E[ri · rj ] = E[ri] · E[rj ] = 0. The last of

12.2. SYNOPSIS DATA STRUCTURES FOR STREAMS

407

these results uses 2-wise independence, which is implied by 4-wise independence. Therefore,



E[Q2] =

n

= F2.




i=1 fi2




The 4-wise independence can also be used to bound the variance of the estimate (see Exer-cise 16).


Lemma 12.2.5 The variance of the square of a component Q of the AMS sketch is bounded above by twice the frequency moment.





V ar[Q2] 2 · F22

(12.28)

The bound on the variance can be reduced further by averaging over the m different sketch components Q1 . . . Qm. The reduced variance can be used to create a (weak) probabilistic estimate on the quality of the second moment estimate with the Chebychev inequality. This can be tightened further by using a “mean–median combination trick” that is commonly used in such a probabilistic analysis. This trick can be used to robustly estimate a random variable, whenever its variance is no larger than a modest factor of the square of its expected value. This is the case for the random variable Q2.

The mean–median combination trick works as follows. It is desired to establish a bound with probability at least (1 − δ) that the second moment can be estimated to within a multiplicative factor of 1± . Let Q 1 . . . Qm be m different sketch components, each of which is generated using a different hash function. The value of m is chosen to be O(ln(1)/ 2). These m sketch components are further partitioned into O(ln(1)) different groups of size O(1/ 2) each. The sketch values in each group are averaged. The median of these O(ln(1)) averages is reported. A combination of the Chebychev inequality and the Chernoff bounds can be used to show the following result:


Lemma 12.2.6 By selecting the median of O(ln(1)) averages of O(1/ 2) copies of Q2i, it is possible to guarantee the accuracy of the sketch-based second-moment approximation to within 1 ± with a probability of at least 1 − δ.


Proof: According to Lemma 12.2.5, the variance of each sketch component is at most 2·F22. By using the average of 16/ 2 independent sketch components, the variance of the averaged estimate can be reduced to F22 · 2/8. In this case, the Chebychev inequality shows that the -bound is violated by the averaged estimate with probability at most 1/8. Assume that a total of 4 · ln(1) such averaged and independent estimates are available. The random variable Y is defined as the sum of the Bernoulli indicator variables of -bound violations over these q = 4· ln(1 ) averages. The expected value of Y is q/8 = ln(1)/2. The Chernoff bound is used to show the following:





Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   246   247   248   249   250   251   252   253   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin