Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə249/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   245   246   247   248   249   250   251   252   ...   423
1-Data Mining tarjima

n




F2 =fi2

(12.25)

i=1




In the massive-domain scenario, where the number of distinct elements is large, this quantity is hard to estimate because running counts of the frequencies fi cannot be maintained with an array. However, it can be estimated effectively using the AMS sketch. As a practical application, the second -order moment yields an estimate of the self-join size of a data stream with respect to the massive-domain attribute. The second-order moment can also be viewed as a variant of the Gini index, which measures the level of frequency skew over different items in the data stream. When the skew is large, the value of F2 is large, and very close to its upper bound ( ni=1 fi)2.


The AMS sketch contains m different sketch components, each of which is associated with an independent hash function. Each hash function generates its corresponding sketch component as follows. A random binary value, with equal probability for both realizations, is generated for the incoming stream element. This binary value is denoted by r ∈ {−1, 1}, and is generated using the hash function for that component. The frequency of each incoming stream element is multiplied by r, and added to the corresponding component of the sketch. Let ri ∈ {−1, 1} be the random value generated by a particular hash function for the ith dis-tinct element. Then, the corresponding component Q of the sketch, for a stream of n distinct elements with aggregate frequencies f1 . . . fn, can be shown to be equal to the following:





n




Q =fi · ri.

(12.26)

i=1

This relationship is because of the incremental way in which Q is updated, each time a stream item is received. Note that the value of Q is a random variable, dependent on how the binary random values r1 . . . rn are generated by the hash function. The value of Q is useful in estimating the second moment.


Lemma 12.2.4 The second moment of the data stream can be estimated by the square of the AMS sketch component Q:



Proof: It is easy to see that Q2 = of hash values ri, rj , we have ri2 = rj2



F2 = E[Q2].




(12.27)




n

+ 2

n


Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   245   246   247   248   249   250   251   252   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin