Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	251/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 247 248 249 250 251 252 253 254 ... 423

1-Data Mining tarjima

P (Y > q/2) = P (Y > (1 + 3) · q/8) = P (Y > (1 + 3)E[Y ]) ≤ e⁻³²^·^ln⁽¹^/δ⁾^/⁸ = δ⁹^/⁸ ≤ δ.

The median can violate the -bound only when more than half the averages violate the bound. The probability of this event is exactly P (Y > q/2). Therefore, the median violates the -bound with probability at most δ.

The AMS sketch can be used to estimate many other values in a similar way, with corre-sponding quality bounds. For example, consider two streams with the sketch components Q_i and R_i.

The dot product between the frequency counts of the items in a pair of streams is estimated as the product of the corresponding sketch components Q_i and R_i. By using

408 CHAPTER 12. MINING DATA STREAMS

the median of O(ln(1/δ)) averages of diﬀerent sets of O(1/ 2) values of Q_i · R _i, it is possible to bound the approximation within 1 ± with probability at least 1 − δ. This estimation can be performed using the count-min sketch as well, though with a diﬀerent bound.

The Euclidean distance between the frequency counts of a pair of streams can be estimated as Q²_i + R_i² − 2Q_i · R_i. The Euclidean distance can be viewed as a linear combination of three diﬀerent dot products (including self-products) between the fre-quency counts of the two streams. Because each dot product is itself bounded using the “mean–median trick” discussed above, the approach can be used to determine similar quality bounds in this case as well.

Like the count-min sketch, the AMS sketch can be used to estimate frequency values. For the jth distinct stream element with frequency f_j , the product of the random variable r_j and Q_i provides an estimate of the frequency.

E[f_j ] = r_j · Q_i.

(12.29)

The mean, median, or mean–median combination of these values over diﬀerent sketch components Q_i can be reported as a robust estimate. The AMS sketch can also be used to identify heavy hitters from the data stream.

Some of the queries resolved by the AMS and count-min sketch are similar, although others are diﬀerent. The bounds provided by the two techniques are also diﬀerent, although none of them is strictly better than the other in all scenarios. The count-min sketch does have the advantage of being intuitively easy to interpret because of its natural hash-table data structure. As a result, it can be more naturally integrated in data mining applications such as clustering and classification in a seamless way.

12.2.2.4 Flajolet–Martin Algorithm for Distinct Element Counting

Sketches are designed for determining stream statistics that are dominated by large aggregate signals of frequent items. However, they are not optimized for estimating stream statistics that are dominated by infrequently occurring items. Problems such as distinct element counting are more directly influenced by the much larger number of infrequent items in a data stream. Distinct element counting can be performed eﬃciently with the Flajolet– Martin algorithm.

The Flajolet–Martin algorithm uses a hash function h(·) to render a mapping from a given element x in the data stream to an integer in the range [0, 2L − 1]. The value of L is selected to be large enough, so that 2L is an upper bound on the number of distinct elements. Usually, the value L is selected to be 64 for implementation convenience, and because the value of 264 is large enough for most practical applications. Therefore, the binary representation of the integer h(x) will have length L. The position4 R of the rightmost 1 bit of the binary representation of the integer h(x) is determined. Thus, the value of R represents the number of trailing zeros in this binary representation. Let R_max be the maximum value of R over all stream elements. The value of R_max can be maintained incrementally in the streaming scenario by applying the hash function to each incoming stream element, determining its rightmost bit, and then updating R_max. The key idea in the Flajolet–Martin algorithm is that the dynamically maintained value of R _max is logarithmically related to the number of distinct elements encountered so far in the stream.

The position of the least significant bit is 0, the next most significant bit is 1, and so on.

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 247 248 249 250 251 252 253 254 ... 423